Apr-14-2019, 07:42 AM
I am using biopython for dna sequences. I want to make an algorithm which searches a big motif (about 1000 letters). Because it is impossible to find this 1000-letters motif, I accept also motifs which have 30% error. So I combine every 1000 letters chromosome subsequence of the whole dna sequence, with this motif and I compute how many letters are different. The problem is that the algorithm is too slow. It needs about one day to run. I work in the binary purines-pyrimidines alphabet. Generally speaking I think that python is too slow to deal with big strings or lists about 1 million or billion letters lenght. Why?
I am running it in linux terminal, because I want the maximum speed. Is really C faster than python?
Here is my code:
I am running it in linux terminal, because I want the maximum speed. Is really C faster than python?
Here is my code:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
for seq_record in SeqIO.parse("CM000665.fasta", "fasta"):
new_str=str(seq_record.seq).replace("A","1");
new_str=new_str.replace("G","1");
new_str=new_str.replace("C","0");
new_str=new_str.replace("T","0");
# Big Erdos Input
big_Erdos1=list("01101001101101001001101001001101001101101011001101000101101001001111001001101001101100001101101011001101010001101001101101001101001000101101001101101001001101101001101001110101000011101001001110001101101001001110010101001011101101001001101001101101001001100111000111010001101001100101001011101001101101001101001001101001100101101010010101101001101001001101010110101011000101001111001000110101101001101001001101001111001000101111001001101101001101001001101001101101001101001001101101000011101001100111000101101001011000101101101001001100111101001001001111001001110001101101001100101001001101101001001011101101001001100101101101001001001101101001110001101001101100011100011011000011100101101001001101001101101011001100110000111011001100001101110011000111000010101101101101010000111001101001101101001001001101001101101001100101001010101111000101011001001010101101011100101001001011100101110001101001001101100110001101011001101000101101001011001101001101001011101001100101011100101001101101001001011000101101011001101001110100001100101101011001101001101011001101000101101001110000101011001101001101101101001001001111000110001010110101110010110010100101011000111011")
big_Erdos2=""
for i in range(0,len(big_Erdos1)):
if big_Erdos1[i]=="0":
big_Erdos2+="1"
else:
big_Erdos2+="0"
big_Erdos2=list(big_Erdos2)
chromosome=list(new_str)
# Search big Erdos erratta
big_erdos_number=0
file = open("results30.txt", "w")
file.write("Erdos block positions:")
for i in range(0,len(chromosome)-len(big_Erdos1)):
S1=0
S2=0
for j in range(0,len(big_Erdos1)):
if chromosome[j+i]!=big_Erdos1[j]:
S1+=1
for j in range(0,len(big_Erdos2)):
if chromosome[j+i]!=big_Erdos2[j]:
S2+=1
error1=S1/len(big_Erdos1)
error2=S2/len(big_Erdos2)
#print(error)
if error1<0.3 or error2<0.3:
big_erdos_number+=1
print("erdos block position:",i)
file.write("%d,"%i)
#print("progress:",i/(len(chromosome)-len(big_Erdos1))*100,"%")
file.write("\n")
file.write("Big erdos number:%d"%big_erdos_number)
print("Big erdos number:",big_erdos_number)
file.close()
