To make an algorithm work faster

pianistseb · (This post was last modified: Mar-28-2019, 06:58 PM by pianistseb.)

I am using Biopython to work with DNA sequences, and I am new to this library. I have a .fasta file in the 4-letter DNA alphabet, and I want to convert it to a 2-letter binary code for purines and pyrimidines. I merge all the segments/records of the .fasta file into one sequence, full_seq, in the 4-letter alphabet. Then I have to convert it into the 2-letter alphabet as new_seq. And here is the problem: the conversion takes hours to run. The sequence length is 119,750,280, so it is a very long sequence. Any ideas to make my program run faster?

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# merge all the records

full_seq=Seq("")

for seq_record in SeqIO.parse("OMOK01.fasta", "fasta"):
    full_seq+=seq_record.seq

# convert the 4-letters alphabet into binary alphabet

new_seq=Seq("")

for i in range(0,len(full_seq)):
    if (full_seq[i]=="A") or (full_seq[i]=="G"):
        new_seq+=Seq("-")
    else:
        new_seq+=Seq("+")

print("Binary sequence", repr(new_seq))

woooee · (This post was last modified: Mar-28-2019, 07:41 PM by woooee.)

See if this helps. Your code has to find the offset in the list each time, so if the offset is 10,000, it has to start at the beginning of the list and move forward to record 10,000, and then do it all over again for 10,001. This is not terrible for 10,000 records, but you have millions, so it does have an effect. The other option is to break full_seq into smaller chunks and then combine the resulting lists.

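pieces = []
for rec in full_seq:  ## assumes full_seq is iterable
    pieces.append("-" if rec.startswith(("A", "G")) else "+")
new_seq = "".join(pieces)  # join once instead of using += in the loop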
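
The chunking idea could look roughly like the sketch below. It assumes a plain Python string rather than a Bio.Seq object, and the helper name to_binary and the chunk size are made up for illustration. Each chunk is converted on its own and the pieces are joined once at the end, so no huge sequence is grown with += inside the loop.

```python
def to_binary(sequence, chunk_size=1_000_000):
    """Map A/G to '-' and every other letter to '+', one chunk at a time."""
    pieces = []
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size]
        # Convert the chunk letter by letter and store it as one string.
        pieces.append("".join("-" if base in "AG" else "+" for base in chunk))
    # A single join at the end instead of repeated concatenation.
    return "".join(pieces)


# Tiny made-up sequence, with a small chunk size just to exercise the loop.
print(to_binary("AGCTTGCA", chunk_size=3))  # prints --+++-+-
```

The mapping follows the code in the first post (A and G become "-", anything else becomes "+").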

pianistseb · Apr-01-2019, 07:54 AM

I finally found that a very fast way to do it is something like this:

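parts = []
for seq_record in SeqIO.parse("OMOK01.fasta", "fasta"):
    new_str = str(seq_record.seq).replace("A", "+")
    new_str = new_str.replace("G", "+")
    new_str = new_str.replace("C", "-")
    new_str = new_str.replace("T", "-")
    parts.append(new_str)  # keep each converted record
new_seq = "".join(parts)  # merged binary sequence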

Gribouillis · (This post was last modified: Apr-01-2019, 08:47 AM by Gribouillis.)

You can also try

table = {ord(k): ord(v) for k, v in {'A': '+', 'G': '+', 'C': '-', 'T': '-'}.items()}
new_str = new_str.translate(table)
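
For what it's worth, the built-in str.maketrans can build the same kind of table as the dict comprehension above. A small sketch on a made-up sample string (not data from the .fasta file):

```python
# Two equal-length strings: each letter in the first maps to the letter
# at the same position in the second.
table = str.maketrans("AGCT", "++--")

sample = "AGCTTGCA"  # made-up sequence for illustration
print(sample.translate(table))  # prints ++---+-+
```

Letters that are not in the table (an N, for example) pass through translate unchanged.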

or

import re
table = {'A': '+', 'G': '+', 'C': '-', 'T': '-'}
new_str = re.sub(r'[AGCT]', lambda m: table[m.group()], new_str)
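
To see which variant is fastest on your machine, something like the following rough timeit sketch may help. The test string is synthetic and the function names are made up for illustration; timings will vary, so treat the numbers as a hint only.

```python
import re
import timeit

seq = "AGCT" * 50_000  # synthetic 200,000-letter sequence, not real data

trans_table = str.maketrans("AGCT", "++--")
sub_table = {"A": "+", "G": "+", "C": "-", "T": "-"}


def with_replace(s):
    return s.replace("A", "+").replace("G", "+").replace("C", "-").replace("T", "-")


def with_translate(s):
    return s.translate(trans_table)


def with_re(s):
    return re.sub(r"[AGCT]", lambda m: sub_table[m.group()], s)


# All three approaches should give the same result.
assert with_replace(seq) == with_translate(seq) == with_re(seq)

for func in (with_replace, with_translate, with_re):
    seconds = timeit.timeit(lambda: func(seq), number=3)
    print(f"{func.__name__}: {seconds:.3f} s for 3 runs")
```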
