Computing GC Content

uwl · (This post was last modified: Jun-25-2023, 10:06 PM by uwl.)

I have been trying to find the solution to a programming problem which asks me to find the GC content. The example online displays:

Sample Dataset
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output
Rosalind_0808
60.919540

My code runs properly and I already tried entering the solutions I find with the correct amount of decimal place values.

from collections import Counter
myDna = ''
with open('rosalind_gc.txt', 'r') as data:
     for line in data:
          if '>' in line:
               continue
          myDna += line.strip()
myNucleotideCounts = Counter(myDna)
myGC = (myNucleotideCounts['G'] + myNucleotideCounts['C']) / float(len(myDna))
print('Dna GC Content = {0}'.format(myGC))

Would someone be able to provide suggestions as to what might be the problem and how to fix this in the above code?

Pedroski55 · Jun-26-2023, 12:03 AM

I'm not familiar with collections, so I didn't use that module here.

Apparently, as I've been told here, creating a string myDNA = '' and then appending to it actually creates many strings, because a string is immutable, hangs around in memory. So it is better to use a list, then join the list!

Anyway, this may give you some ideas:

#! /usr/bin/python3
import random

# play God, make DNA, make a Python!
def makeDNA():
    myDNA = []
    nucleotides = ['A', 'C', 'G', 'T']
    length = input('How long do you want your DNA? Just enter a number. ')
    num = int(length)
    for i in range(0, num):
        choice = random.choice(nucleotides)
        myDNA.append(choice)
    mystring = ''.join(myDNA)
    return mystring

def countNucs(DNA_sample):
    countC = 0
    countG = 0
    for nuc in DNA_sample:    
        if nuc == 'C':
            countC +=1
        elif nuc == 'G':
                countG +=1
    print('Cytosine was found', countC, 'times in this sample of DNA.')
    print('Guanine was found', countG, 'times in this sample of DNA.')

if __name__ == "__main__":
    DNA_sample = makeDNA()
    countNucs(DNA_sample)

bowlofred · (This post was last modified: Jun-26-2023, 04:09 PM by bowlofred.)

(Jun-25-2023, 10:04 PM)uwl Wrote: which asks me to find the GC content

Is that the entire specification of the problem? It looks to me like it wants you to examine the GC content of different files and only print the filename and GC content in some situations (perhaps with a GC content greater than some threshold?)

Your program is calculating a single value for all lines in the file. It doesn't display the fiiename like the sample output does.

Pedroski55 · Jun-26-2023, 09:37 PM

It occurred to me this morning that you may want to count the number of times the sequence GC appears in the DNA sample.

def countGC(DNA):
    countGC = 0    
    for i in range(len(DNA)):    
        if DNA[i] == 'G' and DNA[i+1] == 'C':
            countGC +=1        
    print('The sequence GC was found', countGC, 'times in this sample of DNA.')

For a DNA sample 501 long that gave:

Quote:countGC(DNA_sample)
The sequence GC was found 31 times in this sample of DNA.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Matlab to Python -- Parallel Computing	zistambo	1	3,332	Jun-10-2020, 04:59 PM Last Post: pyzyx3qwerty
	Computing correlation in audio files	ryanblumenow	0	4,320	Jan-15-2020, 06:11 PM Last Post: ryanblumenow
	computing entropy using pickle files	baran01	2	3,994	Dec-30-2019, 09:45 PM Last Post: micseydel

Computing GC Content

User Panel Messages

Announcements