Python GC counter - Rosalind

Time: 2016-01-31 18:58:43

Tags: python bioinformatics dna-sequence

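I am trying to write a program that calculates the GC content of a series of sequences (given in FASTA format) and then returns the name of the sequence with the highest GC percentage together with that percentage, as described in a Rosalind problem.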

I have finally stopped getting error messages, but my code does not seem to do anything. Does anyone know why that might be?

#Define functions
#Calculate GC percentage 
def Percent(sequence):
    G_count = sequence.count ('G')
    C_count = sequence.count ('C')
    Total_count = len(sequence)
    GC_Sum = int(G_count) + int(C_count)
    Percent_GC = GC_Sum / Total_count
    Per_GC = (Percent_GC)*100
    return Per_GC

Input = input ("Input Sequence")

#Fasta file into dictionary
fasta_dictionary = {}
sequence_name = ""
for line in Input:
    line = line.strip()
    if not line:
        continue
    if line.startswith(">"):
        sequence_name = line[1:]
        if sequence_name not in fasta_dictionary:
            fasta_dictionary[sequence_name] = []
        continue
    sequence = line
    fasta_dictionary[sequence_name].append(sequence)

#Put GC values for each sequence into dictionary
dictionary = dict()
for sequence_name in fasta_dictionary:
    dictionary[sequence_name] = float(Percent(sequence))

#Find highest
for sequence_name, sequence in fasta_dictionary.items():
    inverse = [(sequence, sequence_name) for sequence_name, sequence in dictionary.items()]
    highest_GC = max(inverse)[1]  

#Find sequence name
for sequence_name, sequence in fasta_dictionary.items():
        if sequence == highest_GC:
            print ((sequence_name) + ' ' + (highest_GC))

2 answers:

Answer 0 (score: 1)

So, Pier Paolo was right to change the first line to with open() and to indent the rest of the code beneath it.

with open('/path/to/your/fasta.fasta', 'r') as Input:
   fasta_dictionary = {}
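
To see why the original loop appears to do nothing: input() returns a single str, and iterating over a str yields one character at a time, so no "line" ever holds a complete header or sequence. Below is a minimal sketch of the difference. The two-record FASTA text is made up for illustration, and io.StringIO stands in for the handle that open() would return.

```python
import io

# Hypothetical two-record FASTA text, used only for illustration.
fasta_text = ">seq1\nGGCC\n>seq2\nATAT\n"

# Iterating over a str yields single characters. This is what
# `for line in Input` does when Input comes from input().
first_pieces = [piece for piece in fasta_text][:3]

# Iterating over a file object yields whole lines. io.StringIO is a
# stand-in for the handle returned by open().
with io.StringIO(fasta_text) as handle:
    lines = [line.strip() for line in handle]

print(first_pieces)
print(lines)
```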

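He is also right about the division, which should help your Percent function: under Python 2, dividing one integer by another truncates the result, so convert one operand to float first. Percent_GC = float(GC_Sum) / Total_count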
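
As a rough sketch of that point, here is a compact version of the percentage calculation. The name gc_percent and the empty-input guard are my own additions, not part of the original code. On Python 3 the / operator already performs true division, so float() only matters on Python 2; the // line shows the truncating behaviour that caused the trouble.

```python
def gc_percent(sequence):
    """Return the GC content of `sequence` as a percentage.

    Hypothetical helper: returns 0.0 for an empty sequence instead of
    raising ZeroDivisionError.
    """
    if not sequence:
        return 0.0
    gc_sum = sequence.count("G") + sequence.count("C")
    # float() forces true division on Python 2; Python 3 does this anyway.
    return float(gc_sum) / len(sequence) * 100


half = gc_percent("GCAT")   # 2 of 4 bases are G or C
truncated = 2 // 4          # floor division, like int / int on Python 2

print(half)
print(truncated)
```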

Don't append; just assign sequence as a string.

sequence = line
fasta_dictionary[sequence_name] = sequence
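
One caveat, assuming your FASTA records can wrap over several lines: plain assignment keeps only the last line of each record, because every new line overwrites the previous one. A sketch that concatenates instead is shown here. The function name read_fasta and the sample data are hypothetical, and io.StringIO stands in for a real file handle.

```python
import io


def read_fasta(handle):
    """Map each record name to its full sequence, joining wrapped lines."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new, empty record.
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # Sequence line: extend the current record.
            records[name] += line
    return records


# Made-up input in which seq1 is wrapped over two lines.
sample = io.StringIO(">seq1\nGGCC\nAT\n>seq2\nATAT\n")
fasta_dictionary = read_fasta(sample)
print(fasta_dictionary)
```

With a real file, the same function would be called as read_fasta(Input) inside the with open(...) block.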

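Next, when you call the Percent function you pass it sequence, but by then you have already left the for loop in which each sequence was assigned, so the variable only holds the last line that was read. You stored the sequences in a dictionary called fasta_dictionary, so look them up there instead and change this code.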

for sequence_name in fasta_dictionary:
        dictionary[sequence_name] = float(Percent(fasta_dictionary[sequence_name]))
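
The same loop can be written as a dict comprehension that looks up each sequence by name. This is only a sketch: the lowercase percent function mirrors Percent from the question, and the three records are invented so the numbers are easy to check by hand.

```python
def percent(sequence):
    """GC percentage of a non-empty sequence (mirrors Percent above)."""
    gc_sum = sequence.count("G") + sequence.count("C")
    return float(gc_sum) / len(sequence) * 100


# Invented records; in the real program this comes from the FASTA file.
fasta_dictionary = {"seq1": "GGCC", "seq2": "ATGC", "seq3": "ATAT"}

# Each name is paired with the GC percentage of its own sequence,
# rather than with whatever `sequence` happened to hold last.
dictionary = {name: percent(seq) for name, seq in fasta_dictionary.items()}
print(dictionary)
```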

Finally, at the end, you are checking if sequence == highest_GC:. This is what you are currently comparing against:

for sequence_name, sequence in fasta_dictionary.items():
    print(sequence)

That prints the actual sequence data, a str:

'ATTGCGCTANANAGCTANANCGATAGANCACGATNGAGATAGACTATAGC'

whereas highest_GC is the "name" of the sequence:

>sequence1

Change it to read if sequence_name == highest_GC

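Running the code with the changes above always prints the name of the sequence with the highest GC content %. There are still plenty of unnecessary steps and duplicated code, but hopefully this gets you started. Good luck!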
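
As one possible way to trim those extra steps, the inverse list and the final lookup loop can be replaced by a single max() call with a key function. This is a sketch under assumptions: the percentages are invented, and the output format (name, space, six decimals) is my own choice rather than something the problem statement here specifies.

```python
# Invented GC percentages keyed by sequence name.
dictionary = {"seq1": 50.0, "seq2": 75.0, "seq3": 25.0}

# max() walks the keys and compares them by their stored percentage.
highest_GC = max(dictionary, key=dictionary.get)

# Format the float explicitly; str + float would raise a TypeError.
print("{0} {1:.6f}".format(highest_GC, dictionary[highest_GC]))
```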

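Answer 1 (score: 0)

Another solution to the GC problem is to use Counter, a high-level data structure in Python. It tallies your nucleotides for you automatically, so you can ask for the counts directly and do the calculation like this: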

from collections import Counter

#set a var to hold your dna
myDna = ''
#open your Dna fasta
with open('myFasta', 'r') as data:
    for line in data:
        if '>' in line:
            continue
        myDna += line.strip()

#Now count your dna
myNucleotideCounts = Counter(myDna)

#calculate GC content
myGC = (myNucleotideCounts['G'] + myNucleotideCounts['C']) / float(len(myDna))

print('Dna GC Content = {0}'.format(myGC))
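
Note that the snippet above pools every record in the file into one string and reports a fraction rather than a percentage. If you need one value per record, as in the question, a Counter-based sketch could look like the following. The names gc_fraction and records and the sample sequences are hypothetical.

```python
from collections import Counter


def gc_fraction(sequence):
    """Fraction of G and C bases in `sequence` (0.0 for empty input)."""
    counts = Counter(sequence)
    total = len(sequence)
    if total == 0:
        return 0.0
    # Counter returns 0 for missing keys, so absent bases are safe.
    return (counts["G"] + counts["C"]) / float(total)


# Invented records standing in for a parsed FASTA file.
records = {"seq1": "GGCC", "seq2": "GCAT", "seq3": "ATAT"}
per_record = {name: gc_fraction(seq) for name, seq in records.items()}
print(per_record)
```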