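I am trying to write a program that calculates the GC content of a series of sequences (input in FASTA format) and then returns the name of the sequence with the highest percentage, along with its GC percentage.
I have finally stopped getting error messages, but my code does not seem to do anything. Does anyone know why?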
#Define functions
#Calculate GC percentage
def Percent(sequence):
    G_count = sequence.count ('G')
    C_count = sequence.count ('C')
    Total_count = len(sequence)
    GC_Sum = int(G_count) + int(C_count)
    Percent_GC = GC_Sum / Total_count
    Per_GC = (Percent_GC)*100
    return Per_GC

Input = input ("Input Sequence")

#Fasta file into dictionary
fasta_dictionary = {}
sequence_name = ""
for line in Input:
    line = line.strip()
    if not line:
        continue
    if line.startswith(">"):
        sequence_name = line[1:]
        if sequence_name not in fasta_dictionary:
            fasta_dictionary[sequence_name] = []
        continue
    sequence = line
    fasta_dictionary[sequence_name].append(sequence)

#Put GC values for each sequence into dictionary
dictionary = dict()
for sequence_name in fasta_dictionary:
    dictionary[sequence_name] = float(Percent(sequence))

#Find highest
for sequence_name, sequence in fasta_dictionary.items():
    inverse = [(sequence, sequence_name) for sequence_name, sequence in dictionary.items()]
    highest_GC = max(inverse)[1]

#Find sequence name
for sequence_name, sequence in fasta_dictionary.items():
    if sequence == highest_GC:
        print ((sequence_name) + ' ' + (highest_GC))
Answer 0 (score: 1)
So, Pier Paolo was right to change the first line to with open() and to indent the rest of the code underneath it.
with open('/path/to/your/fasta.fasta', 'r') as Input:
    fasta_dictionary = {}
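A rough illustration of why the original loop appeared to do nothing: input() returns a single string, and iterating over a string yields one character at a time, whereas iterating over an open file yields whole lines. In this sketch io.StringIO stands in for a real file and the sample record is made up.

```python
import io

# What input() hands back: one string. Looping over it gives single characters,
# so no iteration ever sees a complete header or sequence line.
text = ">seq1"
chars = [item for item in text]

# What an open file gives: whole lines. io.StringIO is only a stand-in here
# for the handle returned by open(); the record below is hypothetical.
handle = io.StringIO(">seq1\nGGCC\n")
lines = [line.strip() for line in handle]

print(chars)
print(lines)
```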
He was also right about the grouping; this should help your Percent function:
Percent_GC = float(GC_Sum) / Total_count
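A minimal sketch of the percentage calculation with the division made explicit. The float() cast only matters on Python 2, where / between two integers truncates; on Python 3 the same operator already returns a float. The name gc_percent, the upper-casing, and the empty-sequence guard are my additions, not part of the original code.

```python
def gc_percent(sequence):
    """Return the GC content of a sequence as a percentage (0 to 100)."""
    if not sequence:
        # Assumption: treat an empty sequence as 0% instead of dividing by zero.
        return 0.0
    sequence = sequence.upper()  # assumption: count lower-case bases as well
    gc_sum = sequence.count('G') + sequence.count('C')
    # Multiplying by 100.0 first forces float division on Python 2 and 3 alike.
    return 100.0 * gc_sum / len(sequence)

print(gc_percent("ATGC"))
```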
Don't append; just assign sequence as a string.
sequence = line
fasta_dictionary[sequence_name] = sequence
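One caveat with plain assignment: if a record spans several lines, only the last line survives. A sketch that keeps each sequence as a single string but concatenates continuation lines could look like this. read_fasta is a hypothetical helper name, and the sample records are made up.

```python
def read_fasta(lines):
    """Map each record name to its full sequence, stored as one string."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records.setdefault(name, "")
        elif name is not None:
            # Concatenate so that multi-line records are kept whole.
            records[name] += line
    return records

# Any iterable of lines works, including an open file handle.
example = [">seq1", "ATGC", "GGCC", ">seq2", "ATAT"]
records = read_fasta(example)
print(records)
```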
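Next, when you call the Percent function you have already left the for loop, yet you are passing sequence, the variable you redefined on every iteration of that loop. You stored the sequences in a dictionary called fasta_dictionary, so change this code: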
for sequence_name in fasta_dictionary:
    dictionary[sequence_name] = float(Percent(fasta_dictionary[sequence_name]))
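To see the effect of the stale variable, here is a small sketch with made-up data. After the parsing loop ends, sequence still holds only the last record, so every name gets the same value unless you look the sequence up in the dictionary. The lower-case percent function is a simplified stand-in for Percent.

```python
def percent(sequence):
    # Simplified stand-in for the Percent function from the question.
    return 100.0 * (sequence.count("G") + sequence.count("C")) / len(sequence)

fasta_dictionary = {"seq1": "GGCC", "seq2": "ATAT"}  # hypothetical records

# Simulate the parsing loop: when it finishes, `sequence` is the last record.
for sequence in fasta_dictionary.values():
    pass

# Bug: every name is scored against the same leftover `sequence`.
stale = {name: percent(sequence) for name in fasta_dictionary}

# Fix: look up each record's own sequence by name.
fixed = {name: percent(fasta_dictionary[name]) for name in fasta_dictionary}

print(stale)
print(fixed)
```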
Finally, at the very end, you are checking if sequence == highest_GC:. This is what you are currently comparing:
for sequence_name, sequence in fasta_dictionary.items():
    print(sequence)
This prints a str, the actual sequence data:
'ATTGCGCTANANAGCTANANCGATAGANCACGATNGAGATAGACTATAGC'
whereas highest_GC is the "name" of the sequence:
>sequence1
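Change it to read if sequence_name == highest_GC.
Running the code with the changes above consistently prints the name of the sequence with the highest GC content %. There are plenty of other unnecessary steps and repeated code, but hopefully this gets you started. Good luck!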
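If you later want to trim the extra steps, one possible shape is to let max() pick the name directly with a key function, which removes the inverted list and the final lookup loop. This is only a sketch: gc_percent and highest_gc are hypothetical names, and the records are made up.

```python
def gc_percent(sequence):
    """GC content as a percentage; returns 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    return 100.0 * (sequence.count("G") + sequence.count("C")) / len(sequence)

def highest_gc(records):
    """Return (name, percentage) for the record with the highest GC content."""
    # max() over the dictionary iterates its keys; the key function ranks
    # each name by the GC percentage of its sequence.
    name = max(records, key=lambda n: gc_percent(records[n]))
    return name, gc_percent(records[name])

records = {"seq1": "ATATGC", "seq2": "GGCCAT", "seq3": "ATAT"}  # hypothetical
best_name, best_gc = highest_gc(records)
print("{0} {1:.2f}".format(best_name, best_gc))
```

Note that max() raises ValueError on an empty dictionary, so a real script would want to check that at least one record was read.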
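Answer 1 (score: 0)
Another solution to the GC problem is to use Counter, a higher-level data structure from Python's collections module. It sets up and tallies your nucleotides for you automatically, so you can ask for the counts directly and compute the result as follows: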
from collections import Counter
#set a var to hold your dna
myDna = ''
#open your Dna fasta
with open('myFasta', 'r') as data:
    for line in data:
        if '>' in line:
            continue
        myDna += line.strip()
#Now count your dna
myNucleotideCounts = Counter(myDna)
#calculate GC content
myGC = (myNucleotideCounts['G'] + myNucleotideCounts['C']) / float(len(myDna))
print('Dna GC Content = {0}'.format(myGC))
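Note that the snippet above pools every sequence in the file into one string and reports a fraction between 0 and 1, not a percentage. If you need one value per record, as in the question, the same Counter idea can be applied to each sequence separately. A sketch, assuming the records are already in a name-to-string dictionary; gc_fraction and the sample data are hypothetical.

```python
from collections import Counter

def gc_fraction(sequence):
    """GC content of one sequence as a fraction between 0 and 1."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Counter returns 0 for missing keys, so no KeyError if G or C is absent.
    return (counts["G"] + counts["C"]) / float(total)

records = {"seq1": "ATGC", "seq2": "GGGC"}  # hypothetical records
fractions = {name: gc_fraction(seq) for name, seq in records.items()}
print(fractions)
```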