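I am trying to read DNA sequences from an input file and use a loop to count the number of individual A's, T's, C's and G's in each one. If a sequence contains any letter other than "ATCG", I need to print "Error". For example, my input file is: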
SEQ1 AAAGCGT SEQ2 aa tGcGt t SEQ3 af GtgA cCTg
The code I came up with is:
acount = 0
ccount = 0
gcount = 0
tcount = 0
for line in input:
    line=line.strip('\n')
    if line[0] == ">":
        print line + "\n"
        output.write(line+"\n")
    else:
        line=line.upper()
        list=line.split()
        for list in line:
            if list == "A":
                acount = acount + 1
                #print acount
            elif list == "C":
                ccount = ccount + 1
                #print ccount
            elif list == "T":
                tcount = tcount + 1
                #print tcount
            elif list == "G":
                gcount=gcount +1
                #print gcount
            elif list != 'A'or 'T' or 'G' or 'C':
                break
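So I need the totals for each line, but my code gives me the sums of A's, T's, etc. for the whole file. I would like my output to look something like
SEQ1: Total A:3 Total C: etc. for each sequence.
What can I do to fix my code to achieve this?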
Answer 0 (score: 0)
I would suggest something along these lines:
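import re

def countNucleotides(filePath):
    aCount = []
    gCount = []
    cCount = []
    tCount = []
    # Open in text mode so each line is a str that the regex can match.
    with open(filePath, 'r') as data:
        for line in data:
            # Drop all whitespace (including the newline) and normalise case.
            line = ''.join(line.split()).lower()
            # Stop at the first line that is not made up solely of a, g, c, t.
            if not re.match(r'^[agct]+$', line):
                break
            aCount.append(notCount(line, 'a'))
            gCount.append(notCount(line, 'g'))
            cCount.append(notCount(line, 'c'))
            tCount.append(notCount(line, 't'))
    # One entry per line read, in file order.
    return aCount, gCount, cCount, tCount

def notCount(line, character):
    # Count how many times `character` appears in `line`.
    appearances = 0
    for item in line:
        if item == character:
            appearances += 1
    return appearances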
You can print them after that.
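For the per-sequence output asked about in the question, one possible sketch (Python 3) is below. The question's counters are set to 0 once before the loop and never reset, which is why its totals cover the whole file; here a fresh `collections.Counter` is started at every header instead. This assumes header lines start with ">" as in the question's code. `count_per_sequence` and `format_result` are names made up for this illustration, and the output wording is only a guess at the desired format.

```python
from collections import Counter

VALID = set("ACGT")

def count_per_sequence(lines):
    """Map each header to a Counter of its bases, or to None if it has invalid letters."""
    results = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):          # assumption: headers look like ">SEQ1"
            name = line[1:].strip()
            results[name] = Counter()     # fresh counts for every sequence
            continue
        if name is None or results[name] is None:
            continue                      # no header yet, or sequence already invalid
        bases = "".join(line.split()).upper()   # drop inner spaces, ignore case
        if set(bases) - VALID:
            results[name] = None          # any non-ACGT letter marks the sequence as an error
        else:
            results[name].update(bases)
    return results

def format_result(name, counts):
    """Build one output line for a sequence (hypothetical format)."""
    if counts is None:
        return f"{name}: Error"
    return (f"{name}: Total A: {counts['A']} Total C: {counts['C']} "
            f"Total G: {counts['G']} Total T: {counts['T']}")

sample = [">SEQ1", "AAAGCGT", ">SEQ2", "aa tGcGt t", ">SEQ3", "af GtgA cCTg"]
for seq_name, seq_counts in count_per_sequence(sample).items():
    print(format_result(seq_name, seq_counts))
# SEQ1: Total A: 3 Total C: 1 Total G: 2 Total T: 1
# SEQ2: Total A: 2 Total C: 1 Total G: 2 Total T: 3
# SEQ3: Error
```

To run it on a real file, pass the open file object in place of `sample`, for example `count_per_sequence(open(path))` with `path` being your own input file.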