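So the problem basically gives me 19 DNA sequences and wants me to make a basic text table. The first column has to be the sequence ID, the second the length of the sequence, the third the number of "A"s, the fourth "G"s, the fifth "C"s, the sixth "T"s, the seventh the %GC, and the eighth whether or not the sequence contains "TGA". Then I take all these values and write a table to "dna_stats.txt".
Here is my code: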
fh = open("dna.fasta","r")
Acount = 0
Ccount = 0
Gcount = 0
Tcount = 0
seq=0
alllines = fh.readlines()
for line in alllines:
    if line.startswith(">"):
        seq+=1
        continue
    Acount+=line.count("A")
    Ccount+=line.count("C")
    Gcount+=line.count("G")
    Tcount+=line.count("T")
    genomeSize=Acount+Gcount+Ccount+Tcount
    percentGC=(Gcount+Ccount)*100.00/genomeSize
    print "sequence", seq
    print "Length of Sequence",len(line)
    print Acount,Ccount,Gcount,Tcount
    print "Percent of GC","%.2f"%(percentGC)
    if "TGA" in line:
        print "Yes"
    else:
        print "No"
    fh2 = open("dna_stats.txt","w")
    for line in alllines:
        splitlines = line.split()
        lenstr=str(len(line))
        seqstr = str(seq)
        fh2.write(seqstr+"\t"+lenstr+"\n")
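I found out that you have to convert the variables to strings. When I print to the terminal, all the values are calculated correctly. However, my first column only contains 19, when it should be 1, 2, 3, 4, 5, etc., one for each sequence. I tried other variables and only got the totals for the whole file. I have started on the table but haven't finished it.
So my biggest problem is that I don't know how to get the variable values for each specific line.
I'm new to Python and programming, so any tips, tricks, or anything else would help.
I'm using Python 2.7.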
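Answer 0 (score: 1)
Well, your biggest problem is:
for line in alllines: #1
    ...
    fh2 = open("dna_stats.txt","w")
    for line in alllines: #2
        ....
Indentation matters. This says "for each line (#1), open a file and then loop over every line again (#2)...". Because the file is reopened in "w" mode on every pass, its contents are overwritten each time, and only the final value of seq (19) survives.
Unindent the open() call and the second loop so they run once, after the first loop has finished.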
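Unindenting alone still writes the same final seq on every row, because seq is a single running counter. A minimal sketch of one way to restructure it, assuming you collect one row per sequence in the first loop and only write the file afterwards. The helper names per_sequence_rows and write_table are made up for illustration, and only the ID and length columns are shown:

```python
def per_sequence_rows(lines):
    """Return one (seq_id, length) tuple per FASTA record in `lines`."""
    rows = []
    seq = 0
    length = 0
    for line in lines:
        if line.startswith(">"):
            if seq:                  # a previous record just ended: store it
                rows.append((seq, length))
            seq += 1
            length = 0               # reset the per-record counter
            continue
        length += len(line.strip())  # ignore the trailing newline
    if seq:                          # store the last record as well
        rows.append((seq, length))
    return rows


def write_table(rows, path):
    # The file is opened once, after the counting loop, not inside it.
    with open(path, "w") as out:
        for seq, length in rows:
            out.write("%d\t%d\n" % (seq, length))


# Hypothetical sample data standing in for fh.readlines()
sample = [">a\n", "ACGT\n", "AC\n", ">b\n", "TGA\n"]
rows = per_sequence_rows(sample)
```

The same reset-on-header idea extends to the A/C/G/T counters: zero them whenever a ">" line starts a new record, and store them in the row before moving on.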
Answer 1 (score: 0)
This puts the information in a dictionary and allows a DNA sequence to span multiple lines:
from __future__ import division  # make 1/2 evaluate to 0.5 rather than 0 on Python 2
from collections import defaultdict

fh = open("dna.fasta", "r")
alllines = fh.readlines()
fh.close()

seq = 0
data = dict()
for line in alllines:
    if line.startswith(">"):
        seq += 1
        # Missing keys default to 0, so += works without initializing each counter.
        data[seq] = defaultdict(int)
        data[seq]['seq'] = seq
        previous_line_end = ""  # "TGA" might be split across two lines
        continue
    line = line.strip()  # drop the newline so it does not interfere with the checks below
    data[seq]['Acount'] += line.count("A")
    data[seq]['Ccount'] += line.count("C")
    data[seq]['Gcount'] += line.count("G")
    data[seq]['Tcount'] += line.count("T")
    # Join the tail of the previous line to the head of this one to catch a split "TGA".
    line_over = previous_line_end + line[:2]
    data[seq]['hasTGA'] = data[seq]['hasTGA'] or ("TGA" in line) or ("TGA" in line_over)
    previous_line_end = (previous_line_end + line)[-2:]  # keep the last two bases for the next line

fh2 = open("dna_stats.txt", "w")
# Column order: ID, length, A, G, C, T, %GC, contains TGA
s = '%(seq)d, %(genomeSize)d, %(Acount)d, %(Gcount)d, %(Ccount)d, %(Tcount)d, %(percentGC).2f, %(hasTGA)s\n'
for seq in sorted(data):
    d = data[seq]
    # Compute the length once from the final counts rather than accumulating running totals.
    d['genomeSize'] = d['Acount'] + d['Gcount'] + d['Ccount'] + d['Tcount']
    d['percentGC'] = (d['Gcount'] + d['Ccount']) * 100.00 / d['genomeSize']
    fh2.write(s % d)
fh2.close()
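If the sequences are small enough to hold in memory, an alternative is to join each record's lines into one string first, so a "TGA" that straddles a line break needs no special handling. This is only a sketch under that assumption. The names read_fasta and sequence_stats are invented here, it uses the header text as the ID, writes tab-separated columns, and assumes uppercase bases:

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs, joining multi-line sequences."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:       # emit the record that just ended
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:               # emit the final record
        yield name, "".join(chunks)


def sequence_stats(name, sequence):
    """Return one tab-separated table row for a complete sequence."""
    a, c, g, t = (sequence.count(base) for base in "ACGT")
    total = a + c + g + t
    percent_gc = (g + c) * 100.0 / total if total else 0.0
    has_tga = "Yes" if "TGA" in sequence else "No"
    # Column order requested in the question: ID, length, A, G, C, T, %GC, TGA
    fields = [name, str(len(sequence)), str(a), str(g), str(c), str(t),
              "%.2f" % percent_gc, has_tga]
    return "\t".join(fields)


# Hypothetical sample: the "TGA" in seq1 is split across two lines
sample = [">seq1\n", "ACGT\n", "GA\n", ">seq2\n", "CCGG\n"]
table = [sequence_stats(n, s) for n, s in read_fasta(sample)]
```

Writing the result is then a single loop over table after everything has been computed, e.g. fh2.write(row + "\n") for each row.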