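I'm trying to write a Python script that parses a large FASTA file, and I don't want to use Biopython because I'm learning to script. The script needs to print the accession number, sequence length, and GC content of each sequence to the console. I've been able to extract the accession numbers, but I can't extract the sequences because they are read in line by line, which keeps me from calculating the sequence length and GC content.
Can anyone help? I've tried grouping the lines into a list, but that ends up creating multiple lists inside a list, and I'm not sure how to join them.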
    seq=""
    seqcount=0
    seqlen=0
    gc=0

    #prompt user for file name
    infile=input("Enter the name of your designated .fasta file: ")

    with open(infile, "r") as fasta:
        print("\n")
        print ("Accession Number \t Sequence Length \t GC content (%)")
        for line in fasta:
            line.strip()
            if line[0]==">":
                seqcount+=1 #counts number sequences in file
                accession=line.split("|")[3] #extract accession
                seq=""
            else:
                seq+=line[:-1]
                seqlen=len(seq)
            print(accession, "\t \t", seqlen)
        print("\n")
        print("There are a total of", seqcount, "sequences in this file.")
Answer 0 (score: 2)
You're not far from working code:
    seq=""
    seqcount=0
    accession=""

    def pct_gc(s):
        #GC content as a percentage; 0.0 for an empty sequence
        if not s:
            return 0.0
        gc = s.count('G') + s.count('C') + s.count('g') + s.count('c')
        return gc*100.0/len(s)

    #prompt user for file name
    infile=input("Enter the name of your designated .fasta file: ")

    with open(infile, "r") as fasta:
        print("\n")
        print("Accession Number\tSequence Length\tGC content (%)")
        for line in fasta:
            line = line.strip() #strip() returns a new string, so keep the result
            if not line:
                continue #skip blank lines
            if line[0]==">":
                if seq != "":
                    #a new header means the previous record is complete
                    print("{}\t{}\t{}".format(accession, len(seq), pct_gc(seq)))
                seqcount+=1 #counts number sequences in file
                accession=line.split("|")[3] #extract accession
                seq=""
            else:
                seq+=line #the newline was already removed by strip()
        if seq != "":
            #the last record has no header after it, so print it here
            print("{}\t{}\t{}".format(accession, len(seq), pct_gc(seq)))
        print("\n")
        print("There are a total of " + str(seqcount) + " sequences in this file.")
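Since the question mentions collecting lines into a list and not knowing how to join them, here is a minimal sketch of that approach, with parsing separated from reporting. The `read_fasta` helper and the example records are hypothetical, made up for illustration; the sketch also assumes headers in the `>gi|...|ref|ACCESSION|` layout implied by `split("|")[3]`, and reads from an in-memory string rather than a file so it can be run as is.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header = None
    chunks = []  # sequence lines belonging to the current record
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                # a new header closes the previous record
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # the last record has no header after it
        yield header, "".join(chunks)


def pct_gc(seq):
    """Return GC content as a percentage; 0.0 for an empty sequence."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) * 100.0 / len(seq)


# Made-up example data in the header layout assumed above
example = io.StringIO(
    ">gi|1|ref|ACC0001.1| demo record one\n"
    "ATGC\n"
    "GGCC\n"
    ">gi|2|ref|ACC0002.1| demo record two\n"
    "ATAT\n"
)

records = list(read_fasta(example))
print("Accession Number\tSequence Length\tGC content (%)")
for header, seq in records:
    accession = header.split("|")[3]
    print("{}\t{}\t{:.1f}".format(accession, len(seq), pct_gc(seq)))
print("There are a total of {} sequences in this file.".format(len(records)))
```

Keeping one flat list of strings per record and calling `"".join(chunks)` once avoids the nested lists described in the question. To run it on a real file, pass the handle from `open(infile, "r")` to `read_fasta` in place of `example`.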
Things to look out for: