I need to write a generic parser for FASTA files in Python.
The format looks like this:
>gi|348686675|gb|JH159151.1| Phytophthora sojae unplaced genomic scaffold PHYSOscaffold_1, whole genome shotgun sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA
>gi|348686675|gb|JH159151.1| Phytophthora sojae unplaced genomic scaffold PHYSOscaffold_2, whole genome shotgun sequence
CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA
AGTTTATATATAAATTTCCTTTTTATTGGA
>gi|348686675|gb|JH159151.1| Phytophthora sojae unplaced genomic scaffold PHYSOscaffold_3, whole genome shotgun sequence
GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA
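I need to retrieve each title and sequence separately and insert the values into a MySQL database I have created.
e.g.: title1 = PHYSOscaffold_1
sequence1 = TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA
title2 = PHYSOscaffold_2
sequence2 = CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
and so on... I then insert these values into the MySQL table.
My parsed output should be: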
name1 \t sequence1 \t length_of_sequence \t a_count \t t_count \t g_count \t c_count
name2 \t sequence2 \t length_of_sequence \t a_count \t t_count \t g_count \t c_count
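As a rough illustration of one such row, here is a minimal sketch. `scaffold_name` and `fasta_row` are hypothetical helper names, and the header layout (scaffold name as the last word before the first comma) is assumed from the sample above; other FASTA headers may need different handling.

```python
def scaffold_name(header):
    """Return the last word before the first comma of a FASTA header line.

    Assumption: headers are shaped like the sample in the question.
    """
    return header.lstrip(">").split(",")[0].split()[-1]


def fasta_row(name, sequence):
    """Build one tab-separated row: name, sequence, length, A/T/G/C counts."""
    seq = sequence.upper()
    fields = [name, seq, len(seq)] + [seq.count(base) for base in "ATGC"]
    return "\t".join(str(field) for field in fields)


header = (">gi|348686675|gb|JH159151.1| Phytophthora sojae unplaced genomic "
          "scaffold PHYSOscaffold_1, whole genome shotgun sequence")
sequence = "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA"
print(fasta_row(scaffold_name(header), sequence))
```

Counting with `str.count` keeps the sketch short; for very long sequences `collections.Counter` would do it in a single pass.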
So far I have written a very basic script:
infile = open("simple.fasta")
line = infile.readline()
if not line.startswith(">"):
    raise TypeError("Not a FASTA file: %r" % line)
title = line
sequence_lines = []
while 1:
    line = infile.readline().rstrip()
    if line == "":
        break
    sequence_lines.append(line)
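I only get my first sequence and title.
I am a beginner and would appreciate expert help.
Answer 0 (score: 0)
You only get the first title and sequence because there are blank lines between the reads. So when you do this: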
if line == "":
    break
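the loop breaks after the first sequence. With readline().rstrip() you cannot tell a blank line from the end of the file, because both end up as ''.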
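To make the difference visible, here is a small sketch that reads from an in-memory file. `io.StringIO` stands in for simple.fasta and the record names are made up. Without `rstrip()`, a blank line comes back as "\n" and only the end of the file comes back as "".

```python
import io

# In-memory stand-in for a FASTA file with a blank line between two reads.
handle = io.StringIO(">read1\nACGT\n\n>read2\nTTGA\n")

raw_lines = []
while True:
    raw = handle.readline()
    if raw == "":  # an empty string here really does mean end of file
        break
    raw_lines.append(raw)

# The blank line is kept as "\n", so both reads are seen.
print(raw_lines)
```

Stripping each line only after the end-of-file check would keep the original `while` loop working, although iterating over the file object is simpler.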
Here is an inelegant solution to the problem:
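# State variable so we can handle the start of the file properly.
# There are probably much better ways to do this.
start = True
title = None
sequence_lines = []

with open("simple.fasta") as infile:
    # It's much better to iterate over the lines than to use a while 1 loop.
    for line in infile:
        line = line.strip()
        if not line:
            # Skip the blank lines that made the original loop stop early.
            continue
        if line.startswith(">"):
            if start:
                start = False
            else:
                # Each time we get here we have complete information for a read.
                # You can then store that read in your database;
                # print() is only a placeholder.
                print(title, "".join(sequence_lines))
            sequence_lines = []
            title = line
        else:
            if start:
                raise TypeError("Not a FASTA file: %r" % line)
            sequence_lines.append(line)

# The last read has no following ">" line, so handle it after the loop.
if not start:
    print(title, "".join(sequence_lines))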
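One of the "better ways" could be a generator, which removes the need for the `start` flag. This is only a sketch: `read_fasta` is a hypothetical name, `io.StringIO` stands in for the real file, and the sample records are shortened, made-up data. It raises `ValueError` rather than `TypeError` for malformed input, which is a choice of this sketch.

```python
import io


def read_fasta(handle):
    """Yield (title, sequence) tuples from an open FASTA file handle."""
    title = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if title is not None:
                # A new header means the previous record is complete.
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is None:
            raise ValueError("Not a FASTA file: %r" % line)
        else:
            chunks.append(line)
    if title is not None:
        # The last record is not followed by another header.
        yield title, "".join(chunks)


sample = io.StringIO(">scaffold_1\nTACGAG\n\n>scaffold_2\nCAGTTT\nAGTTTA\n")
records = list(read_fasta(sample))
for title, sequence in records:
    print(title, sequence, len(sequence), sep="\t")
```

With a real file this would be used as `with open("simple.fasta") as infile: for title, sequence in read_fasta(infile): ...`, and each tuple could then be passed to the database insert.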