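I'm new to Python programming and have a FASTA file that I want to parse for use in a specific piece of software. Each record in the file consists of two lines: 1) the sequence identifier and the taxonomy, separated by a space (the species name at the end of the taxonomy may itself contain spaces), and 2) the DNA sequence (see the example below):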
>123876987 Bacteria;test;test;test test test
ATCTGCTGCATGCATGCATCGACTGCATGAC
>239847239 Bacteria;test;test;test1 test1 test1
ACTGACTGCTAGTACGATCGCTGCTGCATGACTGAC
After a lot of effort and some help, I managed to parse my FASTA file into a taxonomy file that shows only the sequence IDs and the taxonomy:
123876987 Bacteria;test;test;test test test
239847239 Bacteria;test;test;test1 test1 test1
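However, the software I'm using requires the taxonomy file to be formatted in a specific way. Each line of the taxonomy file must: 1) have the '>' from the FASTA file removed, 2) have the identifier and the taxonomy from each sequence header separated by a tab (i.e. the first occurrence of a space in the string is replaced with a tab), and 3) have all spaces in the taxonomy string replaced with '_' and the taxonomy terminated with a semicolon (see the example below):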
123876987 Bacteria;test;test;test_test_test;
239847239 Bacteria;test;test;test1_test1_test1;
I have been trying to achieve this by tinkering with my working script:
with open("test.fasta", "r") as fasta, open("test.tax", "w") as tax:
    while True:
        SequenceHeader= fasta.readline()
        Sequence= fasta.readline()
        if SequenceHeader == '':
            break
        tax.write(SequenceHeader.replace('>', ''))
like so:
with open("test.fasta", "r") as fasta, open("clean_corrected.tax", "w") as tax:
    while True:
        SequenceHeader= fasta.readline()
        Sequence= fasta.readline()
        old = {'>',' '}
        new = {'','_'}
        CorrectedHeader = SequenceHeader.replace('old','new')
        if SequenceHeader == '':
            break
        tax.write(CorrectedHeader)
But this doesn't work at all. Does anyone know how I can do this?
Thank you very much for your help!
Answer 0 (score: 2)
The following should work:
with open("test.fasta", "r") as fasta, open("test.tax", "w") as tax:
    for line in fasta:
        if line.startswith('>'):
            line = line[1:]                     # remove the '>' from start of line
            line = line.replace(' ', '\t', 1)   # replace first space with a tab
            line = line.replace(' ', '_')       # replace remaining spaces with '_'
            line = line.strip() + ';\n'         # add ';' to the end
            tax.write(line)                     # write to the output file
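If you want to check the transformation without touching any files, the same steps can be wrapped in a small function. This is only a sketch: the name `header_to_tax_line` is made up for illustration and is not part of the script above, and it assumes every header has exactly the layout shown in the question ('>', identifier, one space, taxonomy).

```python
def header_to_tax_line(header):
    """Convert one FASTA header line to a taxonomy line (hypothetical helper)."""
    text = header.strip()                    # drop the trailing newline
    if text.startswith('>'):
        text = text[1:]                      # remove the leading '>'
    # Split once at the first space: identifier on the left, taxonomy on the right.
    identifier, _, taxonomy = text.partition(' ')
    taxonomy = taxonomy.replace(' ', '_')    # remaining spaces become underscores
    return identifier + '\t' + taxonomy + ';\n'


print(repr(header_to_tax_line('>123876987 Bacteria;test;test;test test test\n')))
# prints '123876987\tBacteria;test;test;test_test_test;\n'
```

Note that in the attempt from the question, `SequenceHeader.replace('old','new')` looks for the literal text "old" rather than using the `old` and `new` variables, so the header comes back unchanged. `str.replace` only accepts one substring pair per call, which is why the replacements are done one at a time.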