Question

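I get the error message below when trying to run the following script. Background: I am trying to split a large FASTA file (~45Mb) into smaller files based on gene ID. I want to cut it every time a ">" appears. The .py script below lets me do that. However, every now and then I get the following error. Any feedback would be appreciated.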

Script:
import os
os.chdir("/vmb/Flavia_All/Python_Commands")
outfile = os.chdir("/vmb/Flavia_All/Python_Commands")

import sys
infile = open(sys.argv[1])
outfile = []

for line in infile:
    if line.startswith(">"):
        if (outfile != []): outfile.close()
        genename = line.strip().split('|')[1]
        filename = genename+".fasta"
        outfile = open(filename,'w')
        outfile.write(line)
    else:
        outfile.write(line)
outfile.close()

Error message when running the script:

Traceback (most recent call last):
  File "splitting_fasta.py", line 14, in <module>
    outfile = open(filename,'w')
IOError: [Errno 2] No such file or directory: 'AY378100.1_cds_AAR07818.1_173 [gene=pbrB/pbrC] [protein=PbrB/PbrC] [protein_id=AAR07818.1] [location=complement(152303..153451)].fasta'

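*Note: AY378100.1_cds_AAR07818.1 is one of many genes in this FASTA file. It is not the only gene for which I have gotten this message. I would like to avoid having to delete every gene that produces this message.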

Answer 1

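It seems that some of the FASTA "names" (the first description line, the one starting with >) contain too much. In particular, they contain characters that are not allowed in file names. In the failing header, the "/" in gene=pbrB/pbrC is read as a directory separator, so open() looks for a directory that does not exist.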
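To see why open() fails, here is a small sketch that builds the file name the same way the script does and then splits it as a path. The "lcl|" prefix on the header is an assumption (the question only shows the part after the first "|"), and the location field is left out for brevity.

```python
import os

# Hypothetical header line; the "lcl|" prefix is assumed, the rest is
# taken from the error message in the question.
header = ">lcl|AY378100.1_cds_AAR07818.1_173 [gene=pbrB/pbrC] [protein=PbrB/PbrC]\n"

# Same expression the original script uses to build the file name.
filename = header.strip().split('|')[1] + ".fasta"

# The "/" characters make the OS treat everything before the last one
# as a directory that would have to exist already.
directory, basename = os.path.split(filename)
print("directory part:", directory)
print("file part:     ", basename)
```

Only the part after the last "/" is treated as the file itself; everything before it would have to be an existing directory, which is why the error is "No such file or directory".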

If a name such as the GenBank ID (AY378100) is unambiguous enough, then:

genename = line.strip().split('|')[1].split('.')[0]

may be fine. If you have many FASTA sequences with the same ID, you may prefer:

genename = line.strip().split('|')[1].split('[')[0].strip()
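Putting this together, the following is one possible sketch of the whole script, not the only way to do it. The names safe_name and split_fasta, the outdir parameter, and the choice to replace every character other than letters, digits, ".", "_" and "-" with "_" are assumptions made for illustration.

```python
import os
import re


def safe_name(header):
    """Turn a FASTA header line into a string that is safe as a file name."""
    fields = header.strip().lstrip(">").split("|")
    # Same field the original script uses; fall back to the whole header
    # if there is no '|' (an assumption about the input).
    name = fields[1] if len(fields) > 1 else fields[0]
    # Drop the bracketed annotations, as in the second suggestion above.
    name = name.split("[")[0].strip()
    # Replace anything that is not a letter, digit, '.', '_' or '-'.
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


def split_fasta(path, outdir="."):
    """Write each record of the FASTA file at `path` to outdir/<name>.fasta."""
    outfile = None
    with open(path) as infile:
        for line in infile:
            if line.startswith(">"):
                if outfile is not None:
                    outfile.close()
                target = os.path.join(outdir, safe_name(line) + ".fasta")
                # Append mode, so records that share a name end up in the
                # same file instead of overwriting each other.
                outfile = open(target, "a")
            if outfile is not None:
                outfile.write(line)
    if outfile is not None:
        outfile.close()
```

To use it like the original script, call split_fasta(sys.argv[1]) after import sys. Because the files are opened in append mode, delete the old output before re-running, otherwise records will be written twice.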
