Question

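I'm trying to read a FASTA file, then find a specific motif (string) and print the sequence it occurs in and the number of times it occurs. A FASTA file is just a series of sequences (strings), each starting with a header line; a header, and so the start of a new sequence, is marked by ">". The sequence of letters follows on a new line after the header. I haven't finished the code, but this is what I have so far, and it gives me this error:

AttributeError: 'str' object has no attribute 'next'

I'm not sure what is wrong here.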

import re

header=""
counts=0
newline=""

f1=open('fpprotein_fasta(2).txt','r')
f2=open('motifs.xls','w')
for line in f1:
    if line.startswith('>'):
        header=line
        #print header
        nextline=line.next()
        for i in nextline:
            motif="ML[A-Z][A-Z][IV]R"
            if re.findall(motif,nextline):
                counts+=1
                #print (header+'\t'+counts+'\t'+motif+'\n')
        fout.write(header+'\t'+counts+'\t'+motif+'\n')

f1.close()
f2.close()

Answer 1

The error is most likely coming from this line:

nextline=line.next()

line is a string you have already read, and strings have no next() method.

Part of the problem is that you are trying to mix two different ways of reading the file: iterating with for line in f1 and calling <handle>.next().
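If you do want to pull the following line inside the loop, the call has to go to the file handle, not to the string. Here is a minimal sketch of that, assuming Python 3 (where the built-in next(handle) takes the place of the older handle.next() method) and assuming every sequence sits on exactly one line. The io.StringIO object and the two records are made up and stand in for the real file.

```python
import io

# Hypothetical two-record FASTA content; io.StringIO stands in for open(...)
fasta_text = ">seq1\nMLAAIRGG\n>seq2\nGGGG\n"
handle = io.StringIO(fasta_text)

pairs = []
for line in handle:
    if line.startswith(">"):
        header = line.strip()
        # Advance the file iterator, not the string: str has no next().
        # The "" default avoids StopIteration if a header is the last line.
        sequence = next(handle, "").strip()
        pairs.append((header, sequence))

print(pairs)
```

This breaks down as soon as a sequence is wrapped over several lines, so sticking to a single for loop that remembers the current header is usually the simpler choice.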

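Also, if you are working with FASTA files, I would suggest using Biopython: it makes handling collections of sequences much easier. Chapter 14, on motifs, will be of particular interest to you. You may need to learn a bit more Python to get what you want out of it, but if you are going to do more bioinformatics than your example here, it is definitely worth the time.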

Answer 2

This might help get you moving in the right direction:

import re

def parse(fasta, outfile):
    motif = "ML[A-Z][A-Z][IV]R"
    header = None
    count = 0
    with open(fasta, 'r') as fin, open(outfile, 'w') as fout:
        for line in fin:
            if line.startswith('>'):
                # New record: write out the previous record's count first
                if header is not None:
                    fout.write(header + '\t' + str(count) + '\t' + motif + '\n')
                header = line.rstrip()  # drop the trailing newline from the header
                count = 0
            else:
                # Add the (non-overlapping) matches found on this sequence line
                count += len(re.findall(motif, line))
        # Write out the final record
        if header is not None:
            fout.write(header + '\t' + str(count) + '\t' + motif + '\n')


if __name__ == '__main__':
    parse("fpprotein_fasta(2).txt", "motifs.xls")
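One caveat with counting line by line: FASTA sequences are often wrapped across several lines, so a motif that straddles a line break would be missed. A sketch of one way around that is to join each record's lines before searching. The helper name count_motifs and the sample records below are illustrative and not part of the answer above.

```python
import io
import re

MOTIF = re.compile(r"ML[A-Z][A-Z][IV]R")

def count_motifs(handle, pattern=MOTIF):
    """Yield (header, count) per FASTA record, searching the joined sequence."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # Emit the previous record before starting a new one
            if header is not None:
                yield header, len(pattern.findall("".join(chunks)))
            header, chunks = line, []
        elif line:
            chunks.append(line)
    # Emit the last record, if any
    if header is not None:
        yield header, len(pattern.findall("".join(chunks)))

# Made-up data: the motif MLAAIR in seq1 is split across two lines
sample = io.StringIO(">seq1\nGGML\nAAIRGG\n>seq2\nMLKKVRMLAAIR\n")
results = list(count_motifs(sample))
print(results)
```

Passing an open file object instead of the io.StringIO sample should work the same way, since the function only iterates over lines.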

Answer 3

I'm not sure about the FASTA side of things, but I'm pretty sure this is where you are going wrong:

nextline=line.next()

line is just a str, so you cannot call next() on it.

Also, for the files, I would recommend using:

with open('fpprotein_fasta(2).txt','r') as f1:

This takes care of closing the file automatically.
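A quick way to see this for yourself, as a sketch that uses a throwaway temporary file in place of the real input:

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on real data
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write(">seq1\nMLAAIR\n")

with open(path, "r") as f1:
    first_line = f1.readline().strip()
    closed_inside = f1.closed   # still open inside the block

closed_after = f1.closed        # closed once the block exits, even after an error
os.remove(path)
print(first_line, closed_inside, closed_after)
```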

I'd encourage you to post a sample FASTA file so that I can try to correct the code.

How do I read a FASTA file in Python?
