Question

Beginner here. I want to write a function in Python that searches a FASTA file for a gene name and returns the read that corresponds to it.

Example FASTA file:

>name1
AATTCCGG
>name2
ATCGATCG

My code so far (very rough):

def findseq(name):
    with open('cel39.fa', 'rb') as csv_file:
        csv_reader = csv.reader(csv_file)
        for i in csv_reader:
            if i == '>' + name:
                return i+1
                break

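This doesn't actually work, because I can't return 'i + 1'. I also can't iterate over len(csv_reader), because the reader has no 'len'. In addition, I'm not sure whether there is a more efficient (but still simple) way to search, so that I don't have to iterate over the whole file every time (it will be thousands of lines).

Specifically, is there a better way to read a FASTA file? And is there a way to return the read I'm looking for?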

findseq(name1)

should return 'AATTCCGG'

Thanks!

Answer 1

Take a look at the Python library Biopython.

It contains a large number of useful tools and methods.

Here is an example of parsing a FASTA file:

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

This example prints every record in the FASTA file.

For your purposes:

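from Bio import SeqIO

def findseq(name):
    # Scan the records and return the sequence whose ID matches the name
    for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
        if seq_record.id == name:
            return seq_record.seq
    return None  # no record with that ID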

Answer 2

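Because a sequence in a FASTA file can span multiple lines, you have to concatenate the lines until the next '>' is found. The code below builds a dictionary with the gene names as keys and the gene sequences as values.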

with open('cel39.fa') as fp:
    lines = fp.read().splitlines()

geneDict = {}

# Placeholder key so the first header line has something to flush
geneName = 'dummy'
fastaSeq = ''

for line in lines:
    if line.startswith('>'):
        # A new record starts here, so store the previous one
        geneDict.update({geneName: fastaSeq})
        geneName = line[1:]
        fastaSeq = ''
    else:
        fastaSeq += line

geneDict.update({geneName: fastaSeq})  # Store the record from the last loop
geneDict.pop('dummy')  # Now remove the placeholder

print(geneDict['name1'])
print(geneDict['name2'])

It prints:

AATTCCGG
ATCGATCG
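
If you would rather not hold the whole file in a list of lines, the same idea can be written as a generator. This is only a minimal sketch using the standard library: the names iter_fasta, findseq and load_fasta are made up for illustration, and it assumes the whole header line after '>' is the gene name, as in the example file.

```python
def iter_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file, one record at a time."""
    name = None
    chunks = []
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith('>'):
                # A new header: emit the record collected so far, if any
                if name is not None:
                    yield name, ''.join(chunks)
                name = line[1:]
                chunks = []
            elif name is not None:
                chunks.append(line)
    # Emit the final record after the last line of the file
    if name is not None:
        yield name, ''.join(chunks)


def findseq(name, path):
    """Return the sequence stored under name, or None if it is not found."""
    for record_name, sequence in iter_fasta(path):
        if record_name == name:
            return sequence  # stop reading once the record is found
    return None


def load_fasta(path):
    """Read the file once into a dict, so repeated lookups do not rescan it."""
    return dict(iter_fasta(path))
```

With the example file from the question saved as cel39.fa, findseq('name1', 'cel39.fa') returns 'AATTCCGG'. For a single lookup findseq stops at the first match; if you need many lookups, calling load_fasta once and indexing the resulting dictionary is likely the simpler choice.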

Searching a FASTA file in Python and efficiently returning the read
