Question

I'm fairly new to Python and I like it. But I've run into this problem, and I hope you can give me some pointers on what I'm missing.

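I have a list of gene IDs in an Excel file, and I'm trying to use xlrd and Biopython to retrieve the sequences and save my results (in FASTA format) to a text document. So far, my code lets me see the results in the shell, but it only saves the last sequence to the document.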

Here is my code:

import xlrd
import re
book = xlrd.open_workbook('ids.xls')
sh = book.sheet_by_index(0)
for rx in range(sh.nrows):
    if sh.row(rx)[0].value:
        from Bio import Entrez
        from Bio import SeqIO
        Entrez.email = "mail@xxx.com"
        in_handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=sh.row(rx)[0].value)
        record = SeqIO.parse(in_handle, "fasta")
        for record in SeqIO.parse(in_handle, "fasta"):
            print record.format("fasta")
        out_handle = open("example.txt", "w")
        SeqIO.write(record, out_handle, "fasta")
        in_handle.close()
        out_handle.close()

As I mentioned, the file "example.txt" only contains the last of the sequences shown in the shell (in FASTA format).

Could anyone please help me get all the sequences I retrieve from NCBI into the same file?

Thanks a lot,

Antonio

Answer 1

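I'm also quite new to Python and love it too! This is my first attempt at answering a question, but maybe it's because of your loop structure and the 'w' mode? Try changing ("example.txt", "w") to append mode ("example.txt", "a"), as shown below.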

import xlrd
import re
book = xlrd.open_workbook('ids.xls')
sh = book.sheet_by_index(0)
for rx in range(sh.nrows):
    if sh.row(rx)[0].value:
        from Bio import Entrez
        from Bio import SeqIO
        Entrez.email = "mail@xxx.com"
        in_handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=sh.row(rx)[0].value)
        record = SeqIO.parse(in_handle, "fasta")
        for record in SeqIO.parse(in_handle, "fasta"):
            print record.format("fasta")
        out_handle = open("example.txt", "a")
        SeqIO.write(record, out_handle, "fasta")
        in_handle.close()
        out_handle.close()
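
To see why switching the mode matters, here is a minimal sketch that uses only the standard library. The helper name `write_in_loop` and the placeholder strings are assumptions made for illustration; they stand in for the real FASTA records. It reopens a file on every pass, the way the question's loop does, once with "w" and once with "a":

```python
import os
import tempfile

def write_in_loop(path, mode, items):
    """Reopen `path` with `mode` for every item, the way the question's loop does."""
    for item in items:
        with open(path, mode) as handle:
            handle.write(item + "\n")
    with open(path) as handle:
        return handle.read().splitlines()

tmp_dir = tempfile.mkdtemp()
fake_records = [">seq1", ">seq2", ">seq3"]  # placeholder strings, not real FASTA records

overwritten = write_in_loop(os.path.join(tmp_dir, "w_mode.txt"), "w", fake_records)
appended = write_in_loop(os.path.join(tmp_dir, "a_mode.txt"), "a", fake_records)

print(overwritten)  # "w" truncates on every open, so only the last item survives
print(appended)     # "a" keeps what earlier passes wrote
```

One thing to keep in mind with append mode: running the script a second time adds the same sequences to the end of the existing file, so you may want to delete "example.txt" between runs.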

Answer 2

Almost there, my friends!

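The main problem is that your for loop opens the file in "w" mode and closes it again on every iteration, so each pass overwrites what the previous one wrote. I also fixed a few small things that should speed up the code (for example, you were importing Bio again on every loop).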

Use the following new code:

out_handle = open("example.txt", "w")
import xlrd
import re
from Bio import Entrez
from Bio import SeqIO
book = xlrd.open_workbook('ids.xls')
sh = book.sheet_by_index(0)
for rx in range(sh.nrows):
    if sh.row(rx)[0].value:
        Entrez.email = "mail@xxx.com"
        # Pass the ID stored in the cell, not the row index rx
        in_handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=sh.row(rx)[0].value)
        record = SeqIO.parse(in_handle, "fasta")
        SeqIO.write(record, out_handle, "fasta")
        in_handle.close()
out_handle.close()

If you still get errors, the problem is probably in the Excel file. If it persists, send it to me and I'll help :)
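
If the Excel file does turn out to be the culprit, one likely cause is that xlrd returns numeric cells as floats, so a numeric ID can arrive as 12345.0. The sketch below shows the "open the output once, write inside the loop" pattern together with a small ID clean-up. The names `clean_id`, `save_all` and `fake_fetch` are hypothetical, and `fake_fetch` is only a stand-in for the real network call so that the example runs offline:

```python
import os
import tempfile

def clean_id(cell_value):
    """Turn a spreadsheet cell value into an ID string (numeric cells arrive as floats)."""
    if isinstance(cell_value, float) and cell_value.is_integer():
        return str(int(cell_value))
    return str(cell_value).strip()

def save_all(cell_values, fetch_fasta, out_path):
    """Open the output file once and write one FASTA entry per non-empty cell."""
    written = 0
    with open(out_path, "w") as out_handle:
        for value in cell_values:
            seq_id = clean_id(value)
            if not seq_id:
                continue  # skip empty cells
            out_handle.write(fetch_fasta(seq_id).rstrip("\n") + "\n")
            written += 1
    return written

def fake_fetch(seq_id):
    # Hypothetical stand-in for a network call that returns FASTA text for one ID
    return ">%s\nACGT\n" % seq_id

out_path = os.path.join(tempfile.mkdtemp(), "example.txt")
# Placeholder cell values: a text ID, an empty cell, and a numeric cell read as a float
count = save_all(["ABC123", "", 12345.0], fake_fetch, out_path)
with open(out_path) as handle:
    contents = handle.read()
print(count)
print(contents)
```

In the real script, the fetch function would wrap the Entrez.efetch call from the code above, and the cell values would come from the first column of the sheet.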

Saving sequences from NCBI in FASTA format using a list of IDs from Excel
