Question

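I have a FASTA file that can easily be parsed with SeqIO.parse.

I am interested in extracting the sequence IDs and sequence lengths. I used the lines below to do it, but I feel this is far too heavy (two iterations, conversions, etc.).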

from Bio import SeqIO
import pandas as pd


# parse sequence fasta file
identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta",
                                                           "fasta")]
lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta",
                                                             "fasta")]
# convert the lists to pandas Series (pandas is imported as pd above)
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(lengths, name='length')
# gather the Series into a pandas DataFrame and use the ID column as the index
Qfasta = pd.DataFrame(dict(ID=s1, length=s2)).set_index(['ID'])

I can do it with only one iteration, but then I get a dict:

records = SeqIO.parse(fastaFile, 'fasta')

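and somehow I can't get DataFrame.from_dict to work...

My goal is to iterate over the FASTA file once and get the IDs and sequence lengths into a DataFrame during that iteration.

For those who want to help, here is a short FASTA file.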

Answer 1

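You're spot on: you definitely shouldn't parse the file twice, and storing the data in a dictionary wastes computing resources when you will just convert it to numpy arrays later.

SeqIO.parse() returns a generator, so you can iterate record by record, building the lists like so:

from Bio import SeqIO

with open('sequences.fasta') as fasta_file:  # Will close handle cleanly
    identifiers = []
    lengths = []
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        identifiers.append(seq_record.id)
        lengths.append(len(seq_record.seq))

See Peter Cock's answer for a more efficient way of parsing just the IDs and sequences from a FASTA file.

The rest of your code looks pretty good to me. However, if you really want to optimize for use with pandas, read on: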

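On minimizing memory usage

Consulting the source of pandas.Series, we can see that data is stored internally as a numpy ndarray:

class Series(np.ndarray, Picklable, Groupable):
    """Generic indexed series (time series or otherwise) object.

    Parameters
    ----------
    data:  array-like
        Underlying values of Series, preferably as numpy ndarray

If you make identifiers an ndarray, it can be used directly in Series without constructing a new array (the copy parameter, default False, prevents a new ndarray from being created when one isn't needed). By storing your sequences in a list, you force Series to coerce that list to an ndarray.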
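As a rough illustration of the point about handing Series an existing ndarray, here is a minimal sketch. The array values are hypothetical, and an integer array is used because that is where buffer reuse is most likely to apply. Whether the buffer is really shared depends on the pandas version and the dtype, so the script prints the result of the check rather than assuming it.

```python
import numpy as np
import pandas as pd

# Hypothetical sequence lengths, already held in a numpy array.
lengths = np.array([10, 7, 12], dtype=np.int64)

# copy=False asks pandas to reuse the existing buffer where it can.
length_series = pd.Series(lengths, name='length', copy=False)

# Report whether the Series and the original array share memory.
print(np.shares_memory(lengths, length_series.to_numpy()))
```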

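Avoid initializing lists

If you know in advance exactly how many sequences you have (and how long the longest ID will be), you could initialize an empty ndarray to hold the identifiers like so:

import numpy

num_seqs = 50
max_id_len = 60
numpy.empty((num_seqs, 1), dtype='S{:d}'.format(max_id_len))

Of course, it's pretty hard to know exactly how many sequences you'll have, or what the longest ID is, so it's easiest to just let numpy convert from an existing list. However, this is technically the fastest way to store your data for use in pandas.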
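To make the preallocation idea concrete, here is a minimal sketch of filling fixed-size arrays during a single pass. It departs from the snippet above in two ways that are my own assumptions: it uses a one-dimensional array, and it uses the unicode dtype 'U' instead of 'S' so that Python 3 strings can be stored directly. The records list is a hypothetical stand-in for whatever the FASTA parser yields.

```python
import numpy as np
import pandas as pd

# Hypothetical (ID, sequence) pairs standing in for parser output.
records = [('seq1', 'ACGTACGTAC'), ('seq2', 'ACGTGGA')]

num_seqs = len(records)  # in a real run this must be known before parsing
max_id_len = 60          # assumed upper bound on ID length

# Preallocate one slot per record for the IDs and the lengths.
identifiers = np.empty(num_seqs, dtype='U{:d}'.format(max_id_len))
lengths = np.empty(num_seqs, dtype=np.int64)

# Fill the arrays in place while iterating once over the records.
for i, (seq_id, sequence) in enumerate(records):
    identifiers[i] = seq_id
    lengths[i] = len(sequence)

qfasta = pd.DataFrame({'length': lengths},
                      index=pd.Index(identifiers, name='ID'))
print(qfasta)
```

Note that an ID longer than max_id_len would be silently truncated by numpy, which is one more reason the list-based approach is the easier default.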

Answer 2

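David has given you a nice answer on the pandas side. On the Biopython side, you don't need to use SeqRecord objects via Bio.SeqIO if all you want is the record identifiers and their sequence lengths. This should be faster: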

from Bio.SeqIO.FastaIO import SimpleFastaParser
with open('sequences.fasta') as fasta_file:  # Will close handle cleanly
    identifiers = []
    lengths = []
    for title, sequence in SimpleFastaParser(fasta_file):
        identifiers.append(title.split(None, 1)[0])  # First word is ID
        lengths.append(len(sequence))
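Neither answer shows the final step of getting the parsed values into the DataFrame the question asks for, so here is a minimal single-pass sketch. The helper name fasta_lengths and the two-record FASTA text are hypothetical, and an in-memory io.StringIO handle stands in for an open file.

```python
import io

import pandas as pd
from Bio.SeqIO.FastaIO import SimpleFastaParser

# Hypothetical two-record FASTA, used here instead of a file on disk.
fasta_text = """>seq1 first example record
ACGTACGTAC
>seq2 second example record
ACGT
GGA
"""


def fasta_lengths(handle):
    """Return a DataFrame of sequence lengths indexed by record ID."""
    # One pass over the file: keep the first word of each title as the ID.
    rows = [(title.split(None, 1)[0], len(sequence))
            for title, sequence in SimpleFastaParser(handle)]
    return pd.DataFrame(rows, columns=['ID', 'length']).set_index('ID')


qfasta = fasta_lengths(io.StringIO(fasta_text))
print(qfasta)
```

With a real file, the same helper could be called as fasta_lengths(fasta_file) inside the with open(...) block shown above.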

Biopython SeqIO to Pandas DataFrame
