Question

Here is a subset of my code:

import numpy as np
from Bio import SeqIO

def main():
    filename = "path/to/something"  # sys.argv[1]
    B = list(SeqIO.parse(filename + ".fasta", "fasta"))
    # parse_id (not shown) turns a record id into (it, pop, ind, locus)
    A = np.array([parse_id(datum.id) for datum in B])
    maxit, maxpop, maxind, maxlocus = A.max(axis=0) + 1
    maxsite = len(B[0].seq)
    x = np.full((maxit, maxpop, maxind, maxlocus, maxsite), 'a', dtype="S")
    for a, b in zip(A, B):
        x[tuple(a)] = b

main()

The process spends 95% of its time in this tiny for loop:

for a, b in zip(A, B):
    x[tuple(a)] = b

How can I make this code faster? Would Cython be useful here? Should I give up and write the whole thing in C?

Description of the objects A, B and x

type(A) # `numpy.ndarray`
A.shape # (x, 4) --> x is a function of the parameters
A.ndim # 2
type(A[0][0]) # <type 'numpy.int64'>

type(B) # list
len(B) # x --> x is a function of the parameters
type(B[0]) # <class 'Bio.SeqRecord.SeqRecord'>

type(x) # <type 'numpy.ndarray'>
x.shape # (n, m, o, p, q); depends on the parameters. Typical values would be (1, 20, 200, 10, 999999), i.e. a lot of sites and quite a few individuals.

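Here is a minimal example file and a piece of code that let you try to improve this loop. (The file may be too short to detect any difference in performance. You may need to build a longer file by extrapolating from the current one.)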

Example file

>it0pop0ind0locus0
ATGTTG
>it0pop0ind1locus0
ATGTTG
>it0pop0ind2locus0
ATGTTG
>it0pop0ind3locus0
ATGTTG
>it0pop0ind4locus0
ATGTTG
>it0pop0ind5locus0
ATGTTG
>it0pop0ind6locus0
ATGTTG
>it0pop0ind7locus0
ATGTTG
>it0pop1ind0locus0
ATGTTG
>it0pop1ind1locus0
ATGTTG
>it0pop1ind2locus0
ATGTTG
>it0pop1ind3locus0
ATGTTG
>it0pop1ind4locus0
ATGTTG
>it0pop1ind5locus0
ATGTTG
>it0pop1ind6locus0
ATGTTG
>it0pop1ind7locus0
ATGTTG
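
Because the sample above is probably too short to show a timing difference, a longer file with the same header layout can be generated. The following is only a sketch: `write_test_fasta`, its parameters and the regex-based `parse_id` are hypothetical (the asker's own `parse_id` is not shown in the post, so its behaviour here is an assumption inferred from the header layout), and the sequences are random rather than biologically meaningful.

```python
import random
import re

# Assumed header layout, taken from the example file: it<i>pop<p>ind<n>locus<l>
_ID_PATTERN = re.compile(r"it(\d+)pop(\d+)ind(\d+)locus(\d+)$")

def parse_id(record_id):
    """Assumed behaviour of the asker's parse_id: id -> (it, pop, ind, locus)."""
    match = _ID_PATTERN.match(record_id)
    if match is None:
        raise ValueError("unexpected record id: %r" % record_id)
    return tuple(int(group) for group in match.groups())

def write_test_fasta(path, n_it=1, n_pop=2, n_ind=8, n_locus=1, n_site=6, seed=0):
    """Write a FASTA file with random sequences; return the number of records.

    Hypothetical benchmarking helper, not part of the original post.
    """
    rng = random.Random(seed)
    n_records = 0
    with open(path, "w") as handle:
        for it in range(n_it):
            for pop in range(n_pop):
                for ind in range(n_ind):
                    for locus in range(n_locus):
                        seq = "".join(rng.choice("ACGT") for _ in range(n_site))
                        handle.write(">it%dpop%dind%dlocus%d\n" % (it, pop, ind, locus))
                        handle.write(seq + "\n")
                        n_records += 1
    return n_records
```

For example, `write_test_fasta("test_large.fasta", n_pop=20, n_ind=200, n_locus=10, n_site=1000)` writes 40000 records, which should be enough for the loop to dominate the run time.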

Answer 1

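From the way you describe the problem in the comments, it seems that you want single-site frequency counts of each base at each site in each locus. To make sure we are on the same page, this is what I understand by frequency counts. The sequences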

ACGTC
AGGTC
AGCTC   
ACCAC
TCGAG

would produce something like:

pos 0: [('A', 4), ('C', 0), ('G', 0), ('T', 1)]
pos 1: [('A', 0), ('C', 3), ('G', 2), ('T', 0)]
pos 2: [('A', 0), ('C', 2), ('G', 3), ('T', 0)]
pos 3: [('A', 2), ('C', 0), ('G', 0), ('T', 3)]
pos 4: [('A', 0), ('C', 4), ('G', 1), ('T', 0)]

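If that is the case, then judging from the dimensions of x, the number of inds / loci / sites looks relatively manageable (400, 10, 1000000). If you can parse only the sequences for a specific iteration/subpopulation etc., I would personally do something like this: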

import numpy as np

# This takes a list of N sequences of length L and converts them to
# an NxL array of integers (A=0, C=1, G=2, T=3, anything else=-1)
# so that computing frequencies is fast
def vectorize_seqs(seqs):
    trans_table = -np.ones(256, dtype=int)
    # frombuffer needs bytes, not str, under Python 3
    trans_table[np.frombuffer(b'ACGT', np.uint8)] = np.arange(4)
    # str(seq) accepts plain strings as well as Bio.Seq objects
    return np.array([trans_table[np.frombuffer(str(seq).encode('ascii'), np.uint8)]
                     for seq in seqs])

# Then compute the frequency counts
seqs = ['ACGTC', 'AGGTC', 'AGCTC', 'ACCAC', 'TCGAG']  # the example above
bins = np.arange(5, dtype=int)
vseqs = vectorize_seqs(seqs)
N, L = vseqs.shape
counts = np.array([np.histogram(vseqs[:, i], bins)[0] for i in range(L)])
# cast to float so the division is not truncated under Python 2
freqs = counts / counts.sum(axis=1)[:, None].astype(float)
# now you have either an array of Lx4 counts or Lx4 normalized frequencies

From here, you can store the Lx4 frequencies/counts for each combination of iteration, subpopulation, etc.
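
One possible way to do that last step is sketched below, under several assumptions: the records are available as (id, sequence) pairs, the ids follow the `it…pop…ind…locus…` layout of the example file, and all sequences in a group have the same length. The helper names `count_bases` and `site_counts_by_group` are hypothetical. The counting uses one vectorized comparison per base instead of calling `np.histogram` once per site; that is a swapped-in technique rather than the code given above, chosen because a Python-level loop over a million sites is likely to be slow.

```python
import re
from collections import defaultdict

import numpy as np

# Assumed header layout, taken from the example file
ID_PATTERN = re.compile(r"it(\d+)pop(\d+)ind(\d+)locus(\d+)$")

def count_bases(seqs):
    """Return an (L, 4) array of A/C/G/T counts for N equal-length sequences."""
    trans_table = np.full(256, -1, dtype=np.int8)
    trans_table[np.frombuffer(b"ACGT", np.uint8)] = np.arange(4)
    codes = np.array([trans_table[np.frombuffer(s.encode("ascii"), np.uint8)]
                      for s in seqs])
    # One pass per base keeps the temporary boolean array at N x L.
    # Characters other than A/C/G/T are coded -1 and counted nowhere.
    return np.stack([(codes == base).sum(axis=0) for base in range(4)], axis=1)

def site_counts_by_group(records):
    """Group (id, sequence) pairs by (it, pop, locus); count bases per site."""
    groups = defaultdict(list)
    for record_id, seq in records:
        match = ID_PATTERN.match(record_id)
        if match is None:
            raise ValueError("unexpected record id: %r" % record_id)
        it, pop, ind, locus = (int(g) for g in match.groups())
        groups[(it, pop, locus)].append(seq)
    return {key: count_bases(seqs) for key, seqs in groups.items()}
```

With Biopython, the input could be built as `((rec.id, str(rec.seq)) for rec in SeqIO.parse(path, "fasta"))`. Each value in the returned dictionary is an Lx4 count array; dividing it by its row sums (as floats) gives the frequencies. Only one group's sequences need to be held as an integer array at a time, which avoids allocating the full five-dimensional `x`.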

Improving the performance of assigning values to a high-dimensional numpy object
