How can I speed up transition matrix creation in Numpy?

Asked: 2012-11-04 13:41:55

Tags: python numpy scipy

Here is the most basic way I know of to count the transitions in a Markov chain and use them to populate a transition matrix:

def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1
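
For reference, this is roughly how I call it (the state labels and matrix size below are toy values, just to show the setup): the states are non-negative integers that index directly into a square, zero-initialized counts matrix.

import numpy as np

number_of_states = 4
chain = [0, 1, 1, 3, 2, 0, 1]            # toy chain of integer state labels
counts = np.zeros((number_of_states, number_of_states))
increment_counts_in_matrix_from_chain(chain, counts)
# counts[i, j] now holds the number of observed i -> j transitions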

I have tried to speed it up in three different ways:

1) A sparse-matrix one-liner based on this Matlab code:

transition_matrix = full(sparse(markov_chain(1:end-1), markov_chain(2:end), 1))

which in Numpy/SciPy looks like this:

from scipy.sparse import coo_matrix

def get_sparse_counts_matrix(markov_chain, number_of_states):
    return coo_matrix(([1]*(len(markov_chain) - 1), (markov_chain[0:-1], markov_chain[1:])), shape=(number_of_states, number_of_states))
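
A small usage sketch (the chain here is made up): coo_matrix keeps duplicate (i, j) coordinates and sums them when the matrix is converted, so the dense equivalent of Matlab's full(sparse(...)) is obtained with toarray():

chain = [0, 1, 1, 3, 2, 0, 1]                      # toy chain
sparse_counts = get_sparse_counts_matrix(chain, number_of_states=4)
dense_counts = sparse_counts.toarray()             # duplicate (i, j) entries are summed here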

2) A couple of Python tweaks, like using zip():

for old_state, new_state in zip(markov_chain[0:-1], markov_chain[1:]):
    transition_counts_matrix[old_state, new_state] += 1 

3) And a Queue:

from Queue import Queue

old_and_new_states_holder = Queue(maxsize=2)
old_and_new_states_holder.put(markov_chain[0])
for new_state in markov_chain[1:]:
    old_and_new_states_holder.put(new_state)
    old_state = old_and_new_states_holder.get()
    transition_counts_matrix[old_state, new_state] += 1

But none of these three approaches sped things up. In fact, everything except the zip() solution was at least 10x slower than my original solution.

Are there any other solutions worth looking into?



Improved solution for building a transition matrix from lots of chains
DSM's answer below was the best solution to the question above. However, for anyone who wants to populate a transition matrix from a list of millions of Markov chains, the quickest way is this:

import itertools
import numpy

def fast_increment_transition_counts_from_chain(markov_chain, transition_counts_matrix):
    flat_coords = numpy.ravel_multi_index((markov_chain[:-1], markov_chain[1:]), transition_counts_matrix.shape)
    transition_counts_matrix.flat += numpy.bincount(flat_coords, minlength=transition_counts_matrix.size)

def get_fake_transitions(markov_chains):
    fake_transitions = []
    for i in xrange(1,len(markov_chains)):
        old_chain = markov_chains[i - 1]
        new_chain = markov_chains[i]
        end_of_old = old_chain[-1]
        beginning_of_new = new_chain[0]
        fake_transitions.append((end_of_old, beginning_of_new))
    return fake_transitions

def decrement_fake_transitions(fake_transitions, counts_matrix):
    for old_state, new_state in fake_transitions:
        counts_matrix[old_state, new_state] -= 1

def fast_get_transition_counts_matrix(markov_chains, number_of_states):
    """50% faster than original, but must store 2 additional slice copies of all markov chains in memory at once.
    You might need to break up the chains into manageable chunks that don't exceed your memory.
    """
    transition_counts_matrix = numpy.zeros([number_of_states, number_of_states])
    fake_transitions = get_fake_transitions(markov_chains)
    markov_chains = list(itertools.chain(*markov_chains))
    fast_increment_transition_counts_from_chain(markov_chains, transition_counts_matrix)
    decrement_fake_transitions(fake_transitions, transition_counts_matrix)
    return transition_counts_matrix
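
As the docstring above warns, the flattened copies can get large. One way around that (a rough sketch; chunked_transition_counts and its chunk_size parameter are names I made up here) is to feed the chains through in batches of whole chains and sum the per-batch count matrices, which gives the same totals because real transitions never cross chain boundaries:

def chunked_transition_counts(markov_chains, number_of_states, chunk_size=1000000):
    """Hypothetical helper: accumulate counts over batches of whole chains so that
    the temporary flattened copies never hold much more than chunk_size states."""
    totals = numpy.zeros([number_of_states, number_of_states])
    batch, batch_length = [], 0
    for chain in markov_chains:
        batch.append(chain)
        batch_length += len(chain)
        if batch_length >= chunk_size:
            totals += fast_get_transition_counts_matrix(batch, number_of_states)
            batch, batch_length = [], 0
    if batch:
        totals += fast_get_transition_counts_matrix(batch, number_of_states)
    return totals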

4 Answers:

Answer 0 (score: 8)

Just for kicks, and because I've been wanting to try it out, I applied Numba to your problem. In code, that just involves adding a decorator (although I've made the call directly here so I could test both of the jit variants numba provides):

import numpy as np
import numba

def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1

autojit_func = numba.autojit()(increment_counts_in_matrix_from_chain)
jit_func = numba.jit(argtypes=[numba.int64[::1],numba.double[:,::1]])(increment_counts_in_matrix_from_chain)

t = np.random.randint(0,50, 500)
m1 = np.zeros((50,50))
m2 = np.zeros((50,50))
m3 = np.zeros((50,50))

And then the timings:

In [10]: %timeit increment_counts_in_matrix_from_chain(t,m1)
100 loops, best of 3: 2.38 ms per loop

In [11]: %timeit autojit_func(t,m2)
10000 loops, best of 3: 67.5 us per loop

In [12]: %timeit jit_func(t,m3)
100000 loops, best of 3: 4.93 us per loop

The autojit version does some guessing based on the run-time inputs, while for the jit function the types are specified. You have to be a little careful, since numba at these early stages doesn't communicate an error if you pass the jit function the wrong input types; it will just spit out an incorrect answer.

That said, getting 35x and 485x speed-ups with no code changes, just by adding a call to numba (which could also be applied as a decorator), is pretty impressive in my book. You could probably get similar results with cython, but it would require more boilerplate and writing a setup.py file.

I also like this solution because the code remains readable, and you can write it the way you originally thought about implementing the algorithm.

Answer 1 (score: 6)

How about something like this, exploiting np.bincount? Not super-robust, but functional. [Thanks to @Warren Weckesser for the setup.]

import numpy as np
from collections import Counter

def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1

def using_counter(chain, counts_matrix):
    counts = Counter(zip(chain[:-1], chain[1:]))
    from_, to = zip(*counts.keys())
    counts_matrix[from_, to] = counts.values()

def using_bincount(chain, counts_matrix):
    flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
    counts_matrix.flat = np.bincount(flat_coords, minlength=counts_matrix.size)

def using_bincount_reshape(chain, counts_matrix):
    flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
    return np.bincount(flat_coords, minlength=counts_matrix.size).reshape(counts_matrix.shape)

Which gives:

In [373]: t = np.random.randint(0,50, 500)
In [374]: m1 = np.zeros((50,50))
In [375]: m2 = m1.copy()
In [376]: m3 = m1.copy()

In [377]: timeit increment_counts_in_matrix_from_chain(t, m1)
100 loops, best of 3: 2.79 ms per loop

In [378]: timeit using_counter(t, m2)
1000 loops, best of 3: 924 us per loop

In [379]: timeit using_bincount(t, m3)
10000 loops, best of 3: 57.1 us per loop

[Edit]

Avoiding flat (at the cost of not working in-place) can save some time for small matrices:

In [80]: timeit using_bincount_reshape(t, m3)
10000 loops, best of 3: 22.3 us per loop

Answer 2 (score: 0)

Here's a faster approach. The idea is to count the number of occurrences of each transition, and use those counts in a vectorized update of the matrix. (I'm assuming the same transition can occur multiple times in markov_chain.) The Counter class from the collections library is used to count the occurrences of each transition.

from collections import Counter

def update_matrix(chain, counts_matrix):
    counts = Counter(zip(chain[:-1], chain[1:]))
    from_, to = zip(*counts.keys())
    counts_matrix[from_, to] += counts.values()

Timing example, in ipython:

In [64]: t = np.random.randint(0,50, 500)

In [65]: m1 = zeros((50,50))

In [66]: m2 = zeros((50,50))

In [67]: %timeit increment_counts_in_matrix_from_chain(t, m1)
1000 loops, best of 3: 895 us per loop

In [68]: %timeit update_matrix(t, m2)
1000 loops, best of 3: 504 us per loop

It's faster, but not by orders of magnitude. To really speed it up, you could consider implementing it in Cython.

Answer 3 (score: 0)

OK, here are a few ideas to tinker with, giving some slight improvement (at the cost of human understanding).

Let's start with a random vector of integers between 0 and 9, of length 3000:

import numpy as np

L = 3000
N = 10
states = np.random.randint(N, size=L)
transitions = np.zeros((N,N))

Your method's timeit performance on my machine is 11.4 ms.

The first small improvement is to avoid reading the data twice by storing it in a temporary variable:

old = states[0]
for i in range(1,len(states)):
    new = states[i]
    transitions[new,old]+=1
    old=new

That gives roughly a 10% improvement and brings the time down to 10.9 ms.

A more involved approach is to use strides:

def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

state_2 = rolling(states, 2)
for i in range(len(state_2)):
    l,m = state_2[i,0],state_2[i,1]
    transitions[m,l]+=1
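
If strides are new to you, here is a tiny sketch (toy array, for illustration only) of what the rolling helper above actually returns:

a = np.arange(6)      # array([0, 1, 2, 3, 4, 5])
rolling(a, 2)
# array([[0, 1],
#        [1, 2],
#        [2, 3],
#        [3, 4],
#        [4, 5]])
# Each row is a consecutive (old, new) pair, and nothing is copied: the second
# column is simply the same memory viewed one element further along.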

Strides let you read consecutive numbers from the array by tricking it into thinking the rows start at a different offset (OK, that's not described very well, but if you spend some time reading about strides you'll get it). This approach actually loses performance, going up to 12.2 ms, but it is the doorway to tricking the system even further.

Flattening both the transition matrix and the strided array into one-dimensional arrays speeds things up a bit more:

transitions = np.zeros(N*N)
state_2 = rolling(states, 2)
state_flat = np.sum(state_2 * np.array([1,10]), axis=1)  # flat index old + 10*new, i.e. [new, old] in the N x N matrix
for i in state_flat:
    transitions[i] += 1
transitions = transitions.reshape((N,N))

This brings it down to 7.75 ms. Not an order of magnitude, but still about 30% better anyway :).