How to speed up complex matrix dot products in Python

Date: 2016-05-10 11:25:42

Tags: python performance numpy theano matrix-multiplication

I am trying to find a fast way to compute the dot product of two complex64 matrices in Python:

B = np.complex64(np.random.rand(3014, 4) + 1j*np.random.rand(3014, 4))
A = np.complex64(np.random.rand(32, 3014) + 1j*np.random.rand(32, 3014))

Numpy gives the best result:

%timeit np.dot(A, B)
10000 loops, best of 3: 165 µs per loop

Scipy is the same:

%timeit scipy.dot(A, B)
10000 loops, best of 3: 165 µs per loop
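(For reference, scipy also exposes the underlying BLAS routine directly; calling cgemm skips a little Python-level dispatch, though at this matrix size the difference should be marginal. A minimal sketch, not re-timed here:)

```python
import numpy as np
from scipy.linalg import blas

A = (np.random.rand(32, 3014) + 1j * np.random.rand(32, 3014)).astype(np.complex64)
B = (np.random.rand(3014, 4) + 1j * np.random.rand(3014, 4)).astype(np.complex64)

# cgemm computes alpha * A @ B in single-precision complex -- the same
# BLAS routine np.dot dispatches to for complex64 inputs.
C = blas.cgemm(1.0, A, B)
```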

Numpy einsum is much slower:

%timeit np.einsum('ij,jk->ik', A, B)
1000 loops, best of 3: 1.2 ms per loop
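Since NumPy 1.12, einsum can at least be told to dispatch the contraction to BLAS via optimize=True, which should close most of this gap to np.dot (a sketch; I have not re-run the timing above):

```python
import numpy as np

A = (np.random.rand(32, 3014) + 1j * np.random.rand(32, 3014)).astype(np.complex64)
B = (np.random.rand(3014, 4) + 1j * np.random.rand(3014, 4)).astype(np.complex64)

# optimize=True lets einsum rewrite the contraction as tensordot/BLAS
# calls instead of running its general-purpose (and slower) C loop.
C = np.einsum('ij,jk->ik', A, B, optimize=True)
```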

Writing the dot product as an explicit for loop (from Comparing Python, Numpy, Numba and C++ for matrix multiplication) is no better (using my naive Numba implementation with just the @jit() decorator):

from numba import jit
import numpy as np

@jit()
def dot_py(A, B):
    m, n = A.shape
    p = B.shape[1]

    # Note: dtype=complex is complex128, so the complex64 inputs are
    # upcast inside the loop.
    C = np.zeros((m, p), dtype=complex)

    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C


The slowest run took 153.35 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.02 ms per loop

In general, Numba has not sped up any Numpy-related function I have used (am I doing something wrong?).

Only with Theano do I get close:

import theano
import theano.tensor as T

x = T.cmatrix('x')
y = T.cmatrix('y')
z = T.dot(x, y)
resultdot = theano.function([x, y], z)
%timeit resultdot(A, B)

10000 loops, best of 3: 182 µs per loop

numexpr is not applicable here, is it?

Using a function that applies np.dot to blocks that are explicitly read into core memory from a memory-mapped array (from Efficient dot products of large memory-mapped arrays) does not give a big speedup either:

def _block_slices(dim_size, block_size):
    """Generator that yields slice objects for indexing into
    sequential blocks of an array along a particular axis
    """
    count = 0
    # A plain `return` ends the generator; `raise StopIteration` inside a
    # generator is a RuntimeError on Python 3.7+ (PEP 479).
    while count < dim_size:
        yield slice(count, count + block_size, 1)
        count += block_size

def blockwise_dot(A, B, max_elements=int(2**27), out=None):
    """
    Computes the dot product of two matrices in a block-wise fashion. 
    Only blocks of `A` with a maximum size of `max_elements` will be 
    processed simultaneously.
    """

    m,  n = A.shape
    n1, o = B.shape

    if n1 != n:
        raise ValueError('matrices are not aligned')

    if A.flags.f_contiguous:
        # prioritize processing as many columns of A as possible
        # (integer division: block sizes must be ints to build slices)
        max_cols = max(1, max_elements // m)
        max_rows = max_elements // max_cols

    else:
        # prioritize processing as many rows of A as possible
        max_rows = max(1, max_elements // n)
        max_cols = max_elements // max_rows

    if out is None:
        out = np.empty((m, o), dtype=np.result_type(A, B))
    elif out.shape != (m, o):
        raise ValueError('output array has incorrect dimensions')

    for mm in _block_slices(m, max_rows):
        out[mm, :] = 0
        for nn in _block_slices(n, max_cols):
            A_block = A[mm, nn].copy()  # copy to force a read
            out[mm, :] += np.dot(A_block, B[nn, :])
            del A_block

    return out




%timeit blockwise_dot(A, B, max_elements=int(2**27), out=None)
1000 loops, best of 3: 252 µs per loop
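As a sanity check on the block-wise idea itself (a self-contained sketch that accumulates partial products over blocks of the shared dimension), the result should match np.dot up to float32 accumulation order:

```python
import numpy as np

A = (np.random.rand(32, 3014) + 1j * np.random.rand(32, 3014)).astype(np.complex64)
B = (np.random.rand(3014, 4) + 1j * np.random.rand(3014, 4)).astype(np.complex64)

block = 512  # arbitrary block size along the shared axis
C = np.zeros((A.shape[0], B.shape[1]), dtype=np.result_type(A, B))
for start in range(0, A.shape[1], block):
    nn = slice(start, start + block)   # slicing clamps the final partial block
    C += A[:, nn] @ B[nn, :]
```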

In short: is there a faster way to compute the dot product using any Python library? It does not even use multiple cores; htop shows only one core active during the Numpy computation. Shouldn't this be different when NumPy is linked against BLAS etc.?
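Whether np.dot can use multiple cores at all depends on which BLAS NumPy was built against (the reference BLAS is single-threaded; OpenBLAS and MKL are threaded, typically controlled via OPENBLAS_NUM_THREADS or MKL_NUM_THREADS). The linked libraries can be inspected directly; note that a 32×3014 by 3014×4 product may simply be too small for threading to pay off:

```python
import numpy as np

# Print the BLAS/LAPACK libraries this NumPy build is linked against;
# look for openblas or mkl rather than the reference blas.
np.__config__.show()
```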

0 Answers:

There are no answers yet.