I am trying to find the fastest way to compute the dot product of two complex64 matrices in Python.
import numpy as np

B = np.complex64(np.random.rand(3014, 4) + 1j*np.random.rand(3014, 4))
A = np.complex64(np.random.rand(32, 3014) + 1j*np.random.rand(32, 3014))
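One thing worth confirming up front (an aside of mine, not part of the original timings): as long as both operands are complex64, np.dot stays in single precision and can dispatch to BLAS's cgemm, while mixing dtypes silently upcasts the result to complex128 and roughly doubles the work. A minimal sketch with the same shapes (the seed is arbitrary, for reproducibility only):

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed, reproducibility only
A = (rng.rand(32, 3014) + 1j * rng.rand(32, 3014)).astype(np.complex64)
B = (rng.rand(3014, 4) + 1j * rng.rand(3014, 4)).astype(np.complex64)

C = np.dot(A, B)
print(C.dtype)  # complex64: the product stays in single precision
# mixing in a complex128 operand upcasts the whole product:
print(np.dot(A.astype(np.complex128), B).dtype)  # complex128
```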
NumPy does best:
%timeit np.dot(A, B)
10000 loops, best of 3: 165 µs per loop
SciPy gives the same timing:
%timeit scipy.dot(A, B)
10000 loops, best of 3: 165 µs per loop
NumPy's einsum is much slower:
%timeit np.einsum('ij,jk->ik', A, B)
1000 loops, best of 3: 1.2 ms per loop
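A caveat on the einsum timing (my addition, and version-dependent): since NumPy 1.12, einsum accepts an `optimize` flag that lets it hand contractions like this one off to BLAS, which may close most of the gap. A hedged sketch:

```python
import numpy as np

A = np.ones((32, 3014), dtype=np.complex64)
B = np.ones((3014, 4), dtype=np.complex64)

# optimize=True lets einsum evaluate the contraction via BLAS where possible
C = np.einsum('ij,jk->ik', A, B, optimize=True)
assert np.allclose(C, np.dot(A, B))
```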
Writing out the dot product in a for loop (from Comparing Python, Numpy, Numba and C++ for matrix multiplication) is no better, using my naive Numba implementation with just the @jit() decorator:
from numba import jit
import numpy as np

@jit()
def dot_py(A, B):
    m, n = A.shape
    p = B.shape[1]
    C = np.zeros((m, p), dtype=complex)
    for i in range(0, m):
        for j in range(0, p):
            for k in range(0, n):
                C[i, j] += A[i, k] * B[k, j]
    return C
The slowest run took 153.35 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.02 ms per loop
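One likely culprit in that version (my guess, not verified in the original measurement): `C` is allocated with `dtype=complex`, i.e. complex128, so every accumulation upcasts the complex64 inputs. A sketch of the same loop with the accumulator kept in the input dtype, runnable here without Numba on tiny inputs as a sanity check:

```python
import numpy as np

def dot_py(A, B):
    m, n = A.shape
    p = B.shape[1]
    # allocate the result in the *input* dtype; dtype=complex would give
    # complex128 and force an upcast on every accumulation
    C = np.zeros((m, p), dtype=A.dtype)
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.arange(6).astype(np.complex64).reshape(2, 3)
B = np.arange(6).astype(np.complex64).reshape(3, 2)
assert dot_py(A, B).dtype == np.complex64
assert np.allclose(dot_py(A, B), np.dot(A, B))
```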
In general, Numba has never sped up any NumPy-related function for me (am I doing something wrong?).
Only with Theano do I get close:
import theano
import theano.tensor as T
x = T.cmatrix( 'x')
y = T.cmatrix( 'y')
z = theano.tensor.dot(x, y)
resultdot = theano.function([x, y], z)
%timeit resultdot(A, B)
10000 loops, best of 3: 182 µs per loop
numexpr does not apply here, does it?
Applying np.dot to blocks that are explicitly read into core memory from a memory-mapped array (from Efficient dot products of large memory-mapped arrays) does not give a big speedup either:
def _block_slices(dim_size, block_size):
    """Generator that yields slice objects for indexing into
    sequential blocks of an array along a particular axis
    """
    count = 0
    while True:
        yield slice(count, count + block_size, 1)
        count += block_size
        if count > dim_size:
            return  # raising StopIteration inside a generator is an error since PEP 479

def blockwise_dot(A, B, max_elements=int(2**27), out=None):
    """
    Computes the dot product of two matrices in a block-wise fashion.
    Only blocks of `A` with a maximum size of `max_elements` will be
    processed simultaneously.
    """
    m, n = A.shape
    n1, o = B.shape
    if n1 != n:
        raise ValueError('matrices are not aligned')
    if A.flags.f_contiguous:
        # prioritize processing as many columns of A as possible
        max_cols = max(1, max_elements // m)
        max_rows = max_elements // max_cols
    else:
        # prioritize processing as many rows of A as possible
        max_rows = max(1, max_elements // n)
        max_cols = max_elements // max_rows
    if out is None:
        out = np.empty((m, o), dtype=np.result_type(A, B))
    elif out.shape != (m, o):
        raise ValueError('output array has incorrect dimensions')
    for mm in _block_slices(m, max_rows):
        out[mm, :] = 0
        for nn in _block_slices(n, max_cols):
            A_block = A[mm, nn].copy()  # copy to force a read
            out[mm, :] += np.dot(A_block, B[nn, :])
            del A_block
    return out
%timeit blockwise_dot(A, B, max_elements=int(2**27), out=None)
1000 loops, best of 3: 252 µs per loop
In short: is there any faster way to compute this dot product with any Python library? It does not even use multiple cores; htop shows only one core active during the NumPy computation. Shouldn't that be different when NumPy is linked against a multithreaded BLAS?
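To answer that last question for a specific machine, it may help to inspect which BLAS NumPy was built against: np.show_config() lists the linked libraries. Note also that threaded BLAS implementations (OpenBLAS, MKL) typically only spin up multiple cores for much larger products than (32, 3014) x (3014, 4), so one busy core here is not by itself evidence of a non-BLAS build. A sketch:

```python
import numpy as np

# Prints which BLAS/LAPACK NumPy is linked against (output is build-dependent)
np.show_config()

# Whatever the threading behaviour, the complex64 product itself is cheap
# and correct at this size:
A = np.ones((32, 3014), dtype=np.complex64)
B = np.ones((3014, 4), dtype=np.complex64)
C = A.dot(B)
print(C.shape, C.dtype)
```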