I have set up two identical tests of broadcast matrix multiplication, one in MATLAB and one in Python. For Python I use NumPy; for MATLAB I use the mtimesx library, which uses BLAS.
MATLAB
close all; clear;
N = 1000 + 100; % a few initial runs to be trimmed off at the end
a = 100;
b = 30;
c = 40;
d = 50;
A = rand(b, c, a);
B = rand(c, d, a);
C = zeros(b, d, a);
times = zeros(1, N);
for ii = 1:N
    tic
    C = mtimesx(A,B);
    times(ii) = toc;
end
times = times(101:end) * 1e3;
plot(times);
grid on;
title(median(times));
Python
import timeit
import numpy as np
import matplotlib.pyplot as plt
N = 1000 + 100 # a few initial runs to be trimmed off at the end
a = 100
b = 30
c = 40
d = 50
A = np.arange(a * b * c).reshape([a, b, c])
B = np.arange(a * c * d).reshape([a, c, d])
C = np.empty(a * b * d).reshape([a, b, d])
times = np.empty(N)
for i in range(N):
    start = timeit.default_timer()
    C = A @ B
    times[i] = timeit.default_timer() - start
times = times[101:] * 1e3
plt.plot(times, linewidth=0.5)
plt.grid()
plt.title(np.median(times))
plt.show()
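Python is the default build installed with pip, which uses OpenBLAS. The MATLAB code runs in about 1 ms, while the Python code takes 5.8 ms. I can't figure out why, since both appear to be using BLAS.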
EDIT
From Anaconda:
In [7]: np.__config__.show()
mkl_info:
libraries = ['mkl_rt']
library_dirs = [...]
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = [...]
blas_mkl_info:
libraries = ['mkl_rt']
library_dirs = [...]
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = [...]
blas_opt_info:
libraries = ['mkl_rt']
library_dirs = [...]
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = [...]
lapack_mkl_info:
libraries = ['mkl_rt']
library_dirs = [...]
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = [...]
lapack_opt_info:
libraries = ['mkl_rt']
library_dirs = [...]
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = [...]
From NumPy installed with pip:
In [2]: np.__config__.show()
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
library_dirs = [...]
libraries = ['openblas']
language = f77
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
library_dirs = [...]
libraries = ['openblas']
language = f77
define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
library_dirs = [...]
libraries = ['openblas']
language = f77
define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
library_dirs = [...]
libraries = ['openblas']
language = f77
define_macros = [('HAVE_CBLAS', None)]
EDIT 2
I tried replacing C = A @ B with np.matmul(A, B, out=C) and got times roughly twice as bad, about 11 ms. This is really strange.
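Answer 0 (score: 5)
Your MATLAB code uses floating-point arrays, but the NumPy code uses integer arrays. This makes a significant difference in the timing. For an "apples to apples" comparison between MATLAB and NumPy, the Python/NumPy code must also use floating-point arrays.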
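To make the dtype difference concrete, here is a minimal sketch that uses the sizes from the question. The names A_int and B_int are introduced only for this illustration, and np.random.rand is one possible way to get double-precision data comparable to MATLAB's rand.

```python
import numpy as np

a, b, c, d = 100, 30, 40, 50

# Integer arrays, as in the question: np.arange without a dtype
# argument produces an integer array.
A_int = np.arange(a * b * c).reshape(a, b, c)
B_int = np.arange(a * c * d).reshape(a, c, d)

# Float arrays, comparable to MATLAB's rand(), which returns doubles.
A = np.random.rand(a, b, c)
B = np.random.rand(a, c, d)

print(A_int.dtype, A.dtype)

# The stacked product of the float arrays is itself a float array.
C = A @ B
print(C.shape, C.dtype)
```

Printing the dtype before timing anything is a cheap way to confirm that both benchmarks are multiplying the same kind of data.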
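But that is not the only significant issue. There is a deficiency in NumPy, discussed in issue 7569 (and again in issue 8957) on the NumPy GitHub site: matrix multiplication of "stacked" arrays does not use the fast BLAS routines to perform the multiplications. This means that multiplying arrays with more than two dimensions can be much slower than expected.
Multiplication of 2-D arrays does use the fast routines, so you can work around the problem by multiplying the individual 2-D arrays in a loop. Surprisingly, despite the overhead of a Python loop, in many cases this is faster than @, matmul, or einsum applied to the full stacked arrays.
Here is a variation of a function shown in the NumPy issue that does the matrix multiplication in a Python loop: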
def xmul(A, B):
    """
    Multiply stacked matrices A (with shape (s, m, n)) by stacked
    matrices B (with shape (s, n, p)) to produce an array with
    shape (s, m, p).
    Mathematically equivalent to A @ B, but faster in many cases.
    The arguments are not validated. The code assumes that A and B
    are numpy arrays with the same data type and with shapes described
    above.
    """
    # Allocate the result once, then fill it one 2-D product at a time.
    out = np.empty((A.shape[0], A.shape[1], B.shape[2]), dtype=A.dtype)
    for j in range(A.shape[0]):
        # Each 2-D product goes through the fast BLAS path.
        np.matmul(A[j], B[j], out=out[j])
    return out
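My NumPy installation also uses MKL (it is part of the Anaconda distribution). Here is a timing comparison of A @ B and xmul(A, B), using arrays of floating-point values: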
In [204]: A = np.random.rand(100, 30, 40)
In [205]: B = np.random.rand(100, 40, 50)
In [206]: %timeit A @ B
4.76 ms ± 6.37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [207]: %timeit xmul(A, B)
582 µs ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Even though xmul uses a Python loop, it takes about 1/8 the time of A @ B.
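Before relying on a looped workaround, it is worth checking that it gives the same result as the stacked product. The sketch below uses a hypothetical helper, loop_matmul, that repeats the per-slice idea so the block runs on its own; it is a sanity check, not a benchmark.

```python
import numpy as np

def loop_matmul(A, B):
    # Hypothetical stand-in for the looped approach: one 2-D matmul
    # per slice of the stack, written into a preallocated result.
    out = np.empty((A.shape[0], A.shape[1], B.shape[2]), dtype=A.dtype)
    for j in range(A.shape[0]):
        np.matmul(A[j], B[j], out=out[j])
    return out

A = np.random.rand(100, 30, 40)
B = np.random.rand(100, 40, 50)

looped = loop_matmul(A, B)
stacked = A @ B

# Compare with a tolerance, since the two code paths may round differently.
same = np.allclose(looped, stacked)
print(same)
```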
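Answer 1 (score: 1)
I think this is a memory-ordering problem. MATLAB's zeros(a, b, c) is like NumPy's zeros((a, b, c), order='F'), which is not the default.
Of course, as you have already identified, @ operates on different axes than mtimesx. To make the comparison fair, you should make sure your arrays are in MATLAB order and then transpose them to deal with the semantic difference: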
# note: `order` in reshape actually changes the resulting array data,
# not just its memory layout
A = np.arange(a * b * c).reshape([b, c, a], order='F').transpose((2, 0, 1))
B = np.arange(a * c * d).reshape([c, d, a], order='F').transpose((2, 0, 1))
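The layout claim can be inspected directly through the flags attribute of an array. This is a small sketch with the question's sizes; the names Z_c and Z_f are made up for the illustration, and dtype=float is added here so the example also uses floating-point data.

```python
import numpy as np

a, b, c = 100, 30, 40

# NumPy defaults to C (row-major) order; MATLAB stores arrays
# column-major, which NumPy calls Fortran ('F') order.
Z_c = np.zeros((a, b, c))
Z_f = np.zeros((a, b, c), order='F')
print(Z_c.flags.c_contiguous, Z_f.flags.f_contiguous)

# MATLAB-style layout: shape (b, c, a) in Fortran order, then move the
# stack axis to the front so that @ multiplies the (b, c) slices.
A = (np.arange(a * b * c, dtype=float)
       .reshape((b, c, a), order='F')
       .transpose((2, 0, 1)))
print(A.shape)

# Each slice A[j] is a column-major (b, c) matrix, as it would be in MATLAB.
print(A[0].flags.f_contiguous)
```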
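Answer 2 (score: 0)
Could you try again with the recently released NumPy 1.16? We refactored matmul to use BLAS for the inner two dimensions, which should speed up your code.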
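To re-run the comparison on a given installation, a short script along these lines reports the NumPy version next to the timing. It is only a sketch: timeit.repeat and the best-of-five choice are assumptions for the illustration, not the asker's original loop, and the numbers will vary by machine and BLAS build.

```python
import timeit
import numpy as np

# Report the version so the timing can be tied to a specific release.
print(np.__version__)

A = np.random.rand(100, 30, 40)
B = np.random.rand(100, 40, 50)

# Run 100 calls per repeat and keep the best of 5 repeats,
# converted to milliseconds per call.
calls = 100
runs = timeit.repeat(lambda: A @ B, number=calls, repeat=5)
best_ms = min(runs) / calls * 1e3
print("A @ B: %.3f ms per call" % best_ms)
```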