Question

I noticed an interesting behavior of the numpy.dot() function. My Enterprise RedHat 6.7 box has 2 Xeon CPUs with 12 cores each. I run the following code snippets and then check CPU utilization in htop.

The following code uses all the cores on my server:

import numpy as np
a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 5)
z = a.dot(b) #or use %timeit a.dot(b) if you use ipython

Edit: [screenshot of htop while running the code above]

However, as soon as I add one more dimension to b, as below, only one core is used.

import numpy as np
a = np.random.rand(1000, 1000)
b = np.random.rand(1, 1000, 5) #or np.random.rand(n, 1000, 5) where n>=1
z = a.dot(b) #or use %timeit a.dot(b) if you use ipython

Edit: [screenshot of htop while running the code above]

Here is my Python environment, as reported by import sys; sys.version:

'2.7.11 |Continuum Analytics, Inc.| (default, Dec  6 2015, 18:08:32) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]'

Here is the configuration reported by numpy.show_config():

lapack_opt_info:
libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
library_dirs = ['/opt/anaconda2/envs/portopt/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda2/envs/portopt/include']
blas_opt_info:
libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
library_dirs = ['/opt/anaconda2/envs/portopt/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda2/envs/portopt/include']
openblas_lapack_info: NOT AVAILABLE
lapack_mkl_info:
libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
library_dirs = ['/opt/anaconda2/envs/portopt/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda2/envs/portopt/include']
blas_mkl_info:
libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
library_dirs = ['/opt/anaconda2/envs/portopt/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda2/envs/portopt/include']
mkl_info:
libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
library_dirs = ['/opt/anaconda2/envs/portopt/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda2/envs/portopt/include']

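Has anyone seen this before? I tend to think this is a bug rather than by design, since adding a dimension clearly means there is more work to do. Also, is there a way to force numpy.dot to parallelize? Thanks in advance!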

Update: I found a workaround that speeds up the computation. See the snippet below.

import numpy as np
a = np.random.rand(1000, 1000) #in my program a variable
b = np.random.rand(100, 1000, 5) #b is a constant
z1 = a.dot(b)
c=b.swapaxes(0, 1).reshape(1000, 5*100) #the trick is to turn the 3d array into a 2d matrix 
z2 = a.dot(c).reshape(z1.shape) #then reshape the result to the desired shape.
np.allclose(z1, z2) #the results are identical but the computation of z2 is more than 10 times faster than that of z1 on my server.

However, I agree that in the long run we should dig into the numpy code, as @hpaulj suggests, and fix the problem once and for all (if it is a bug).
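The reshape trick from the update can be wrapped in a small helper. This is only a sketch: dot_via_2d is a name made up here, it assumes b has at least two dimensions, and whether it is actually faster depends on the BLAS your numpy is linked against. np.tensordot does a similar transpose-and-reshape internally, so it is included as a cross-check.

```python
import numpy as np

def dot_via_2d(a, b):
    """Compute a.dot(b) for 2-D a and N-d b (N >= 2) with a single 2-D product."""
    k = b.shape[-2]  # the axis of b that a.dot(b) contracts over
    # Bring the contracted axis to the front, then flatten the rest: (k, prod(rest)).
    c = np.moveaxis(b, -2, 0).reshape(k, -1)
    # a.dot(b) has shape a.shape[:-1] + b.shape[:-2] + b.shape[-1:].
    out_shape = a.shape[:-1] + b.shape[:-2] + b.shape[-1:]
    return a.dot(c).reshape(out_shape)

rng = np.random.RandomState(0)
a = rng.rand(200, 300)
b = rng.rand(7, 300, 5)

z1 = a.dot(b)                              # N-d path
z2 = dot_via_2d(a, b)                      # one 2-D matrix product
z3 = np.tensordot(a, b, axes=([1], [1]))   # cross-check
print(z1.shape, np.allclose(z1, z2), np.allclose(z1, z3))
```

Note that the reshape copies b when the moved axis makes it non-contiguous, so if b is constant (as in the update) it is worth computing c once and reusing it.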

Answer 1

I think you have to dig into the C source code, for example:

https://github.com/numpy/numpy/blob/2f7827702ef6b6ac4b318103d5c0dfe2ff6e7eb3/numpy/core/src/multiarray/cblasfuncs.c

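cblas_matrixproduct has a lot of code that checks the number of dimensions of the 2 input arrays. At the end there is a section that handles matrix * matrix multiplication: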

(PyArray_NDIM(ap1) == 2 && PyArray_NDIM(ap2) == 2)

It looks like the computational core is bracketed by NPY_BEGIN_ALLOW_THREADS and NPY_END_ALLOW_THREADS.

Your MKL code probably serves as a drop-in replacement for BLAS.

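The trick now is to find where 3d arrays are handled. Somehow it must operate on slices, so that the BLAS code still sees 2d arrays.

My guess is that the use of multiple cores is done in the BLAS/MKL code, not in the numpy code. In other words, the numpy code says (to the compiler) "it is ok to use threads and/or cores here", but not "here is how to split the work across cores based on the array dimensions".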
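One way to test this guess on your own machine is to time both calls, then repeat the run with the BLAS thread count limited, for example by setting the MKL_NUM_THREADS (for MKL) or OMP_NUM_THREADS environment variable to 1 before starting Python. If the guess is right, the 2-D timing should react to the thread count while the 3-D timing should stay roughly the same. The helper name best_time and the array sizes below are arbitrary choices for illustration.

```python
import timeit
import numpy as np

def best_time(fn, repeats=5):
    """Return the fastest of `repeats` single runs of fn(), in seconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeats))

rng = np.random.RandomState(0)
a = rng.rand(500, 500)
b2d = rng.rand(500, 5)
b3d = b2d[np.newaxis]  # same data, with a leading axis of length 1

t2d = best_time(lambda: a.dot(b2d))  # both operands 2-D
t3d = best_time(lambda: a.dot(b3d))  # second operand 3-D
print("2-D b: %.6f s   3-D b: %.6f s" % (t2d, t3d))
```

The absolute numbers depend entirely on the machine and the BLAS build, so only the comparison between runs with different thread settings is meaningful.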

https://github.com/numpy/numpy/blob/386639363233165bcba1f1ba7b10aff3c40d46b3/numpy/core/src/multiarray/multiarraymodule.c

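PyArray_MatrixProduct2 appears to be the function that decides how to call the BLAS dot function I found earlier.

The case where both arrays are at most 2d appears to be handled with: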

#if defined(HAVE_CBLAS)
if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
        (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
         NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
    return cblas_matrixproduct(typenum, ap1, ap2, out);
}
#endif
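The condition in the excerpt above can be mirrored in Python to predict which path a given pair of arrays would take. This is a rough sketch for the numpy revision linked above, not numpy's own code: takes_cblas_path is a hypothetical helper, and np.result_type is used here as a stand-in for how typenum is derived in C.

```python
import numpy as np

# Scalar types matching NPY_FLOAT, NPY_DOUBLE, NPY_CFLOAT and NPY_CDOUBLE.
BLAS_TYPES = (np.float32, np.float64, np.complex64, np.complex128)

def takes_cblas_path(a, b):
    """Mirror the excerpt: both operands at most 2-D and a BLAS-supported dtype."""
    typenum = np.result_type(a, b)  # assumption: approximates the C-level typenum
    return a.ndim <= 2 and b.ndim <= 2 and typenum.type in BLAS_TYPES

a = np.random.rand(4, 4)
print(takes_cblas_path(a, np.random.rand(4, 5)))      # both 2-D doubles
print(takes_cblas_path(a, np.random.rand(1, 4, 5)))   # 3-D second operand
print(takes_cblas_path(a.astype(int), a.astype(int))) # integer dtype
```

Only the first call satisfies the condition, which matches the observation in the question that adding a leading axis to b changes the behavior.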

Otherwise the code must use the following (after making sure the dimensions are compatible):

NPY_BEGIN_THREADS_DESCR(PyArray_DESCR(ap2));
while (it1->index < it1->size) {
    while (it2->index < it2->size) {
        dot(it1->dataptr, is1, it2->dataptr, is2, op, l, ret);
        op += os;
        PyArray_ITER_NEXT(it2);
    }
    PyArray_ITER_NEXT(it1);
    PyArray_ITER_RESET(it2);
}
NPY_END_THREADS_DESCR(PyArray_DESCR(ap2));

where dot = PyArray_DESCR(ret)->f->dotfunc; has been defined according to the dtype.
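To make the structure of that nested loop concrete, here is a rough Python emulation of it. It is an illustration only, not numpy's implementation: dot_fallback is a made-up name, np.dot on two 1-D vectors stands in for dotfunc, and b is assumed to have at least two dimensions.

```python
import numpy as np

def dot_fallback(a, b):
    """Emulate the iterator loop: one 1-D dot product per output element."""
    l = a.shape[-1]                               # length of each 1-D dot product
    rows = a.reshape(-1, l)                       # it1: vectors along a's last axis
    cols = np.moveaxis(b, -2, -1).reshape(-1, l)  # it2: vectors along b's second-to-last axis
    out = np.empty((rows.shape[0], cols.shape[0]), dtype=np.result_type(a, b))
    n_calls = 0
    for i, row in enumerate(rows):
        for j, col in enumerate(cols):
            out[i, j] = np.dot(row, col)          # stand-in for dot(...) in the C loop
            n_calls += 1
    out_shape = a.shape[:-1] + b.shape[:-2] + b.shape[-1:]
    return out.reshape(out_shape), n_calls

rng = np.random.RandomState(0)
a = rng.rand(6, 4)
b = rng.rand(3, 4, 5)
z, n_calls = dot_fallback(a, b)
print(z.shape, n_calls, np.allclose(z, a.dot(b)))
```

Seen this way, the N-d case is a long sequence of small vector-vector products issued one at a time rather than a single matrix-matrix call, which may be why the BLAS library has little opportunity to spread the work over several cores.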

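I'm not sure I have answered your question, but clearly the code is complex, and simple reasoning about how you or I would divide up the task does not apply.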
