Question

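I am using numpy, and my model involves intensive dense matrix-matrix multiplication. To speed it up, I use the multi-threaded OpenBLAS library to parallelize the numpy.dot function.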

My setup is as follows:

OS: CentOS 6.2 server, #CPU = 12, #MEM = 96GB
Python version: Python 2.7.6
numpy: numpy 1.8.0
OpenBLAS + IntelMKL

$ OMP_NUM_THREADS=8 python test_mul.py

The code below is taken from https://gist.github.com/osdf/

test_mul.py:

import numpy
import sys
import timeit

try:
    import numpy.core._dotblas
    print 'FAST BLAS'
except ImportError:
    print 'slow blas'

print "version:", numpy.__version__
print "maxint:", sys.maxint
print

x = numpy.random.random((1000,1000))

setup = "import numpy; x = numpy.random.random((1000,1000))"
count = 5

t = timeit.Timer("numpy.dot(x, x.T)", setup=setup)
print "dot:", t.timeit(count)/count, "sec"

When I run OMP_NUM_THREADS=1 python test_mul.py, the result is

dot: 0.200172233582 sec

OMP_NUM_THREADS=2

dot: 0.103047609329 sec

OMP_NUM_THREADS=4

dot: 0.0533880233765 sec

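Everything goes well.

However, when I set OMP_NUM_THREADS=8, the code starts to "work only occasionally".

Sometimes it works; sometimes it does not even run and gives me a core dump.

When OMP_NUM_THREADS > 10, the code seems to break every time. I am wondering what is happening here. Is there something like a MAXIMUM number of threads that each process can use? Given that I have 12 CPUs in my machine, can I raise that limit?

Thanks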

Answer 1

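First of all, I don't really understand what you mean by 'OpenBLAS + IntelMKL'. Both of these are BLAS libraries, and numpy should only link against one of them at runtime. You should check which of the two numpy is actually using. You can do this as follows: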

$ ldd <path-to-site-packages>/numpy/core/_dotblas.so

Update: numpy/core/_dotblas.so was removed in numpy v1.10, but you can check the linkage of numpy/core/multiarray.so instead.

For example, I link against OpenBLAS:

...
libopenblas.so.0 => /opt/OpenBLAS/lib/libopenblas.so.0 (0x00007f788c934000)
...
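The same check can be scripted. The following is only a minimal sketch: the helper name find_blas_libs and the list of name hints are made up for illustration, and it assumes ldd prints lines in the usual "name => path (address)" form. It scans ldd output for libraries whose names look like a BLAS implementation.

```python
import re

# Substrings that commonly appear in BLAS shared-library names.
# This list is an illustrative assumption, not an exhaustive catalogue.
BLAS_HINTS = ("openblas", "mkl", "atlas", "blas")


def find_blas_libs(ldd_output):
    """Return (library name, resolved path) pairs that look like BLAS libraries."""
    found = []
    for line in ldd_output.splitlines():
        # Typical ldd line: "libfoo.so.0 => /path/to/libfoo.so.0 (0x...)"
        match = re.match(r"\s*(\S+)\s+=>\s+(\S+)", line)
        if match and any(hint in match.group(1).lower() for hint in BLAS_HINTS):
            found.append((match.group(1), match.group(2)))
    return found


# The line from the ldd output shown above.
sample = "libopenblas.so.0 => /opt/OpenBLAS/lib/libopenblas.so.0 (0x00007f788c934000)"
print(find_blas_libs(sample))
```

If more than one BLAS implementation shows up in the result, that would be worth investigating before tuning thread counts.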

If you are indeed linking against OpenBLAS, did you build it from source? If you did, you should see that Makefile.rule contains a commented-out option:

...
# You can define maximum number of threads. Basically it should be
# less than actual number of cores. If you don't specify one, it's
# automatically detected by the the script.
# NUM_THREADS = 24
...

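By default OpenBLAS will try to set the maximum number of threads automatically, but if it is not detecting this correctly you could try uncommenting and editing this line yourself.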
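On the Python side, one defensive option is to avoid requesting more threads than the machine or the build can support. The sketch below is an assumption-laden illustration, not part of OpenBLAS or numpy: choose_thread_count and its compiled_max parameter are hypothetical names, and compiled_max stands for whatever NUM_THREADS your build was compiled with, which you have to supply yourself. It also assumes the BLAS library reads the environment variables when it is loaded, so they are set before numpy is imported.

```python
import multiprocessing
import os


def choose_thread_count(requested, compiled_max=None):
    """Clamp a requested thread count to the core count and an optional build limit.

    compiled_max is a hypothetical parameter: pass the NUM_THREADS value your
    OpenBLAS build was compiled with, if you know it.
    """
    cores = multiprocessing.cpu_count()
    limit = cores if compiled_max is None else min(cores, compiled_max)
    return max(1, min(requested, limit))


threads = choose_thread_count(8)

# Set the variables before numpy (and therefore the BLAS library) is loaded.
os.environ["OMP_NUM_THREADS"] = str(threads)
os.environ["OPENBLAS_NUM_THREADS"] = str(threads)

# import numpy  # import numpy only after the environment has been set
print("using %d BLAS thread(s)" % threads)
```

This does not raise any compiled-in limit; it only keeps the request within bounds. Raising the limit itself still means rebuilding OpenBLAS with a larger NUM_THREADS.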

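Also, bear in mind that you will probably see diminishing returns in performance as you add threads. Unless your arrays are very large, using more than 6 threads is unlikely to give much of a boost, because of the increased overhead of thread creation and management.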
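To make the scaling concrete, the timings reported in the question can be turned into speedup and parallel efficiency figures. This is just a small worked sketch: the function name scaling_table is made up for illustration, and efficiency is taken to mean speedup divided by thread count.

```python
# Seconds per numpy.dot call, as reported in the question, keyed by thread count.
timings = {1: 0.200172233582, 2: 0.103047609329, 4: 0.0533880233765}


def scaling_table(timings):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread run."""
    base = timings[1]
    rows = []
    for n in sorted(timings):
        speedup = base / timings[n]
        rows.append((n, speedup, speedup / n))
    return rows


for n, speedup, efficiency in scaling_table(timings):
    print("%2d threads: speedup %.2fx, efficiency %.0f%%" % (n, speedup, efficiency * 100))
```

For these numbers the speedup is roughly 1.94x on 2 threads and 3.75x on 4, so efficiency is already slipping from about 97% to about 94%. Adding rows for 6 or 8 threads from your own measurements would show where the extra threads stop paying off.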

numpy OpenBLAS: setting the maximum number of threads
