Question

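Updated with solution

I am having a hard time understanding Pool.

I want to run an analysis on 12 independent data sets at once. The individual analyses do not depend on each other and do not share data, so I expect close to a 12x speedup if I can run them in parallel.

However, using Pool.map, I get nowhere near that performance. To create a case where I would expect close to a 12x speedup, I wrote a very simple function that consists of a for loop and only does arithmetic based on the loop variable. No results are stored and no data is loaded. I did this because another thread here mentioned the L2 cache limiting performance, so I tried to pare the problem down to one with no data, just pure computation.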

import multiprocessing as mp
import time as _tm


NUM_CORE = 12         # set to the number of cores you want to use
NUM_COPIES_2_RUN = 12 # number of times we want to run the function
print("NUM_CORE       %d" % NUM_CORE)
print("NUM_COPIES     %d" % NUM_COPIES_2_RUN)

####################################################
###############################  FUNCTION DEFINITION
####################################################
def run_me(args):
    """
    function to be run NUM_COPIES_2_RUN times  (identical)
    """
    num = args[0]
    tS  = args[1]

    t1 = _tm.time()
    for i in range(5000000):
        v = ((i+i)*(i*3))/100000.

    t2 = _tm.time()
    print("work %(wn)d  %(t2).3f - %(t1).3f  = %(dt).3f" % {"wn" : num, "t1" : (t1-tS), "t2" : (t2-tS), "dt" : (t2-t1)})        

####################################################
##################################  serial execution
####################################################
print("Running %d copies of the same code in serial execution" % NUM_COPIES_2_RUN)
tStart_serial = _tm.time()

for i in range(NUM_COPIES_2_RUN):
    run_me([i, tStart_serial])

tEnd_serial   = _tm.time()

print("total time:  %.3f" % (tEnd_serial - tStart_serial))

####################################################
##############################################  Pool
####################################################
print("Running %d copies of the same code using Pool.map_async" % NUM_COPIES_2_RUN)
tStart_pool   = _tm.time()

pool = mp.Pool(NUM_CORE)
args = []
for n in range(NUM_COPIES_2_RUN):
    args.append([n, tStart_pool])

pool.map_async(run_me, args)
pool.close()
pool.join()

tEnd_pool     = _tm.time()    

print("total time:  %.3f" % (tEnd_pool - tStart_pool))

When I run it on a 16-core Linux machine, I get (parameter set 1)

NUM_CORE       12
NUM_COPIES     12
Running 12 copies of the same code in serial execution
work 0  0.818 - 0.000  = 0.818
work 1  1.674 - 0.818  = 0.855
work 2  2.499 - 1.674  = 0.826
work 3  3.308 - 2.499  = 0.809
work 4  4.128 - 3.308  = 0.820
work 5  4.937 - 4.128  = 0.809
work 6  5.747 - 4.937  = 0.810
work 7  6.558 - 5.747  = 0.811
work 8  7.368 - 6.558  = 0.810
work 9  8.172 - 7.368  = 0.803
work 10  8.991 - 8.172  = 0.819
work 11  9.799 - 8.991  = 0.808
total time:  9.799
Running 12 copies of the same code using Pool.map
work 1  0.990 - 0.018  = 0.972
work 8  0.991 - 0.019  = 0.972
work 5  0.992 - 0.019  = 0.973
work 7  0.992 - 0.019  = 0.973
work 3  1.886 - 0.019  = 1.867
work 6  1.886 - 0.019  = 1.867
work 4  2.288 - 0.019  = 2.269
work 9  2.290 - 0.019  = 2.270
work 0  2.293 - 0.018  = 2.274
work 11  2.293 - 0.023  = 2.270
work 2  2.294 - 0.019  = 2.275
work 10  2.332 - 0.019  = 2.313
total time:  2.425

When I change the parameters (parameter set 2) and run it again, I get

NUM_CORE       12
NUM_COPIES     6
Running 6 copies of the same code in serial execution
work 0  0.798 - 0.000  = 0.798
work 1  1.579 - 0.798  = 0.780
work 2  2.355 - 1.579  = 0.776
work 3  3.131 - 2.355  = 0.776
work 4  3.908 - 3.131  = 0.777
work 5  4.682 - 3.908  = 0.774
total time:  4.682
Running 6 copies of the same code using Pool.map_async
work 1  0.921 - 0.015  = 0.906
work 4  0.922 - 0.015  = 0.907
work 2  0.922 - 0.015  = 0.908
work 5  0.932 - 0.015  = 0.917
work 3  2.099 - 0.015  = 2.085
work 0  2.101 - 0.014  = 2.086
total time:  2.121

With yet another set of parameters (parameter set 3),

NUM_CORE       4
NUM_COPIES     12
Running 12 copies of the same code in serial execution
work 0  0.784 - 0.000  = 0.784
work 1  1.564 - 0.784  = 0.780
work 2  2.342 - 1.564  = 0.778
work 3  3.121 - 2.342  = 0.779
work 4  3.901 - 3.121  = 0.779
work 5  4.682 - 3.901  = 0.782
work 6  5.462 - 4.682  = 0.780
work 7  6.243 - 5.462  = 0.780
work 8  7.024 - 6.243  = 0.781
work 9  7.804 - 7.024  = 0.780
work 10  8.578 - 7.804  = 0.774
work 11  9.360 - 8.578  = 0.782
total time:  9.360
Running 12 copies of the same code using Pool.map_async
work 3  0.862 - 0.006  = 0.856
work 1  0.863 - 0.006  = 0.857
work 5  1.713 - 0.863  = 0.850
work 4  1.713 - 0.863  = 0.851
work 0  2.108 - 0.006  = 2.102
work 2  2.112 - 0.006  = 2.106
work 6  2.586 - 1.713  = 0.873
work 7  2.587 - 1.713  = 0.874
work 8  3.332 - 2.109  = 1.223
work 9  3.333 - 2.113  = 1.220
work 11  3.456 - 2.587  = 0.869
work 10  3.456 - 2.586  = 0.870
total time:  3.513

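This has me completely confused. With parameter set 2 in particular, I am allowing 12 cores to be used for 6 independent threads of execution, yet my speedup is only 2x.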
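To put numbers on this, the speedup for each parameter set is the serial total divided by the pool total, and dividing that by the number of tasks that could run at once gives a rough parallel efficiency. A small sketch using only the totals printed above (the helper name `speedup` and the `runs` table are hypothetical, introduced for illustration):

```python
# Totals copied from the three runs above:
# (label, workers, copies, serial seconds, pool seconds)
runs = [
    ("set 1", 12, 12, 9.799, 2.425),
    ("set 2", 12, 6, 4.682, 2.121),
    ("set 3", 4, 12, 9.360, 3.513),
]


def speedup(serial_s, pool_s):
    """Speedup of the pool run relative to the serial run."""
    return serial_s / pool_s


for label, workers, copies, serial_s, pool_s in runs:
    s = speedup(serial_s, pool_s)
    # At most min(workers, copies) tasks can be running at the same time.
    ideal = min(workers, copies)
    print("%s: speedup %.2fx, ideal %dx, efficiency %.0f%%"
          % (label, s, ideal, 100.0 * s / ideal))
```

On these figures the three sets come out at roughly 4.0x, 2.2x and 2.7x, all well short of the ideal.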

What is going on? I have also tried both map() and map_async(), but there seems to be no difference in performance.
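For reference, `map()` blocks until every result is ready, while `map_async()` returns an `AsyncResult` immediately, and the results (or an exception raised in a worker) only surface when `.get()` is called on it. Neither changes how tasks are distributed across the workers, which fits with seeing no timing difference. A minimal sketch, where `square` and `compare` are hypothetical names and `square` merely stands in for `run_me`:

```python
import multiprocessing as mp


def square(x):
    """Toy task standing in for run_me (hypothetical)."""
    return x * x


def compare(pool, items):
    """Run the same work through map() and map_async() on the given pool."""
    blocking = pool.map(square, items)       # returns a list once all tasks finish
    pending = pool.map_async(square, items)  # returns an AsyncResult right away
    deferred = pending.get()                 # blocks here; re-raises worker errors
    return blocking, deferred


if __name__ == "__main__":
    with mp.Pool(4) as pool:
        print(compare(pool, range(6)))
```

Note that the script above discards the return value of `map_async()`, so an exception inside `run_me` would go unnoticed; keeping the `AsyncResult` and calling `.get()` avoids that.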

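UPDATE:

So there were several things going on here:

1) I have fewer cores than I realized. I thought I had 16 cores, but I have only 8 physical cores and 16 logical cores, because hyperthreading is turned on.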

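2) Even when I have, say, only 4 independent processes to run on those 8 physical cores, I do not get the expected speedup. I would expect something like 3.5x in that case. When I run the test above many times, I get that kind of speedup perhaps 10% of the time. The rest of the time I get anywhere from 1.5x to 3.5x, which seems strange: I have more than enough cores to do the computation, yet most of the time the parallelization performs poorly. This would make sense if there were lots of other processes on the system, but I am the only user and nothing else computationally intensive is running.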

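3) It turns out that having hyperthreading turned on was causing my hardware to appear under-utilized. If I turn hyperthreading off

https://www.golinuxhub.com/2018/01/how-to-disable-or-enable-hyper.html

I get the expected ~3.5x speedup every time I run the script posted above, which is what I was hoping for.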
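The mismatch in point 1 can be checked from Python. `os.cpu_count()` reports logical CPUs, so with hyperthreading on it would return 16 on a machine like this one. On Linux, the physical count can be estimated by counting distinct (`physical id`, `core id`) pairs in `/proc/cpuinfo`. The sketch below assumes that file layout (typical of x86 Linux); `count_physical_cores` is a hypothetical helper, and it returns 0 when those fields are absent.

```python
import os


def count_physical_cores(cpuinfo_text):
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    # Each logical CPU is described by one block; blocks are separated by blank lines.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "physical id" in fields and "core id" in fields:
            cores.add((fields["physical id"], fields["core id"]))
    return len(cores)


if __name__ == "__main__":
    print("logical CPUs  :", os.cpu_count())
    try:
        with open("/proc/cpuinfo") as f:
            print("physical cores:", count_physical_cores(f.read()))
    except OSError:
        print("physical cores: /proc/cpuinfo is not available on this system")
```

If the two numbers differ, sizing the pool to the physical count rather than the logical one is probably the safer starting point for CPU-bound work like the loop above.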

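PS) The actual code that does my analysis is written in Python, with the numerically intensive parts written in Cython. It also uses numpy. My numpy is linked against the Math Kernel Library (MKL), which can make use of multiple cores. In a case like mine, where several independent processes need to run in parallel, it makes no sense to let MKL use multiple cores and thereby interrupt a thread running on another core, especially since calls such as dot are not expensive enough to overcome the overhead of using multiple cores.

I thought this might have been the original problem: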

Limit number of threads in numpy

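export MKL_NUM_THREADS=1

did improve performance somewhat, but not as much as I had hoped, which is what prompted me to ask this question here (where, for simplicity, I avoided numpy altogether).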
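The same limit can also be applied from inside the script rather than the shell. The usual advice is to set the variable before numpy is first imported, so that MKL sees it when it starts up. A minimal sketch: `MKL_NUM_THREADS` comes from the text above, while `limit_math_threads` is a hypothetical helper and setting `OMP_NUM_THREADS` as well is an assumption meant to cover OpenMP-based builds.

```python
import os


def limit_math_threads(n=1, env=os.environ):
    """Pin MKL (and, by assumption, OpenMP) to n threads.

    Call this before numpy is imported in the process.
    """
    for name in ("MKL_NUM_THREADS", "OMP_NUM_THREADS"):
        env[name] = str(n)


limit_math_threads(1)
# import numpy as np  # import numpy only after the limit is in place
```

Worker processes started afterwards by `multiprocessing` inherit the parent's environment, so the limit should carry over to each of them.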

Answer 1

My guess is that you are maxing out your CPU in the for loop:

for i in range(5000000):
    v = ((i+i)*(i*3))/100000.

It seems counterintuitive, since you have 16 cores and are maxing out well below that, but what happens when you try a function like time.sleep(1) for each core? Does it take 16 s to run serially and 1 s when run on each core? If so, it seems to come down to CPU limits or the internals of Python's Pool library.

Here is an example on my machine using 8 cores that cuts the time by a factor of 8 with the simplest example I could think of.
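The example itself is missing from this copy of the answer. As a stand-in, here is a minimal sketch of the kind of check suggested above, assuming a hypothetical `nap` task that only sleeps. If the pool is working, N sleeps spread over N workers should take about as long as a single sleep, because sleeping does not compete for CPU. The names `nap` and `timed`, and the 0.5 s duration, are illustrative choices rather than the answerer's code.

```python
import multiprocessing as mp
import time


def nap(seconds):
    """Sleep instead of computing, so the task uses almost no CPU."""
    time.sleep(seconds)
    return seconds


def timed(map_fn, seconds, copies):
    """Run `copies` naps through map_fn and return the elapsed wall time."""
    start = time.perf_counter()
    list(map_fn(nap, [seconds] * copies))
    return time.perf_counter() - start


if __name__ == "__main__":
    workers = 8  # assumption: matches the 8 cores mentioned in the answer
    serial = timed(map, 0.5, workers)
    with mp.Pool(workers) as pool:
        parallel = timed(pool.map, 0.5, workers)
    print("serial %.2fs  pool %.2fs  speedup %.1fx"
          % (serial, parallel, serial / parallel))
```

If the sleep test scales cleanly but the arithmetic loop does not, that points at the CPU side (for example, hyperthreaded cores being shared) rather than at Pool itself.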

Python multiprocessing on a CPU-bound Pool, not getting close to the expected speedup
