Question

我正在尝试并行化一个令人尴尬的并行for循环（previously asked here）并确定符合我的参数的this implementation：

    with Manager() as proxy_manager:
        shared_inputs = proxy_manager.list([datasets, train_size_common, feat_sel_size, train_perc,
                                            total_test_samples, num_classes, num_features, label_set,
                                            method_names, pos_class_index, out_results_dir, exhaustive_search])
        partial_func_holdout = partial(holdout_trial_compare_datasets, *shared_inputs)

        with Pool(processes=num_procs) as pool:
            cv_results = pool.map(partial_func_holdout, range(num_repetitions))

我需要使用proxy object（进程之间共享）的原因是共享代理列表datasets中的第一个元素，它是一个大对象列表（每个大约200-300MB）。此datasets列表通常包含5-25个元素。我通常需要在HPC集群上运行该程序。

这是一个问题，当我用32个进程和50GB内存运行这个程序（num_repetitions = 200，数据集是10个对象的列表，每个250MB）时，我看不到加速，即使是16倍（有32个并行进程）。我不明白为什么 - 任何线索？任何明显的错误，或错误的选择？我在哪里可以改进这个实现？任何替代方案？

我确信之前已经讨论过这个问题，原因可能多种多样，而且非常具体，因此我要求你提供2美分。感谢。

更新：我使用cProfile进行了一些分析以获得更好的想法 - 这是一些信息，按累计时间排序。

In [19]: p.sort_stats('cumulative').print_stats(50)
Mon Oct 16 16:43:59 2017    profiling_log.txt

         555404 function calls (543552 primitive calls) in 662.201 seconds

   Ordered by: cumulative time
   List reduced from 4510 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    897/1    0.044    0.000  662.202  662.202 {built-in method builtins.exec}
        1    0.000    0.000  662.202  662.202 test_rhst.py:2(<module>)
        1    0.001    0.001  661.341  661.341 test_rhst.py:70(test_chance_classifier_binary)
        1    0.000    0.000  661.336  661.336 /Users/Reddy/dev/neuropredict/neuropredict/rhst.py:677(run)
        4    0.000    0.000  661.233  165.308 /Users/Reddy/anaconda/envs/py36/lib/python3.6/threading.py:533(wait)
        4    0.000    0.000  661.233  165.308 /Users/Reddy/anaconda/envs/py36/lib/python3.6/threading.py:263(wait)
       23  661.233   28.749  661.233   28.749 {method 'acquire' of '_thread.lock' objects}
        1    0.000    0.000  661.233  661.233 /Users/Reddy/anaconda/envs/py36/lib/python3.6/multiprocessing/pool.py:261(map)
        1    0.000    0.000  661.233  661.233 /Users/Reddy/anaconda/envs/py36/lib/python3.6/multiprocessing/pool.py:637(get)
        1    0.000    0.000  661.233  661.233 /Users/Reddy/anaconda/envs/py36/lib/python3.6/multiprocessing/pool.py:634(wait)
    866/8    0.004    0.000    0.868    0.108 <frozen importlib._bootstrap>:958(_find_and_load)
    866/8    0.003    0.000    0.867    0.108 <frozen importlib._bootstrap>:931(_find_and_load_unlocked)
    720/8    0.003    0.000    0.865    0.108 <frozen importlib._bootstrap>:641(_load_unlocked)
    596/8    0.002    0.000    0.865    0.108 <frozen importlib._bootstrap_external>:672(exec_module)
   1017/8    0.001    0.000    0.863    0.108 <frozen importlib._bootstrap>:197(_call_with_frames_removed)
   522/51    0.001    0.000    0.765    0.015 {built-in method builtins.__import__}

现在按time

排序的分析信息

In [20]: p.sort_stats('time').print_stats(20)
Mon Oct 16 16:43:59 2017    profiling_log.txt

         555404 function calls (543552 primitive calls) in 662.201 seconds

   Ordered by: internal time
   List reduced from 4510 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       23  661.233   28.749  661.233   28.749 {method 'acquire' of '_thread.lock' objects}
   115/80    0.177    0.002    0.211    0.003 {built-in method _imp.create_dynamic}
      595    0.072    0.000    0.072    0.000 {built-in method marshal.loads}
        1    0.045    0.045    0.045    0.045 {method 'acquire' of '_multiprocessing.SemLock' objects}
    897/1    0.044    0.000  662.202  662.202 {built-in method builtins.exec}
        3    0.042    0.014    0.042    0.014 {method 'read' of '_io.BufferedReader' objects}
2037/1974    0.037    0.000    0.082    0.000 {built-in method builtins.__build_class__}
      286    0.022    0.000    0.061    0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/site-packages/scipy/misc/doccer.py:12(docformat)
     2886    0.021    0.000    0.021    0.000 {built-in method posix.stat}
       79    0.016    0.000    0.016    0.000 {built-in method posix.read}
      597    0.013    0.000    0.021    0.000 <frozen importlib._bootstrap_external>:830(get_data)
      276    0.011    0.000    0.013    0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/sre_compile.py:250(_optimize_charset)
      108    0.011    0.000    0.038    0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:626(_construct_argparser)
     1225    0.011    0.000    0.050    0.000 <frozen importlib._bootstrap_external>:1233(find_spec)
     7179    0.009    0.000    0.009    0.000 {method 'splitlines' of 'str' objects}
       33    0.008    0.000    0.008    0.000 {built-in method posix.waitpid}
      283    0.008    0.000    0.015    0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/site-packages/scipy/misc/doccer.py:128(indentcount_lines)
        3    0.008    0.003    0.008    0.003 {method 'poll' of 'select.poll' objects}
     7178    0.008    0.000    0.008    0.000 {method 'expandtabs' of 'str' objects}
      597    0.007    0.000    0.007    0.000 {method 'read' of '_io.FileIO' objects}

按percall信息排序的更多分析信息：

更新2

我之前提到的大型列表datasets中的元素通常不是很大 - 它们通常分别为10-25MB。但是根据所使用的浮点精度，样本数和特征数，每个元素也可轻松增长到500MB-1GB。因此，我更喜欢可以扩展的解决方案。

更新3：

holdout_trial_compare_datasets中的代码使用scikit-learn的GridSearchCV方法，如果我们设置了n_jobs＆gt;，它会在内部使用joblib库。 1（或者每当我们设置它时）。这可能会导致多处理和joblib之间的一些不良交互。所以尝试另一个配置，我根本没有设置n_jobs（这应该默认在scikit-learn中没有并行性）。会让你发布。

Answer 1

根据评论中的讨论，我做了一个小型实验，比较了三个版本的实现：

v1：基本上与您的方法相同，事实上，因为partial(f1, *shared_inputs)将立即解包proxy_manager.list，Manager.List不涉及此处，数据传递给工作人员，内部队列为{{ 1}}。
v2：v2使用Pool，工作函数将接收Manager.List对象，它通过内部连接获取共享数据到服务器进程。
v3：来自父级的子进程共享数据，利用ListProxy系统调用。

fork(2)

采用16核E5-2682 VPC的结果，很明显v3更好地扩展：

Answer 2

{method 'acquire' of '_thread.lock' objects}

查看您的探查器输出我会说共享对象锁定/解锁开销超过了多线程的速度增益。

重构，以便将工作分配给那些不需要彼此交谈的工人。

具体而言，如果可能，每个数据堆得出一个答案，然后对累积的结果进行处理。

这就是为什么Queues看起来更快的原因：它们涉及一种不需要“管理”并因此锁定/解锁的对象的工作。

仅“管理”流程之间绝对需要共享的内容。您的托管列表包含一些看起来非常复杂的对象......

更快的范例是：

allwork = manager.list([a, b,c])
theresult = manager.list()

然后

while mywork:
    unitofwork = allwork.pop()
    theresult = myfunction(unitofwork)

Answer 3

如果您不需要复杂的共享对象，那么只使用可以想象的最简单对象的列表。

然后告诉工人们获取他们可以在自己的小世界中处理的复杂数据。

尝试：

allwork = manager.list([datasetid1, datasetid2 ,...])
theresult = manager.list()

while mywork:
    unitofworkid = allwork.pop()
    theresult = myfunction(unitofworkid)

def myfunction(unitofworkid):
    thework = acquiredataset(unitofworkid)
    result = holdout_trial_compare_datasets(thework, ...)

我希望这是有道理的。在这个方向上进行重构不应该花费太多时间。而你应该看到{方法'获取''_thread.lock'对象}的数字在你描述时会像摇滚一样掉落。

为什么我没有通过Python中的多处理看到加速？

3 个答案: