Question

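I am trying to parallelize my code, but I have run into something strange that I cannot explain.

Let me give some context. The job is very heavy: it reads several files and runs a machine-learning analysis on them, with a lot of math involved. When I run the code sequentially it works fine on both Windows and Linux, but as soon as I try to use multiprocessing everything breaks. Below is the example I first developed on Windows: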

from multiprocessing.dummy import Pool as ThreadPool 

def ppp(element):
    window,day = element
    print(window,day)
    time.sleep(5)
    return

if __name__ == '__main__':
    #%% Reading datasets
    print('START')
    start_time = current_milli_time()
    tree = pd.read_csv('datan\\days.csv')
    days = list(tree.columns)
    # to be able to run this code uncomment the following line and comment the previous two
    # days = ['0808', '0810', '0812', '0813', '0814', '0817', '0818', '0827', '0828', '0829']
    windows = [1000]
    processes_args = list(itertools.product(windows, days))

    pool = ThreadPool(8) 
    results = pool.map_async(ppp, processes_args)
    pool.close() 
    pool.join() 
    print('END', current_milli_time()-start_time, 'ms')

When I run this code on Windows, the output looks like this:

START
100010001000 1000 1000100010001000      081008120808
08130814
0818
082708171000
1000    
  08290828

END 5036 ms

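The jumbled lines are all printed within 125 ms. The behavior is the same on Linux. However, I noticed that if I run this on Linux and look at "htop", what I see is a set of threads that are picked at random for execution, but they never execute in parallel. So, after some Google searches, I came up with this new code: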

from multiprocessing import Pool as ProcessPool

def ppp(element):
    window,day = element
    print(window,day)
    time.sleep(5)
    return

if __name__ == '__main__':
    #%% Reading datasets
    print('START')
    start_time = current_milli_time()
    tree = pd.read_csv('datan\\days.csv')
    days = list(tree.columns)
    # to be able to run this code uncomment the following line and comment the previous two
    # days = ['0808', '0810', '0812', '0813', '0814', '0817', '0818', '0827', '0828', '0829']
    windows = [1000]
    processes_args = list(itertools.product(windows, days))

    pool = ProcessPool(8) 
    results = pool.map_async(ppp, processes_args)
    pool.close() 
    pool.join() 
    print('END', current_milli_time()-start_time, 'ms')

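As you can see, I changed the import statement, which basically creates a process pool instead of a thread pool. That solved the problem on Linux: in the real scenario I have 8 processors running at 100%, with 8 processes running in the system. The output looks like the previous one. However, when I use this code on Windows, the whole run takes more than 10 seconds, and I do not get any of the prints from ppp, only the ones from the main block.

I really tried to search for an explanation, but I do not understand why this happens. For example, here: Python multiprocessing Pool strange behavior in Windows, they talk about safe code on Windows, and the answer suggests moving to threading, which as a side effect makes the code concurrent rather than parallel. Here is another example: Python multiprocessing linux windows difference. All these questions describe fork() and spawn, but I personally think that is not what my problem is about. The Python documentation still states that Windows has no fork() method (https://docs.python.org/2/library/multiprocessing.html#programming-guidelines).

To sum up, right now I am convinced that I cannot do parallel processing on Windows, but I think the conclusion I drew from all these discussions is wrong. So my question is: is it possible to run processes or threads in parallel (on different CPUs) on Windows?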

EDIT: added if __name__ == '__main__' to both examples

EDIT2: to be able to run the code, this function and these imports are needed:

import time
import itertools    
current_milli_time = lambda: int(round(time.time() * 1000))

Answer 1

On Windows, Python's multiprocessing module uses pickle/unpickle to mimic fork. While unpickling, the module is re-imported, so any code at global scope is executed again. The docs say:

Instead one should protect the "entry point" of the program by using if __name__ == '__main__':

In addition, you should consume the AsyncResult returned by pool.map_async (for example by calling its get() method), or simply use pool.map.
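As a rough illustration of both points, here is a minimal sketch based on the question's second example. It assumes a pool of 4 workers, a shortened sleep, and a ppp that returns a string instead of only printing; the main() wrapper and the 60-second timeout are illustrative choices, not something from the question.

```python
import itertools
import time
from multiprocessing import Pool


def ppp(element):
    """Worker: unpack one (window, day) pair and return a value."""
    window, day = element
    time.sleep(0.1)  # stand-in for the real work (the question sleeps 5 s)
    return f"{window} {day}"


def main():
    # Sample values copied from the question's commented-out 'days' line.
    days = ['0808', '0810', '0812', '0813']
    windows = [1000]
    processes_args = list(itertools.product(windows, days))

    with Pool(4) as pool:  # pool size is an assumption
        async_result = pool.map_async(ppp, processes_args)
        # get() blocks until every task is done and re-raises any
        # exception that occurred inside a worker process.
        results = async_result.get(timeout=60)

    for line in results:
        print(line)
    return results


# The guard keeps child processes from re-running main() when they
# re-import this module on Windows.
if __name__ == '__main__':
    main()
```

Calling get() also surfaces errors that would otherwise stay hidden: with a bare map_async followed by close() and join(), an exception raised in a worker is stored in the AsyncResult and never shown.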

Answer 2

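You can do parallel processing on Windows (I have a script running right now that does heavy computation and uses 100% of all 8 cores), but the way it works is by creating parallel processes, not threads (which will not run in parallel because of the GIL, except for I/O operations). A few points:

You need to use concurrent.futures.ProcessPoolExecutor() (note that it is a process pool, not a thread pool). See https://docs.python.org/3/library/concurrent.futures.html. In short, you put the code you want to parallelize into a function and then call executor.map(), which takes care of splitting the work.

Note that on Windows each parallel process starts from scratch, so you will probably need to use if __name__ == '__main__': in a few places to distinguish what the main process does from what the others do. Data you load in the main script is copied to the child processes, so it must be serializable (picklable, in Python lingo).

To use the cores efficiently, avoid writing data to objects shared across all processes (for example a progress counter or a common data structure). Otherwise the synchronization between processes will hurt performance. So keep an eye on the execution from Task Manager.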
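The points above can be sketched roughly as follows. This is only an illustration under assumptions: count_even and split are hypothetical helpers standing in for the real analysis, and the worker count of 4 is arbitrary. Each worker receives its own picklable chunk, returns a partial result, and the parent aggregates once at the end, so no shared object is written to.

```python
from concurrent.futures import ProcessPoolExecutor


def count_even(chunk):
    """Hypothetical CPU-bound task: works on a private chunk only."""
    return sum(1 for x in chunk if x % 2 == 0)


def split(data, n_chunks):
    """Hypothetical helper: cut data into at most n_chunks slices."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def main():
    data = list(range(1000))
    chunks = split(data, 4)
    with ProcessPoolExecutor(max_workers=4) as executor:
        # executor.map pickles each chunk, sends it to a worker
        # process, and yields the results in input order.
        partials = list(executor.map(count_even, chunks))
    total = sum(partials)  # aggregation happens once, in the parent
    print(total)
    return total


# Without this guard, each child process started on Windows would
# execute the pool-creating code again when it re-imports the script.
if __name__ == '__main__':
    main()
```

Returning partial results and combining them in the parent avoids locks and shared counters entirely, which is usually the simplest way to keep all cores busy.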

Python parallel processing - different behavior between Linux and Windows
