Question

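In C++ we can use OpenMP for parallel programming; however, OpenMP does not work with Python. What should I do if I want to parallelize some parts of my Python program?

The structure of the code can be viewed as: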

 solve1(A)
 solve2(B)

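where solve1 and solve2 are two independent functions. How can I run this kind of code in parallel instead of sequentially, to reduce the running time? I hope someone can help; many thanks in advance. The code is: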

def solve(Q, G, n):
    i = 0
    tol = 10 ** -4

    while i < 1000:
        inneropt, partition, x = setinner(Q, G, n)
        outeropt = setouter(Q, G, n)

        if (outeropt - inneropt) / (1 + abs(outeropt) + abs(inneropt)) < tol:
            break

        node1 = partition[0]
        node2 = partition[1]

        G = updateGraph(G, node1, node2)

        if i == 999:
            print("Maximum iteration reached")
        i += 1  # advance the counter so the loop ends after 1000 iterations
    print(inneropt)

where setinner and setouter are two independent functions. That is the part I want to parallelize...

Answer 1

You can use the multiprocessing module. For this case I would probably use a processing pool:

from multiprocessing import Pool
pool = Pool()
result1 = pool.apply_async(solve1, [A])    # evaluate "solve1(A)" asynchronously
result2 = pool.apply_async(solve2, [B])    # evaluate "solve2(B)" asynchronously
answer1 = result1.get(timeout=10)
answer2 = result2.get(timeout=10)

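This spawns processes that can do generic work for you. Since we did not pass processes, it spawns one process for each CPU core on your machine. Each CPU core can execute one process at a time.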
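
For reference, here is a self-contained version of the snippet above that can be run as a script. It is only a sketch: the bodies of solve1 and solve2 are hypothetical stand-ins (the question does not show them), and run_both is a name invented here for illustration.

```python
from multiprocessing import Pool


def solve1(a):
    """Hypothetical stand-in for the first independent task."""
    return sum(x * x for x in a)


def solve2(b):
    """Hypothetical stand-in for the second independent task."""
    return max(b) - min(b)


def run_both(a, b, timeout=10):
    """Submit both tasks to a pool and wait for both results."""
    # The `with` block shuts down the pool's worker processes on exit.
    with Pool(processes=2) as pool:
        result1 = pool.apply_async(solve1, [a])  # starts solve1(a) in a worker
        result2 = pool.apply_async(solve2, [b])  # starts solve2(b) in a worker
        return result1.get(timeout=timeout), result2.get(timeout=timeout)


if __name__ == "__main__":
    print(run_both([1, 2, 3], [10, 4, 7]))  # prints (14, 6)
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by launching a fresh interpreter (the default on Windows and macOS), because the main module is imported again in each worker.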

If you want to map a list onto a single function, you can do this:

args = [A, B]
results = pool.map(solve1, args)

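Don't use threads for CPU-bound work like this: in CPython the GIL allows only one thread to execute Python bytecode at a time, so threads would not run solve1 and solve2 in parallel.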
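
One caveat to the advice above: CPython releases the GIL while a thread is blocked (sleeping, waiting on a socket or a file), so threads can still overlap waiting time even though they cannot speed up pure-Python computation. A minimal sketch, where fake_download and run_io_tasks are hypothetical names and time.sleep stands in for real I/O:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(delay):
    """Hypothetical I/O-bound task: sleeping stands in for a network wait."""
    time.sleep(delay)  # the GIL is released while this thread sleeps
    return delay


def run_io_tasks(delays):
    """Run the waits in threads; return the results and the elapsed time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as executor:
        results = list(executor.map(fake_download, delays))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_io_tasks([0.2, 0.2, 0.2])
    # The three waits overlap, so the total is well under the 0.6 s
    # that running them one after another would take.
    print(results, f"{elapsed:.2f}s")
```

For the CPU-bound functions in the question this does not help, which is why the process pool is the better fit there.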

Answer 2

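This can be done very elegantly with Ray.

To parallelize your example, you would define your functions with the @ray.remote decorator and then invoke them with .remote.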

import ray

ray.init()

# Define the functions.

@ray.remote
def solve1(a):
    return 1

@ray.remote
def solve2(b):
    return 2

# Start two tasks in the background.
x_id = solve1.remote(0)
y_id = solve2.remote(1)

# Block until the tasks are done and get the results.
x, y = ray.get([x_id, y_id])

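This has a number of advantages over the multiprocessing module:

The same code will run on a multicore machine as well as on a cluster of machines.
Processes share data efficiently through shared memory and zero-copy serialization.
Error messages are propagated nicely.

These function calls can be composed together, e.g.,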

@ray.remote
def f(x):
    return x + 1

x_id = f.remote(1)
y_id = f.remote(x_id)
z_id = f.remote(y_id)
ray.get(z_id)  # returns 4

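In addition to invoking functions remotely, classes can be instantiated remotely as actors.

Note that Ray is a framework I have been helping to develop.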

Answer 3

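CPython uses the Global Interpreter Lock, which makes parallel programming a bit more interesting than in C++.

This topic has several useful examples and descriptions of the challenge: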

Python Global Interpreter Lock (GIL) workaround on multi-core systems using taskset on Linux?

Answer 4

You can use the joblib library for parallel computation and multiprocessing.

from joblib import Parallel, delayed

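You can simply create a function foo that you want to run in parallel, and implement the parallel processing with the following piece of code: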

output = Parallel(n_jobs=num_cores)(delayed(foo)(i) for i in input)

where num_cores can be obtained from the multiprocessing library as follows:

import multiprocessing

num_cores = multiprocessing.cpu_count()

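If your function takes more than one input argument and you only want to iterate over one of them with a list, you can use the partial function from the functools library as follows: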

from joblib import Parallel, delayed
import multiprocessing
from functools import partial
def foo(arg1, arg2, arg3, arg4):
    '''
    body of the function
    '''
    return output
input = [11,32,44,55,23,0,100,...] # arbitrary list
num_cores = multiprocessing.cpu_count()
foo_ = partial(foo, arg2=arg2, arg3=arg3, arg4=arg4)
# arg1 is being fetched from input list
output = Parallel(n_jobs=num_cores)(delayed(foo_)(i) for i in input)

You can find a complete explanation of multiprocessing in Python and R, with a couple of examples, here.
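
The partial trick does not depend on joblib. As a sketch using only the standard library, the same idea works with multiprocessing.Pool; scale_and_shift and run are hypothetical names, and the numbers are arbitrary:

```python
from functools import partial
from multiprocessing import Pool


def scale_and_shift(value, factor, offset):
    """Hypothetical multi-argument function; only `value` varies per call."""
    return value * factor + offset


def run(values, factor=2, offset=1):
    """Apply scale_and_shift to every value with fixed factor and offset."""
    # Bind the arguments that stay constant; the result takes one argument.
    bound = partial(scale_and_shift, factor=factor, offset=offset)
    with Pool() as pool:
        return pool.map(bound, values)


if __name__ == "__main__":
    print(run([11, 32, 44]))  # prints [23, 65, 89]
```

A partial of a module-level function can be pickled, which is what allows the pool to send it to the worker processes.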

Answer 5

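As others have said, the solution is to use multiple processes. Which framework is more appropriate, however, depends on many factors. In addition to the ones already mentioned, there are also charm4py and mpi4py (I am the developer of charm4py).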

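There is a more efficient way to implement the above example than using the worker pool abstraction. In each of the 1000 iterations, the main loop sends the same parameters (including the complete graph G) to the workers over and over. Since at least one worker resides in a different process, this involves copying the arguments and sending them to the other process(es). Depending on the size of the objects, this can be very costly. Instead, it makes sense to have the workers store state and to send them only the updated information.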
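
The idea of a worker that keeps its own state can also be sketched with the standard library alone. This is an illustration under assumptions, not the charm4py approach: worker_loop and run are hypothetical names, the graph is a plain dict of sets, and the "update" just adds an edge.

```python
import multiprocessing as mp


def worker_loop(conn, graph):
    """Keep `graph` in this process; apply small updates sent by the parent."""
    while True:
        message = conn.recv()
        if message is None:  # sentinel: the parent is done
            break
        node1, node2 = message
        graph.setdefault(node1, set()).add(node2)  # hypothetical update step
        conn.send(sum(len(edges) for edges in graph.values()))  # edge count
    conn.close()


def run(updates):
    """Send the graph once, then only (node1, node2) pairs per iteration."""
    parent_conn, child_conn = mp.Pipe()
    worker = mp.Process(target=worker_loop, args=(child_conn, {}))
    worker.start()
    results = []
    for update in updates:
        parent_conn.send(update)  # small message, not the whole graph
        results.append(parent_conn.recv())
    parent_conn.send(None)
    worker.join()
    return results


if __name__ == "__main__":
    print(run([(0, 1), (1, 2), (0, 2)]))  # prints [1, 2, 3]
```

The initial graph crosses the process boundary once, when the worker starts; after that each iteration transfers only a pair of node ids and a small result.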

For example, in charm4py this can be done like this:

class Worker(Chare):

    def __init__(self, Q, G, n):
        self.G = G
        ...

    def setinner(self, node1, node2):
        self.updateGraph(node1, node2)
        ...


def solve(Q, G, n):
    # create 2 workers, each on a different process, passing the initial state
    worker_a = Chare(Worker, onPE=0, args=[Q, G, n])
    worker_b = Chare(Worker, onPE=1, args=[Q, G, n])
    while i < 1000:
        result_a = worker_a.setinner(node1, node2, ret=True)  # execute setinner on worker A
        result_b = worker_b.setouter(node1, node2, ret=True)  # execute setouter on worker B

        inneropt, partition, x = result_a.get()  # wait for result from worker A
        outeropt = result_b.get()  # wait for result from worker B
        ...

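Note that for this example we really only need one worker. The main loop could execute one of the functions and have the worker execute the other. But my code helps to illustrate a couple of things:

Worker A runs in process 0 (the same process as the main loop). While result_a.get() is blocked waiting for the result, worker A does the computation in that same process.
Arguments are automatically passed by reference to worker A, since it is in the same process (no copying is involved).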

Answer 6

In some cases it is possible to parallelize loops automatically using Numba, though it only works with a small subset of Python.

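Unfortunately, it seems that Numba only works with Numpy arrays and not with other Python objects. In theory, it might also be possible to compile Python to C++ and then automatically parallelize it using the Intel C++ compiler, though I haven't tried this yet.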
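
As a minimal sketch of the kind of loop Numba can parallelize, the following uses njit(parallel=True) together with prange. The names prange_sum and make_input are hypothetical, and the fallback branch is an assumption added only so that the script still runs (serially) where Numba or Numpy is not installed.

```python
try:
    import numpy as np
    from numba import njit, prange
    HAVE_NUMBA = True
except ImportError:
    # Fallback (an assumption for portability): plain serial Python.
    HAVE_NUMBA = False
    prange = range

    def njit(*args, **kwargs):
        return lambda func: func


@njit(parallel=True)
def prange_sum(values):
    """Sum the elements; with Numba the iterations are split across threads."""
    total = 0.0
    for i in prange(len(values)):
        total += values[i]  # Numba treats this as a parallel reduction
    return total


def make_input(n):
    """Build 0.0 .. n-1 as a Numpy array (or a list in the fallback)."""
    if HAVE_NUMBA:
        return np.arange(n, dtype=np.float64)
    return [float(i) for i in range(n)]


if __name__ == "__main__":
    print(prange_sum(make_input(1000)))  # prints 499500.0
```

The first call includes compilation time, so any speedup only shows on later calls or on large arrays.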

How to do parallel programming in Python

6 answers