Question

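I was under the assumption that if I write my code in Cython using the nogil directive, it would bypass the GIL and I could use a ThreadPoolExecutor to use multiple cores. Or, more likely, I messed something up in the implementation, but I can't seem to figure out what.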

I've written a simple n-body simulation using the Barnes-Hut algorithm and would like to run the lookups in parallel:

# cython: boundscheck=False
# cython: wraparound=False
...

def estimate_forces(self, query_point):
    ...
    cdef np.float64_t[:, :] forces

    forces = np.zeros_like(query_point, dtype=np.float64)
    estimate_forces_multiple(self.root, query_point.data, forces, self.theta)

    return np.array(forces, dtype=np.float64)


cdef void estimate_forces_multiple(...) nogil:
    for i in range(len(query_points)):
        ...
        estimate_forces(cell, query_point, forces, theta)

I call the code like this:

data = np.random.uniform(0, 100, (1000000, 2))

executor = ThreadPoolExecutor(max_workers=max_workers)

quad_tree = QuadTree(data)

chunks = np.array_split(data, max_workers)
forces = executor.map(quad_tree.estimate_forces, chunks)
forces = np.vstack(list(forces))
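The chunk, map, and stack pattern above can be sketched on its own in plain Python. This is only an illustration: `estimate_forces_stub` and `run_in_chunks` are hypothetical names, and the stub stands in for `quad_tree.estimate_forces`, which is not shown in full here. The point is that `executor.map` yields results in the order of its inputs, so `np.vstack` puts the rows back in their original order.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def estimate_forces_stub(chunk):
    # Hypothetical placeholder for quad_tree.estimate_forces:
    # returns one output row per input row.
    return chunk * 2.0


def run_in_chunks(data, max_workers):
    # Split the rows into max_workers roughly equal chunks.
    chunks = np.array_split(data, max_workers)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in input order, even if a later
        # chunk happens to finish first.
        results = list(executor.map(estimate_forces_stub, chunks))
    # Stack the per-chunk results back into one (n, 2) array.
    return np.vstack(results)


data = np.random.uniform(0, 100, (1000, 2))
forces = run_in_chunks(data, max_workers=4)
```

Whether this gives any speedup depends entirely on whether the worker function releases the GIL while it runs.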

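I've omitted a lot of code to make the code in question clearer. My understanding is that increasing max_workers should use multiple cores and give a substantial speedup. However, that does not seem to be the case: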

> time python barnes_hut.py --max-workers 1
python barnes_hut.py  9.35s user 0.61s system 106% cpu 9.332 total

> time python barnes_hut.py --max-workers 2
python barnes_hut.py  9.05s user 0.64s system 107% cpu 9.048 total

> time python barnes_hut.py --max-workers 4
python barnes_hut.py  9.08s user 0.64s system 107% cpu 9.035 total

> time python barnes_hut.py --max-workers 8
python barnes_hut.py  9.12s user 0.71s system 108% cpu 9.098 total

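Building the quadtree takes less than 1 second, so most of the time is spent in estimate_forces_multiple, but clearly I get no speedup from using multiple threads. Looking at top, it doesn't appear to use multiple cores either.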

My guess is that I must have missed something quite crucial, but I can't figure out what.

Answer 1

I was missing a crucial part, the one that actually signals that the GIL should be released:

def estimate_forces(self, query_point):
    ...
    cdef np.float64_t[:, :] forces

    forces = np.zeros_like(query_point, dtype=np.float64)
    # HERE: take a typed memoryview, then release the GIL around the call
    cdef DTYPE_t[:, :] query_points = query_point.data
    with nogil:
        estimate_forces_multiple(self.root, query_points, forces, self.theta)

    return np.array(forces, dtype=np.float64)
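The same principle can be observed without compiling anything: threads only run in parallel while the code they call has released the GIL. The sketch below is a standard-library analogue, not the Cython code above. It relies on the documented behaviour that hashlib releases the GIL while hashing buffers larger than 2047 bytes; the `digest` helper and the block sizes are assumptions chosen for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(block):
    # hashlib releases the GIL while hashing large buffers, so several
    # of these calls can run on different cores at the same time.
    return hashlib.sha256(block).hexdigest()


# Eight distinct 4 MiB blocks (sizes chosen arbitrarily for the sketch).
blocks = [bytes([i]) * (4 * 1024 * 1024) for i in range(8)]

# Reference run on a single thread.
sequential = [digest(b) for b in blocks]

# Same work spread over a thread pool; results come back in input order.
with ThreadPoolExecutor(max_workers=4) as executor:
    threaded = list(executor.map(digest, blocks))
```

Replacing `digest` with a pure-Python loop would keep the GIL held, and the threaded version would then be expected to take about as long as the sequential one, which matches the behaviour described in the question.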

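I also found that the UNIX time command did not do what I wanted for multithreaded programs and reported the same numbers (I guess it reported CPU time?). Using Python's timeit gave the expected results: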

max_workers=1: 91.2366s
max_workers=2: 36.7975s
max_workers=4: 30.1390s
max_workers=8: 24.0240s
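To keep wall-clock time and CPU time apart when timing from inside Python, one option is to record both. This is a minimal sketch, with `measure` as a hypothetical helper: `time.perf_counter` gives elapsed wall-clock time, while `time.process_time` gives the CPU time of the whole process, summed over all its threads. With a well-parallelised workload the CPU time can therefore exceed the wall-clock time.

```python
import time


def measure(fn, *args):
    # Record both clocks before and after the call.
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


# Sleeping takes wall-clock time but almost no CPU time.
_, wall, cpu = measure(time.sleep, 0.2)
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

For a parallel speedup, the number to compare across max_workers settings is the wall-clock time.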

Cython nogil with ThreadPoolExecutor does not give a speedup
