Question

As DavidW recommended in another topic, I am trying to write a C wrapper function that uses OpenMP to multithread Cython code.

Here is what I have:

The C file "paral.h":

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>


void paral(void (*func)(int,int), int nthreads){
    int t;
    #pragma omp parallel for
    for (t = 0; t < nthreads; t++){
        (*func)(t, nthreads);
    }
}

The test.pyx file:

import time
import random
cimport cython
from libc.stdlib cimport malloc, realloc, free

ctypedef void (*func)(int,int)

cdef extern from "paral.h":
    void paral(func function, int nthreads) nogil

cdef double *a = <double *> malloc ( 1000000 * sizeof(double) )
cdef double *b = <double *> malloc ( 1000000 * sizeof(double) )
cdef double *c = <double *> malloc ( 1000000 * sizeof(double) )

cdef int i
for i in range(1000000):
    a[i] = random.random()
    b[i] = random.random()

cdef void sum_ab(int thread, int nthreads):
    cdef int start, stop, i
    start = thread * (1000000 // nthreads)
    stop = start + (1000000 // nthreads)
    for i in range(start, stop):
        c[i] = a[i] + b[i]

t0 = time.clock()
with nogil:
    paral(sum_ab,4)
print(time.clock()-t0)

t0 = time.clock()
with nogil:
    paral(sum_ab,1)
print(time.clock()-t0)
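
One detail worth checking in `sum_ab`: the chunk size comes from an integer division, so when the array length is not a multiple of `nthreads` the trailing elements are never processed. With 1000000 elements and 4 or 1 threads there is no remainder, so the timings here are not affected. A minimal Python sketch of a split that covers every index follows; `chunk_bounds` and `naive_bounds` are hypothetical helpers, not part of the original code.

```python
def chunk_bounds(n, thread, nthreads):
    """Return the half-open range [start, stop) handled by `thread`.

    The first n % nthreads threads each take one extra element, so the
    chunks are contiguous and together cover all n indices.
    """
    base, extra = divmod(n, nthreads)
    start = thread * base + min(thread, extra)
    stop = start + base + (1 if thread < extra else 0)
    return start, stop


def naive_bounds(n, thread, nthreads):
    """The split used in sum_ab: drops the last n % nthreads elements."""
    start = thread * (n // nthreads)
    return start, start + n // nthreads


if __name__ == "__main__":
    n, nthreads = 10, 4
    print([naive_bounds(n, t, nthreads) for t in range(nthreads)])
    print([chunk_bounds(n, t, nthreads) for t in range(nthreads)])
```

The same arithmetic could be moved into the `cdef` function, since it only uses integer operations.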

I have Visual Studio, so in setup.py I added:

extra_compile_args=["/openmp"],
extra_link_args=["/openmp"]
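
For anyone building the same extension with a different compiler, the flag differs. This is a small sketch of how the arguments might be chosen per platform, assuming MSVC on Windows and a GCC-style compiler elsewhere; `openmp_flags` is a hypothetical helper.

```python
import sys


def openmp_flags(platform=sys.platform):
    """Return (extra_compile_args, extra_link_args) for OpenMP.

    Assumption: MSVC on Windows, GCC-style compilers elsewhere.
    """
    if platform.startswith("win"):
        # Same values as in the setup.py fragment above.
        return ["/openmp"], ["/openmp"]
    # GCC (and Clang builds with OpenMP support) use -fopenmp.
    return ["-fopenmp"], ["-fopenmp"]


compile_args, link_args = openmp_flags()
# These lists would then be passed to the Extension(...) call in setup.py:
#     extra_compile_args=compile_args, extra_link_args=link_args
```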

Result: 4 threads is slightly slower than 1 thread. Does anyone know what I am doing wrong here?

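Edit:

In response to the commenter 祖尔坦:

To make sure the time measured by time.clock() is correct, I made the run last several seconds so that I could compare the time reported by time.clock() with the time I measured with a stopwatch. Something like this: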

print("start timer 1")

t1 = time.clock()
for i in range(10000):
    with nogil:
        paral(sum_ab,4)
t2 = time.clock()

print(t2-t1)
print("start timer 2")

t1 = time.clock()
for i in range(10000):
    with nogil:
        paral(sum_ab,1)
t2 = time.clock()

print(t2-t1)
print("stop")

time.clock() reports 15.0 s with 4 threads and 14.5 s with 1 thread, and I saw no noticeable difference from my stopwatch measurements.
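
Note that `time.clock()` behaved differently across platforms (wall-clock time on Windows, processor time on Unix) and was removed in Python 3.8. A minimal sketch, assuming Python 3.3 or later, records both kinds of time so they can be compared directly. `measure` is a hypothetical helper, and the `time.sleep` call merely stands in for the `paral(...)` call.

```python
import time


def measure(func, repeats=1):
    """Run func() `repeats` times; return (wall_seconds, cpu_seconds)."""
    wall0 = time.perf_counter()   # elapsed real time
    cpu0 = time.process_time()    # CPU time of the whole process, all threads
    for _ in range(repeats):
        func()
    return time.perf_counter() - wall0, time.process_time() - cpu0


if __name__ == "__main__":
    wall, cpu = measure(lambda: time.sleep(0.05), repeats=2)
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

For a multithreaded run, the wall-clock figure is the one that should shrink. The CPU figure sums over all threads and can stay flat or even grow.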

Edit 2: I think I have figured out what is happening here. I read that in some cases the memory bandwidth can become saturated. If I replace:

c[i] = a[i] + b[i]

with a more complex operation, such as:

c[i] = a[i]**b[i]

I now get a significant speedup (around 2x) between the single-threaded and multithreaded runs.
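
The bandwidth explanation can be sanity-checked against the timings reported earlier for the addition loop (10000 calls over 1000000 doubles, 14.5 s and 15.0 s). The sketch below is a rough estimate only. It assumes each call streams `a` and `b` in and `c` out exactly once, ignoring write-allocate traffic and any cache reuse, and `effective_bandwidth_gb_s` is a hypothetical helper.

```python
BYTES_PER_DOUBLE = 8
N = 1_000_000          # elements per array, as in test.pyx
CALLS = 10_000         # repetitions in the timing loop


def effective_bandwidth_gb_s(seconds, arrays_touched=3):
    """Estimate memory traffic in GB/s for c[i] = a[i] + b[i].

    Assumes each call streams a and b in and c out exactly once
    (write-allocate traffic and cache reuse are ignored).
    """
    total_bytes = arrays_touched * BYTES_PER_DOUBLE * N * CALLS
    return total_bytes / seconds / 1e9


if __name__ == "__main__":
    print(round(effective_bandwidth_gb_s(14.5), 1))  # 1 thread
    print(round(effective_bandwidth_gb_s(15.0), 1))  # 4 threads
```

Under these assumptions both runs move roughly 16 to 17 GB/s. A figure that barely changes with the thread count is consistent with the loop being limited by memory traffic rather than by arithmetic.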

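However, this is still 2 times slower than a classic prange loop! I don't see any reason for prange to be that much faster. Maybe I need to change the C code...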

Parallel C wrapper for Cython code

0 answers: