As a user of Python for data analysis and numerical calculations, rather than a real "coder", I had been missing a really low-overhead way of distributing embarrassingly parallelizable loop calculations over several cores. As I learned, there used to be the prange construct in Numba, but it was abandoned because of "instability and performance issues".
Playing with the newly open-sourced @guvectorize decorator, I found a way to use it for a virtually no-overhead emulation of the functionality of the late prange.
I am very happy to have this tool at hand now, thanks to the guys at Continuum Analytics, and did not find anything on the web explicitly mentioning this use of @guvectorize. Although it may be trivial to people who have used NumbaPro before, I'm posting this for all those fellow non-coders out there (see my answer to this "question").
Answer 0 (score: -1)
Consider the example below: a two-level nested for loop, whose core does some nontrivial numerical calculation involving two input arrays and a function of the loop indices, is executed in four different ways. Each variant is timed with IPython's %timeit magic:

1. naive nested for loop, compiled with numba.jit
2. forall-like construct using numba.guvectorize, executed in a single thread (target = "cpu")
3. forall-like construct using numba.guvectorize, executed in as many threads as there are CPU "cores" (target = "parallel")
4. the same as 3., but with the sequence of "parallel" loop indices randomly permuted

The last variant is included because (in this particular example) the range of the inner loop depends on the value of the outer loop index. I don't know how the dispatching of gufunc calls is organized inside numpy, but it appears that the randomization of the "parallel" loop indices achieves slightly better load balancing.
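To make the mechanics explicit before the timings: the trick is that the gufunc's core function receives one loop index per call, and broadcasting over an array of indices plays the role of the (parallel) outer loop. Here is a minimal sketch of the pattern (an untested simplification; the identifiers forall_body, data, and out are mine, not part of the benchmark below, and the signature style mirrors the full example):

import numpy as np
from numba import guvectorize, float64, int64

# hypothetical minimal example of the prange-like pattern:
# one core-function call per loop index, scheduled over the available threads
@guvectorize([(float64[:], int64[:], float64[:])], '(n),()->()',
             nopython=True, target='parallel')
def forall_body(data, loop_index, out):
    i = loop_index[0]           # the "outer loop" index for this call
    out[0] = data[i] * data[i]  # stand-in for real per-index work

data = np.random.rand(1000)
out = np.zeros_like(data)
# broadcasting the index array over the scalar core dimension
# runs the 1000 loop bodies in parallel
forall_body(data, np.arange(data.shape[0]), out)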
On my (slow) machine (1st-gen Core i5, 2 cores, 4 hyperthreads) I get the timings:
1 loop, best of 3: 8.19 s per loop
1 loop, best of 3: 8.27 s per loop
1 loop, best of 3: 4.6 s per loop
1 loop, best of 3: 3.46 s per loop
Note: I would be interested whether this recipe readily applies to target="gpu" (it should, but I don't currently have access to a suitable graphics card), and what the speedup would be. Please post! (A hedged, untested sketch of the CUDA variant is given after the example below.)
And here's the example:
import numpy as np
from numba import jit, guvectorize, float64, int64
@jit
def naive_for_loop(some_input_array, another_input_array, result):
    for i in range(result.shape[0]):
        for k in range(some_input_array.shape[0] - i):
            result[i] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))
@guvectorize([(float64[:], float64[:], int64[:], float64[:])], '(n),(n),()->()',
             nopython=True, target='parallel')
def forall_loop_body_parallel(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]  # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))
@guvectorize([(float64[:], float64[:], int64[:], float64[:])], '(n),(n),()->()',
             nopython=True, target='cpu')
def forall_loop_body_cpu(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]  # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))
arg_size = 20000
input_array_1 = np.random.rand(arg_size)
input_array_2 = np.random.rand(arg_size)
result_array = np.zeros_like(input_array_1)
# do single-threaded naive nested for loop
# reset result_array inside %timeit call
%timeit -r 3 result_array[:] = 0.0; naive_for_loop(input_array_1, input_array_2, result_array)
result_1 = result_array.copy()
# do single-threaded forall loop (loop indices in-order)
# reset result_array inside %timeit call
loop_indices = np.arange(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_cpu(input_array_1, input_array_2, loop_indices, result_array)
result_2 = result_array.copy()
# do multi-threaded forall loop (loop indices in-order)
# reset result_array inside %timeit call
loop_indices = np.arange(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices, result_array)
result_3 = result_array.copy()
# do forall loop (loop indices scrambled for better load balancing)
# reset result_array inside %timeit call
loop_indices_scrambled = np.random.permutation(arg_size)
loop_indices_unscrambled = np.argsort(loop_indices_scrambled)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices_scrambled, result_array)
result_4 = result_array[loop_indices_unscrambled].copy()
# check validity
print(np.all(result_1 == result_2))
print(np.all(result_1 == result_3))
print(np.all(result_1 == result_4))
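Regarding the GPU note above: in open-source Numba the GPU target for guvectorize is spelled "cuda" (the old NumbaPro spelling was "gpu"). Below is an untested sketch of what that variant might look like, assuming a CUDA-capable card; np.sin is swapped for math.sin (the CUDA target supports the math module), and the nopython flag is dropped since CUDA compilation is always nopython:

import math
from numba import guvectorize, float64, int64

# untested sketch: same forall pattern as above, compiled for the GPU
@guvectorize([(float64[:], float64[:], int64[:], float64[:])],
             '(n),(n),()->()', target='cuda')
def forall_loop_body_cuda(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]  # just a shorthand
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * math.sin(0.001 * (k+i))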