Here is a simple pyopencl copy_if() example.
First, let's create a large array (2**25 elements) of random integers and select those below a threshold of 500,000:
import pyopencl as cl
import numpy as np
import my_pyopencl_algorithm
import time
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
from pyopencl.clrandom import rand as clrand
# Note: ** is exponentiation; ^ is bitwise XOR in Python (2^25 == 27)
random_gpu = clrand(queue, (2**25,), dtype=np.int32, a=0, b=10**6)
start = time.time()
final_gpu, count_gpu, evt = my_pyopencl_algorithm.copy_if(random_gpu, "ary[i] < 500000", queue = queue)
final = final_gpu.get()
count = int(count_gpu.get())
print('\ncopy_if():\nresults=', final[:count], '\nfound=', count, '\ntime=', (time.time()-start), '\n========\n')
You may have noticed that I am not calling pyopencl's own copy_if, but a fork of it (my_pyopencl_algorithm.copy_if). The fork of pyopencl.algorithm.py can be found here.
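For readers unfamiliar with the primitive, the semantics of that call can be modeled on the host. This is a minimal pure-Python sketch, not pyopencl's actual kernel: the name `copy_if_reference` and the sample data are made up for illustration, and the prefix-sum approach is an assumption about how a scan-based copy_if is typically organized.

```python
from itertools import accumulate


def copy_if_reference(items, predicate):
    """Host-side model of a scan-based copy_if (hypothetical helper)."""
    # 1 where the predicate holds, 0 elsewhere.
    flags = [1 if predicate(x) else 0 for x in items]
    # Inclusive prefix sum: inclusive[i] counts the matches in items[0..i].
    inclusive = list(accumulate(flags))
    count = inclusive[-1] if inclusive else 0
    # The output has the same length as the input, like the `out` array.
    out = [None] * len(items)
    for i, x in enumerate(items):
        if flags[i]:
            # Subtracting the flag gives the exclusive scan value,
            # which is the destination slot of this match.
            out[inclusive[i] - flags[i]] = x
    return out, count


data = [700000, 12, 999999, 499999, 500000, 3]
out, count = copy_if_reference(data, lambda x: x < 500000)
print(out[:count])
```

Only `out[:count]` is meaningful; the remaining slots are never written, which mirrors the uninitialized tail of the device buffer.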
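The beauty of copy_if is that you get a ready-made count of the matching elements, and the matches are packed contiguously from gid = 0 up to gid = count. What does not look optimal is that it allocates and returns (from the GPU) the entire buffer, of which only the first count entries are meaningful. So in my fork of pyopencl.algorithm.py I tried to minimize the size of the returned buffer, and I came up with this: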
def sparse_copy_if(ary, predicate, extra_args=[], preamble="", queue=None, wait_for=None):
    """Copy the elements of *ary* satisfying *predicate* to an output array.

    :arg predicate: a C expression evaluating to a `bool`, represented as a string.
        The value to test is available as `ary[i]`, and if the expression evaluates
        to `true`, then this value ends up in the output.
    :arg extra_args: |scan_extra_args|
    :arg preamble: |preamble|
    :arg wait_for: |explain-waitfor|
    :returns: a tuple *(out, count, event)* where *out* is the output array, *count*
        is an on-device scalar (fetch to host with `count.get()`) indicating
        how many elements satisfied *predicate*, and *event* is a
        :class:`pyopencl.Event` for dependency management. *out* is allocated
        to the same length as *ary*, but only the first *count* entries carry
        meaning.

    .. versionadded:: 2013.1
    """
    if len(ary) > np.iinfo(np.int32).max:
        scan_dtype = np.int64
    else:
        scan_dtype = np.int32

    extra_args_types, extra_args_values = extract_extra_args_types_values(extra_args)

    knl = _copy_if_template.build(ary.context,
            type_aliases=(("scan_t", scan_dtype), ("item_t", ary.dtype)),
            var_values=(("predicate", predicate),),
            more_preamble=preamble, more_arguments=extra_args_types)

    out = cl.array.empty_like(ary)
    count = ary._new_with_changes(data=None, offset=0,
            shape=(), strides=(), dtype=scan_dtype)

    # **dict is a Py2.5 workaround
    evt = knl(ary, out, count, *extra_args_values,
            **dict(queue=queue, wait_for=wait_for))

    # Now I need to copy the first num_results values from out to final_gpu
    # (in which buffer size is minimized)
    prg = cl.Program(ary.context, """
    __kernel void copy_final_results(__global int *final_gpu, __global int *out_gpu)
    {
        __private uint gid;
        gid = get_global_id(0);
        final_gpu [gid] = out_gpu [gid];
    }
    """).build()

    num_results = int(count.get())
    final_gpu = pyopencl.array.zeros(queue, (num_results,), dtype=scan_dtype)
    prg.copy_final_results(queue, (num_results,), None, final_gpu.data, out.data).wait()

    return final_gpu, evt
    #return out, count, evt
That is, I create a final_gpu buffer that is exactly the size of the output, copy the meaningful entries into it, and return it.
If I now run:
start = time.time()
final_gpu, evt = my_pyopencl_algorithm.sparse_copy_if(random_gpu, "ary[i] < 500000", queue = queue)
final = final_gpu.get()
print('\ncopy_if_2():\nresults=', final, '\nfound=', len(final), '\ntime=', (time.time()-start))
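...it appears to be faster by an order of magnitude. The sparser the results, the faster it gets, because the size of the buffer that has to be transferred (a high-latency operation) is minimized.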
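To put rough numbers on the transfer-size argument, here is a back-of-the-envelope sketch. It assumes 4-byte int32 items and counts payload bytes only, ignoring any fixed per-transfer cost; `transfer_bytes` and the selectivity values are hypothetical and not measurements.

```python
def transfer_bytes(n_items, itemsize=4):
    """Payload size in bytes for n_items elements (int32 assumed by default)."""
    return n_items * itemsize


n_total = 2**25
full = transfer_bytes(n_total)  # size of the full-length `out` buffer

for selectivity in (0.5, 0.1, 0.01):
    n_hits = int(n_total * selectivity)   # assumed fraction of matches
    trimmed = transfer_bytes(n_hits)      # size of the trimmed buffer
    print(selectivity, full, trimmed, full // trimmed)
```

Under these assumptions the saving scales inversely with the fraction of matching elements. If the random values are uniform over [0, 10**6), a threshold of 500000 keeps about half of them, so the payload alone would shrink by roughly a factor of two in that case.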
My question is: is there a reason for returning a full-size buffer? In other words, am I introducing any bugs, or should I just submit a patch?