Question

This is a simple pyopencl copy_if() example.

First, let's create a large array (2^25 elements) of random integers and query for the ones below a threshold of 500,000:

import pyopencl as cl
import numpy as np
import my_pyopencl_algorithm
import time

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

from pyopencl.clrandom import rand as clrand
# 2**25 = 33,554,432 elements
random_gpu = clrand(queue, (2**25,), dtype=np.int32, a=0, b=10**6)

start = time.time()
final_gpu, count_gpu, evt = my_pyopencl_algorithm.copy_if(random_gpu, "ary[i] < 500000", queue = queue)
final = final_gpu.get()
count = int(count_gpu.get())
print '\ncopy_if():\nresults=',final[:count], '\nfound=', count, '\ntime=', (time.time()-start), '\n========\n'
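
One detail worth checking in a snippet like the one above is the array length. The text writes it as 2^25, but in Python `^` is bitwise XOR and exponentiation is spelled `**`. A minimal sanity check (the variable names are illustrative only):

```python
# Exponentiation vs. bitwise XOR in Python
n_pow = 2 ** 25   # 33554432 elements, the size the text describes
n_xor = 2 ^ 25    # 0b00010 XOR 0b11001 = 0b11011 = 27

print(n_pow)  # 33554432
print(n_xor)  # 27
```

With the XOR spelling the test array would hold only 27 elements, which would make any timing comparison meaningless.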

You may have noticed that I am not calling pyopencl's copy_if but a fork of it (my_pyopencl_algorithm.copy_if). My fork of pyopencl.algorithm.py can be found here.

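The beauty of copy_if is that you get a ready-made count of the desired output, with the results ordered from gid = 0 to gid = count - 1. What does not look optimal is that it allocates and returns (from the GPU) the entire buffer, of which only the first count entries are meaningful. So in my fork of pyopencl.algorithm.py I tried to optimize the size of the returned buffer, and I got this: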

def sparse_copy_if(ary, predicate, extra_args=[], preamble="", queue=None, wait_for=None):
    """Copy the elements of *ary* satisfying *predicate* to an output array.

    :arg predicate: a C expression evaluating to a `bool`, represented as a string.
        The value to test is available as `ary[i]`, and if the expression evaluates
        to `true`, then this value ends up in the output.
    :arg extra_args: |scan_extra_args|
    :arg preamble: |preamble|
    :arg wait_for: |explain-waitfor|
    :returns: a tuple *(final_gpu, event)* where *final_gpu* is an output array
        holding exactly the elements that satisfied *predicate*, and *event*
        is a :class:`pyopencl.Event` for dependency management.

    .. versionadded:: 2013.1
    """
    if len(ary) > np.iinfo(np.int32).max:
        scan_dtype = np.int64
    else:
        scan_dtype = np.int32

    extra_args_types, extra_args_values = extract_extra_args_types_values(extra_args)

    knl = _copy_if_template.build(ary.context,
            type_aliases=(("scan_t", scan_dtype), ("item_t", ary.dtype)),
            var_values=(("predicate", predicate),),
            more_preamble=preamble, more_arguments=extra_args_types)
    out = cl.array.empty_like(ary)
    count = ary._new_with_changes(data=None, offset=0,
            shape=(), strides=(), dtype=scan_dtype)

    # **dict is a Py2.5 workaround
    evt = knl(ary, out, count, *extra_args_values,
            **dict(queue=queue, wait_for=wait_for))

    # Now copy the first num_results values from out to final_gpu,
    # whose buffer is only as large as the result.
    # Note: this kernel hardcodes int, so it assumes ary.dtype is np.int32.
    prg = cl.Program(ary.context, """
    __kernel void copy_final_results(__global int *final_gpu, __global int *out_gpu)
    {
        __private uint gid;
        gid = get_global_id(0);
        final_gpu[gid] = out_gpu[gid];
    }
    """).build()

    # Fetch the number of matches to the host
    num_results = int(count.get())

    # The output holds items, so it takes the item dtype (not scan_dtype)
    final_gpu = cl.array.zeros(queue, (num_results,), dtype=ary.dtype)

    prg.copy_final_results(queue, (num_results,), None, final_gpu.data, out.data).wait()

    return final_gpu, evt
    #return out, count, evt

That is, I create a final_gpu buffer of exactly the same size as the output, copy the meaningful entries into it, and return it.
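
To make the difference between the two return conventions concrete, here is a host-side model in plain Python. It is only a sketch: it assumes a scan-based compaction (as the `scan_t` alias and `_copy_if_template` in the code above suggest), and the names `copy_if_model` and `sparse_copy_if_model` are hypothetical helpers, not part of pyopencl.

```python
from itertools import accumulate

def copy_if_model(ary, predicate):
    """Host-side model: returns (out, count) with len(out) == len(ary)."""
    flags = [1 if predicate(x) else 0 for x in ary]
    # Inclusive prefix sum of the flags; subtracting 1 gives the
    # exclusive scan value, i.e. the output slot of each kept element.
    inclusive = list(accumulate(flags))
    out = [None] * len(ary)          # full-size buffer; the tail stays unused
    for x, flag, incl in zip(ary, flags, inclusive):
        if flag:
            out[incl - 1] = x
    count = inclusive[-1] if inclusive else 0
    return out, count

def sparse_copy_if_model(ary, predicate):
    """Same result, trimmed to exactly `count` entries."""
    out, count = copy_if_model(ary, predicate)
    return out[:count]

data = [7, 900000, 123456, 650000, 42, 500000]
out, count = copy_if_model(data, lambda x: x < 500000)
sparse = sparse_copy_if_model(data, lambda x: x < 500000)
print(out[:count], count)
print(sparse)
```

In this model the trimmed result equals `out[:count]`, and the trimmed buffer can only be sized after `count` is known.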

If I now run:

start = time.time()
final_gpu, evt = my_pyopencl_algorithm.sparse_copy_if(random_gpu, "ary[i] < 500000", queue = queue)
final = final_gpu.get()
print '\ncopy_if_2():\nresults=',final, '\nfound=', len(final), '\ntime=', (time.time()-start)

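...this seems to improve the speed by orders of magnitude. The sparser the result, the faster it gets, because the size of the buffer to be transferred (which has high latency) is minimized.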
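
The size argument above can be put into numbers with a little arithmetic. This is a rough sketch only: `transfer_bytes` is a hypothetical helper, the match fractions are illustrative assumptions, and it counts bytes moved from device to host without modeling latency or the extra device-side copy.

```python
def transfer_bytes(n_elements, itemsize=4):
    """Bytes moved by a device-to-host copy of n_elements items (int32 by default)."""
    return n_elements * itemsize

n = 2 ** 25                       # input length from the example
full = transfer_bytes(n)          # full-size buffer, as returned by copy_if

# Hypothetical match fractions; 0.5 is roughly what a 500,000 threshold
# would give for uniformly distributed values in [0, 10**6).
for selectivity in (0.5, 0.01, 0.0001):
    matched = int(n * selectivity)
    compact = transfer_bytes(matched)
    print(selectivity, full, compact, full / compact)
```

Under these assumptions the transfer saving is proportional to how sparse the result is: about a factor of two when half of the elements match, and much larger for rarer matches.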

My question is: is there any reason to return a full-size buffer? In other words, am I introducing any bugs, or should I submit a patch?

pyopencl copy_if(): is it possible to minimize the size of the returned buffer?

0 answers: