Question

Here is my working code for reference:

import numpy
import pyopencl as cl
import pyopencl.array  # makes cl.array.vec.float4 available

vector = numpy.array([1, 2, 4, 8], numpy.float32) #cl.array.vec.float4
matrix = numpy.zeros((1, 4), cl.array.vec.float4)
matrix[0, 0] = (1, 2, 4, 8)
matrix[0, 1] = (16, 32, 64, 128)
matrix[0, 2] = (3, 6, 9, 12)
matrix[0, 3] = (5, 10, 15, 25)
# vector[0] = (1, 2, 4, 8)


platform=cl.get_platforms() # all OpenCL platforms on this machine
device=platform[0].get_devices(device_type=cl.device_type.GPU) # all GPUs on the first platform
context=cl.Context(devices=[device[0]]) # context for the first GPU only; context.num_devices gives the number of devices in this context
print("everything good so far")
program=cl.Program(context,"""
__kernel void matrix_dot_vector(__global const float4 * matrix,__global const float4 *vector,__global float *result)
{
int gid = get_global_id(0);

result[gid]=dot(matrix[gid],vector[0]);
}

""" ).build()
queue=cl.CommandQueue(context)
# queue=cl.CommandQueue(context,cl_device_id device) #Context specific to a device if we plan on using multiple GPUs for parallel processing

mem_flags = cl.mem_flags
matrix_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=matrix)
vector_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=vector)
matrix_dot_vector = numpy.zeros(4, numpy.float32)
global_size_of_GPU= 0
destination_buf = cl.Buffer(context, mem_flags.WRITE_ONLY, matrix_dot_vector.nbytes)
# threads_size_buf = cl.Buffer(context, mem_flags.WRITE_ONLY, global_size_of_GPU.nbytes)
program.matrix_dot_vector(queue, matrix_dot_vector.shape, None, matrix_buf, vector_buf, destination_buf)

## Step #11. Move the kernel’s output data to host memory.
cl.enqueue_copy(queue, matrix_dot_vector, destination_buf)
# cl.enqueue_copy(queue, global_size_of_GPU, threads_size_buf)
print(matrix_dot_vector)
# print(global_size_of_GPU)

# COPY SAME ARRAY FROM GPU AGAIN
cl.enqueue_copy(queue, matrix_dot_vector, destination_buf)
print(matrix_dot_vector)
print('copied same array twice')

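1. How do I free the memory held by matrix_buf and destination_buf on the GPU? One is read-only and the other is write-only.
2. How do I load a different matrix array into the same matrix_buf instead of having to create a new buffer in pyopencl? I have read that loading new data into an existing buffer is much faster than recreating a buffer of the same size every time.
3. If I load a new array into the old buffer, can it be smaller than the old array in that buffer, or must the new array be exactly the same size as the buffer?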

Answer 1

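Re 1. I believe the buffer is freed when the variable holding it goes out of scope, or you can call release() explicitly. Whether the buffer is read-only or write-only makes no difference here.
Re 2. Try pyopencl.enqueue_map_buffer(), which returns an array that gives the host side access to the buffer's contents so they can be modified. More here.
Re 3. Reusing an existing buffer and using only part of it is fine. On the kernel side you control which part is accessed.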
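
As a rough illustration of Re 1 and Re 2, the sketch below writes new data into an existing buffer through pyopencl.enqueue_map_buffer() and then frees buffers explicitly. The helper names overwrite_via_map and release_all are hypothetical, and queue, buf and new_data are assumed to be an existing command queue, buffer and NumPy array. Treat it as a sketch under those assumptions, not as the canonical way to do this.

```python
def overwrite_via_map(queue, buf, new_data):
    """Overwrite the start of buf with new_data via a host-side mapping.

    Hypothetical helper: new_data is assumed to be a NumPy array that is
    no larger than the buffer.
    """
    # Imported here so this module can be loaded without an OpenCL runtime.
    import pyopencl as cl

    # Map only as many elements as new_data holds, starting at byte offset 0.
    # enqueue_map_buffer returns (ndarray, event) and blocks by default.
    mapped, _event = cl.enqueue_map_buffer(
        queue, buf, cl.map_flags.WRITE, 0, new_data.shape, new_data.dtype
    )
    try:
        mapped[...] = new_data  # host-side write into the mapped region
    finally:
        # The array's .base is the MemoryMap; releasing it unmaps the region.
        mapped.base.release(queue)


def release_all(*buffers):
    """Free device memory now instead of waiting for garbage collection."""
    for buf in buffers:
        buf.release()
```

With the question's code this might be used as overwrite_via_map(queue, matrix_buf, matrix_2), where matrix_2 is a second array of the same dtype, followed by release_all(matrix_buf, vector_buf, destination_buf) once the results have been copied back to the host.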

Answer 2

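matrix_buf.release() and destination_buf.release() - these free the memory allocated on the GPU for the respective buffers. It is best to release a buffer once it is no longer needed, to avoid running into out-of-memory errors. When the function that uses the GPU exits, pyopencl clears all of its GPU memory automatically. - {by doqtor}
cl.enqueue_copy(queue, matrix_buf, matrix_2) - loads the new matrix_2 array into matrix_buf without having to create a new matrix buffer.
An existing buffer can be reused with only part of it in use. On the kernel side we control which part we access. - {by doqtor}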
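
The reuse described above could look roughly like the following. The helpers fits and reload_buffer are hypothetical names. The sketch assumes, as the PyOpenCL documentation describes for host-to-device transfers, that cl.enqueue_copy takes the transfer size from the host array, so a smaller array only overwrites the start of the buffer.

```python
def fits(array_nbytes, buffer_nbytes):
    """Return True if an array of array_nbytes bytes fits in the buffer."""
    return 0 <= array_nbytes <= buffer_nbytes


def reload_buffer(queue, buf, host_array):
    """Copy host_array into the start of an existing buffer (hypothetical helper)."""
    # Imported here so this module can be loaded without an OpenCL runtime.
    import pyopencl as cl

    if not fits(host_array.nbytes, buf.size):
        raise ValueError(
            f"array needs {host_array.nbytes} bytes, buffer holds {buf.size}"
        )
    # The transfer size comes from host_array, so any bytes beyond it keep
    # whatever the buffer held before.
    return cl.enqueue_copy(queue, buf, host_array)
```

After loading a smaller matrix this way, the kernel would need to be launched with a global size that matches the new number of rows, for example program.matrix_dot_vector(queue, (n_rows,), None, matrix_buf, vector_buf, destination_buf) with n_rows as an assumed variable, so that stale elements past the new data are never read.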

How to release GPU memory and use the same buffer for different arrays in PyOpenCL?
