I can't seem to achieve concurrency with the following code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.tools import DeviceMemoryPool as DMP
from pycuda.compiler import SourceModule
data_DevMemPool = DMP()
# SourceModule returns a module; the kernel must be fetched by name.
# (Kernel name assumed here to match the function in some_long_running_kernel_SRC.)
mod = SourceModule(some_long_running_kernel_SRC, no_extern_c=True)
some_long_running_kernel = mod.get_function("some_long_running_kernel")

streams = []
for k in range(10):
    streams.append(drv.Stream())

numpy_data = np.zeros((2048, 4000), dtype=np.float32)

# Why won't this parallelize?:
for i in range(10):
    gpu_data = gpuarray.to_gpu_async(numpy_data,
                                     allocator=data_DevMemPool.allocate,
                                     stream=streams[i])
    some_long_running_kernel(
        gpu_data,
        block=(1024, 1, 1), grid=(2, 1, 1), stream=streams[i])
Running the following afterwards:
data_DevMemPool.held_blocks
data_DevMemPool.active_blocks
returns 1 and 1 respectively, which indicates the device memory pool never holds more than one block at a time. If the ten iterations were actually running concurrently, the pool would have to grow beyond that, since each in-flight stream would need its own allocation. This happens even though both GPU operations (the gpuarray.to_gpu_async() transfer and the some_long_running_kernel() launch) are given a stream.