For learning purposes I wrote a small C Python module that is supposed to perform an IPC cuda memcopy to transfer data between processes. For testing, I wrote essentially the same program twice: one version uses theano's CudaNdarray, the other uses pycuda. The problem is that even though the two test programs are nearly identical, the pycuda version works while the theano version does not. It doesn't crash: it just produces incorrect results.
Below is the relevant function from the C module. Here is what it does: every process has two buffers, a source and a dest. Calling _sillycopy(source, dest, n) copies n elements from each process's source buffer into the neighboring process's dest array. So, if I have two processes 0 and 1, process 0 will end up with process 1's source buffer and process 1 will end up with process 0's source buffer.

Note that to transfer the cudaIpcMemHandle_t values between processes I use MPI (this is a small part of a larger project that uses MPI). _sillycopy is called by another function, "sillycopy", which is exposed to Python through the standard Python C API methods (a sketch of what that wrapper might look like follows the function below).
void _sillycopy(float *source, float* dest, int n, MPI_Comm comm) {
  int localRank;
  int localSize;
  MPI_Comm_rank(comm, &localRank);
  MPI_Comm_size(comm, &localSize);

  // Figure out which process is to the "left".
  // m() performs a mod and treats negative numbers
  // appropriately
  int neighbor = m(localRank - 1, localSize);

  // Create a memory handle for *source and do a
  // wasteful Allgather to distribute to other processes
  // (could just use an MPI_Sendrecv, but irrelevant right now)
  cudaIpcMemHandle_t *memHandles = new cudaIpcMemHandle_t[localSize];
  cudaIpcGetMemHandle(memHandles + localRank, source);
  MPI_Allgather(
    memHandles + localRank, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
    memHandles, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
    comm);

  // Open the neighbor's mem handle so we can do a cudaMemcpy
  float *sourcePtr;
  cudaIpcOpenMemHandle((void**)&sourcePtr, memHandles[neighbor], cudaIpcMemLazyEnablePeerAccess);

  // Copy!
  cudaMemcpy(dest, sourcePtr, n * sizeof(float), cudaMemcpyDefault);

  cudaIpcCloseMemHandle(sourcePtr);
  delete [] memHandles;
}
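For completeness, the m() helper mentioned in the comments is just a modulo that maps negative values back into [0, localSize); its definition isn't shown above, but a minimal version might look like this:

// Hypothetical definition of m(), not shown above: a wrap-around modulo,
// so that m(-1, localSize) returns localSize - 1.
static int m(int a, int b) {
  return ((a % b) + b) % b;
}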
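The "sillycopy" wrapper itself is not shown; it is just a thin Python C API binding around _sillycopy. A minimal sketch of what such a binding might look like is below (the argument format, the casts, and the use of MPI_COMM_WORLD are assumptions for illustration, not the actual code):

#include <Python.h>
#include <mpi.h>
#include <stdint.h>

// Forward declaration of the function shown above.
void _sillycopy(float *source, float *dest, int n, MPI_Comm comm);

// Hypothetical wrapper: device pointers arrive from Python as plain
// integers, which is exactly what int(a_gpu) and a_gpu.gpudata provide.
static PyObject* sillycopy(PyObject* self, PyObject* args) {
  unsigned long long source, dest;
  int n;
  if (!PyArg_ParseTuple(args, "KKi", &source, &dest, &n))
    return NULL;
  // The communicator is assumed here; the real module may pass a different one.
  _sillycopy((float*)(uintptr_t)source, (float*)(uintptr_t)dest, n, MPI_COMM_WORLD);
  Py_RETURN_NONE;
}

static PyMethodDef SillyMethods[] = {
  {"sillycopy", sillycopy, METH_VARARGS, "IPC copy between neighboring ranks."},
  {NULL, NULL, 0, NULL}
};

In this sketch the wrapper does nothing with CUDA itself; it only converts the Python integers to float* and forwards them to _sillycopy.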
Now here is the pycuda example. For reference, calling int() on a_gpu and b_gpu returns the address of the underlying buffer in device memory as a Python integer.
import sillymodule # sillycopy lives in here
import simplempi as mpi
import pycuda.driver as drv
import numpy as np
import atexit
import time
mpi.init()
drv.init()
# Make sure each process uses a different GPU
dev = drv.Device(mpi.rank())
ctx = dev.make_context()
atexit.register(ctx.pop)
shape = (2**26,)
# allocate host memory
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)
# allocate device memory
a_gpu = drv.mem_alloc(a.nbytes)
b_gpu = drv.mem_alloc(b.nbytes)
# copy host to device
drv.memcpy_htod(a_gpu, a)
drv.memcpy_htod(b_gpu, b)
# A few more host buffers
a_p = np.zeros(shape, np.float32)
b_p = np.zeros(shape, np.float32)
# Sanity check: this should fill a_p with 1's
drv.memcpy_dtoh(a_p, a_gpu)
# Verify that it did
print(a_p[0:10])
sillymodule.sillycopy(
    int(a_gpu),
    int(b_gpu),
    shape[0])
# After this, b_p should be all 1's
drv.memcpy_dtoh(b_p, b_gpu)
print(b_p[0:10])
Now the theano version of the code above. Instead of using int() to get the buffers' address, CudaNdarray objects expose it through the gpudata attribute.
import os
import simplempi as mpi
mpi.init()
# selects one gpu per process
os.environ['THEANO_FLAGS'] = "device=gpu{}".format(mpi.rank())
import theano.sandbox.cuda as cuda
import time
import numpy as np
import sillymodule
shape = (2 ** 24, )
# Allocate host data
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)
# Allocate device data
a_gpu = cuda.CudaNdarray.zeros(shape)
b_gpu = cuda.CudaNdarray.zeros(shape)
# Copy from host to device
a_gpu[:] = a[:]
b_gpu[:] = b[:]
# Should print 1's as a sanity check
print(np.asarray(a_gpu[0:10]))
sillymodule.sillycopy(
    a_gpu.gpudata,
    b_gpu.gpudata,
    shape[0])
# Should print 1's
print(np.asarray(b_gpu[0:10]))
Again: the pycuda code works perfectly, while the theano version runs but gives the wrong result. To be precise, at the end of the theano code b_gpu is filled with garbage: neither 1's nor 0's, just random numbers, as if it had copied from the wrong place in memory.
My original theory about why this fails had to do with CUDA contexts. I wondered whether theano might be doing something with them that meant the cuda calls made inside sillycopy were running under a different CUDA context than the one used to create the gpu arrays. I don't think this is the case because:
A secondary thought was whether this has to do with theano spawning several threads, even when using the cuda backend, which can be verified by running "ps huH p". I don't see how threads would affect anything, but I have run out of obvious things to consider.
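One way to test the context theory directly would be to print the active device and driver context from inside _sillycopy and compare the two runs; a minimal debug helper along these lines (not part of the module, just a diagnostic sketch) could be:

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>

// Hypothetical debug helper: report which device and which driver-API
// context are current on the calling thread.
static void reportCudaState(int localRank) {
  int device = -1;
  cudaGetDevice(&device);     // runtime API: currently selected device
  CUcontext current = NULL;
  cuCtxGetCurrent(&current);  // driver API: context bound to this thread
  printf("rank %d: device=%d, context=%p\n", localRank, device, (void*)current);
}

Calling this at the top of _sillycopy in both the pycuda and theano runs would show whether the copy really executes under a different context or device than the one the arrays were created on.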
Any thoughts on this would be greatly appreciated!

For reference: the processes are launched in the usual OpenMPI way:
mpirun --np 2 python test_pycuda.py