I am trying to create a new TensorFlow GPU op following the instructions on the website.
Looking at their example, it seems they feed the C++ pointers directly to the CUDA kernel, without allocating device memory and copying the contents of the host pointer to a device pointer.
From my understanding of CUDA, you always have to allocate memory on the device and then use the device pointers inside the kernel.
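To make what I mean concrete, here is the workflow I have in mind (a minimal sketch; DoubleKernel, RunOnGpu, and the launch sizes are my own placeholders, not part of the TensorFlow example):

#include <cuda_runtime.h>

// Placeholder kernel: doubles every element, guarded against overrun.
__global__ void DoubleKernel(const int size, const float* in, float* out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < size) out[i] = 2.0f * in[i];
}

// The classic explicit-copy workflow: allocate device buffers, copy the host
// input over, launch on device pointers only, copy the result back, free.
void RunOnGpu(const float* host_in, float* host_out, int size) {
  float* dev_in = nullptr;
  float* dev_out = nullptr;
  cudaMalloc(&dev_in, size * sizeof(float));
  cudaMalloc(&dev_out, size * sizeof(float));
  cudaMemcpy(dev_in, host_in, size * sizeof(float), cudaMemcpyHostToDevice);
  DoubleKernel<<<(size + 255) / 256, 256>>>(size, dev_in, dev_out);
  cudaMemcpy(host_out, dev_out, size * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(dev_in);
  cudaFree(dev_out);
}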
What am I missing? I checked, and input_tensor.flat<T>().data() should return a regular C++ pointer. Here is a copy of the code I am referring to:
// kernel_example.cu.cc
#ifdef GOOGLE_CUDA
#define EIGEN_USE_GPU
#include "example.h"
#include "tensorflow/core/util/cuda_kernel_helper.h"
using namespace tensorflow;
using GPUDevice = Eigen::GpuDevice;
// Define the CUDA kernel.
template <typename T>
__global__ void ExampleCudaKernel(const int size, const T* in, T* out) {
  // Grid-stride loop: each thread processes every (blockDim.x * gridDim.x)-th
  // element, so any grid size covers the whole input.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
       i += blockDim.x * gridDim.x) {
    out[i] = 2 * ldg(in + i);  // ldg() loads through the read-only cache
  }
}
// Define the GPU implementation that launches the CUDA kernel.
template <typename T>
void ExampleFunctor<GPUDevice, T>::operator()(
    const GPUDevice& d, int size, const T* in, T* out) {
  // Launch the cuda kernel.
  //
  // See core/util/cuda_kernel_helper.h for example of computing
  // block count and thread_per_block count.
  int block_count = 1024;
  int thread_per_block = 20;
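  // Alternative (a sketch, assuming the TF 1.x helper API declared in the
  // cuda_kernel_helper.h header included above): derive the launch
  // configuration from the input size instead of hard-coding it:
  //   CudaLaunchConfig cfg = GetCudaLaunchConfig(size, d);
  //   block_count = cfg.block_count;
  //   thread_per_block = cfg.thread_per_block;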
  ExampleCudaKernel<T>
      <<<block_count, thread_per_block, 0, d.stream()>>>(size, in, out);
}
// Explicitly instantiate functors for the types of OpKernels registered.
template struct ExampleFunctor<GPUDevice, float>;
template struct ExampleFunctor<GPUDevice, int32>;
#endif // GOOGLE_CUDA
Answer 0 (score: 2)
When you look at this code at https://www.tensorflow.org/extend/adding_an_op, you will see that the allocation is done in kernel_example.cc:
void Compute(OpKernelContext* context) override {
  // Grab the input tensor
  const Tensor& input_tensor = context->input(0);

  // Create an output tensor
  Tensor* output_tensor = NULL;
  OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
                                                   &output_tensor));

  // Do the computation.
  OP_REQUIRES(context, input_tensor.NumElements() <= tensorflow::kint32max,
              errors::InvalidArgument("Too many elements in tensor"));
  ExampleFunctor<Device, T>()(
      context->eigen_device<Device>(),
      static_cast<int>(input_tensor.NumElements()),
      input_tensor.flat<T>().data(),
      output_tensor->flat<T>().data());
}
In context->allocate_output(...), they hand over a reference to the output Tensor, which is then allocated. The context knows whether it is running on the GPU or the CPU and allocates the tensor on the host or the device accordingly. The pointer that is handed over to CUDA simply points to the actual data inside the Tensor class.
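If you want to convince yourself of this, you can query where a pointer lives with cudaPointerGetAttributes (a sketch assuming CUDA 10 or newer, where the field is attr.type; older runtimes expose attr.memoryType instead):

#include <cuda_runtime.h>

// Returns true if ptr refers to device memory. For an op registered for
// DEVICE_GPU, the pointer from tensor.flat<T>().data() should pass this check.
bool IsDevicePointer(const void* ptr) {
  cudaPointerAttributes attr;
  if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) return false;
  return attr.type == cudaMemoryTypeDevice;
}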