Question

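I have implemented a rather complex new Op in TensorFlow with a GPU CUDA kernel. The Op needs a lot of dynamically allocated memory for variables that are not tensors and that are freed once the op is done. More specifically, it uses a hash table.

Right now I am using cudaMalloc() and cudaFree(), but I noticed that TensorFlow has its own type, Eigen::GPUDevice, which can allocate and free memory on the GPU.

My questions:

1. Is it best practice to use Eigen::GPUDevice to manage GPU memory?
2. By using Eigen::GPUDevice instead of the CUDA API, do I "automatically" enable multi-GPU support, since different GPUDevices can be passed to the Op?
3. Should I extend this idea to the CPU kernel and check whether there is a CPUDevice type that also manages memory, instead of using plain C++ (i.e. auto var = new int[100]; delete[] var)?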

Answer 1

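There is no direct public guideline for this. I usually just let TensorFlow allocate the memory through the OpKernelContext.

Whatever needs memory should be allocated through the TensorFlow context, not with custom cudaMalloc or new type[num] calls.
The context supplies the appropriate Allocator for the device; see below.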

For simplicity, consider just adding two matrices (full example). A TensorFlow operation usually consists of the following parts:

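Op description: declared with REGISTER_OP, responsible for shape checking and for setting the output shape (example).

OpKernel: responsible for allocating memory, getting pointers to the inputs, and other setup (see this). A typical Compute body looks like:

    Tensor* output = nullptr;
    Tensor tmp_var;
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, output_shape, &output));
    // allocate_temp takes a DataType and fills in a Tensor object (not a Tensor**)
    OP_REQUIRES_OK(ctx, ctx->allocate_temp(DataTypeToEnum<Dtype>::value, some_shape, &tmp_var));
    // the functor does not need to care about memory allocation, everything is already set up at this point
    ::tensorflow::functor::MyFunctor<Device, Dtype>()(ctx, inputA, inputB, &tmp_var, output);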

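Functor: the implementation itself, for example

    // gpu version
    template <typename Dtype>
    struct MyFunctor<GPUDevice, Dtype> {
      void operator()(::tensorflow::OpKernelContext* ctx, ...) {
        // launch the CUDA kernel on the buffers passed in
      }
    };

    // cpu version
    template <typename Dtype>
    struct MyFunctor<CPUDevice, Dtype> {
      void operator()(::tensorflow::OpKernelContext* ctx, ...) {
        // plain C++ implementation on the buffers passed in
      }
    };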

All that is left for you is to implement the functors that Compute calls.
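To make the split between kernel and functor concrete, here is a minimal standalone sketch of the same dispatch pattern. It does not use TensorFlow at all: CPUDevice and GPUDevice are empty stand-in tags, AddFunctor and RunAdd are hypothetical names, and the "GPU" branch runs on the host only so that the sketch compiles anywhere. In a real op that branch would launch a CUDA kernel.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in device tags (assumption: TensorFlow uses Eigen device types instead).
struct CPUDevice {};
struct GPUDevice {};

// Primary template is only declared, so an unsupported device fails at compile time.
template <typename Device, typename Dtype>
struct AddFunctor;

// CPU specialization: a plain loop over the buffers it is handed.
template <typename Dtype>
struct AddFunctor<CPUDevice, Dtype> {
  std::string operator()(const Dtype* a, const Dtype* b, Dtype* out,
                         std::size_t n) const {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
    return "cpu";
  }
};

// "GPU" specialization: a real one would launch a CUDA kernel here.
// It runs on the host in this sketch so the example stays runnable.
template <typename Dtype>
struct AddFunctor<GPUDevice, Dtype> {
  std::string operator()(const Dtype* a, const Dtype* b, Dtype* out,
                         std::size_t n) const {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
    return "gpu";
  }
};

// Kernel-like wrapper: owns the buffers, then delegates to the functor.
// The functor itself never allocates anything.
template <typename Device, typename Dtype>
std::string RunAdd(const std::vector<Dtype>& a, const std::vector<Dtype>& b,
                   std::vector<Dtype>* out) {
  out->resize(a.size());
  return AddFunctor<Device, Dtype>()(a.data(), b.data(), out->data(), a.size());
}
```

Calling RunAdd<CPUDevice, float>(a, b, &out) and RunAdd<GPUDevice, float>(a, b, &out) picks the matching specialization at compile time, which is the same mechanism the Device template parameter of the OpKernel relies on.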

Edit

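allocate_persistent: use this if you need data to survive between Op invocations, such as one-time index structures. [example]

allocate_temp: just temporary memory, which is not retained once the Compute method finishes. [example]

But I strongly recommend reading the comments in the source code (here) and then deciding based on your use case.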
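Since the question is about a hash table, one way to fit it into tensor-style allocation is to lay the table out in a single flat buffer that is sized up front. The following is only a rough host-side sketch under several assumptions: FlatHashTable, InitTable, Insert and Find are hypothetical names, keys are non-negative 32-bit integers, -1 marks an empty slot, and the code is single-threaded. A CUDA version would need atomic operations for concurrent inserts, and the buffer would come from the context (for example an integer temp tensor) instead of a std::vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel for an unused slot (assumption: real keys are never negative).
constexpr std::int32_t kEmpty = -1;

// View over a caller-provided buffer laid out as [key0, val0, key1, val1, ...].
struct FlatHashTable {
  std::int32_t* slots;
  std::size_t capacity;  // number of (key, value) slots
};

// Marks every slot empty. The buffer must hold 2 * capacity int32 values.
inline FlatHashTable InitTable(std::int32_t* buffer, std::size_t capacity) {
  for (std::size_t i = 0; i < capacity; ++i) buffer[2 * i] = kEmpty;
  return FlatHashTable{buffer, capacity};
}

// Open addressing with linear probing. Returns false when the table is full.
inline bool Insert(FlatHashTable& t, std::int32_t key, std::int32_t value) {
  const std::size_t start = static_cast<std::size_t>(key) % t.capacity;
  for (std::size_t probe = 0; probe < t.capacity; ++probe) {
    const std::size_t i = (start + probe) % t.capacity;
    if (t.slots[2 * i] == kEmpty || t.slots[2 * i] == key) {
      t.slots[2 * i] = key;
      t.slots[2 * i + 1] = value;
      return true;
    }
  }
  return false;
}

// Follows the same probe sequence; stops early at the first empty slot.
inline bool Find(const FlatHashTable& t, std::int32_t key, std::int32_t* value) {
  const std::size_t start = static_cast<std::size_t>(key) % t.capacity;
  for (std::size_t probe = 0; probe < t.capacity; ++probe) {
    const std::size_t i = (start + probe) % t.capacity;
    if (t.slots[2 * i] == key) {
      *value = t.slots[2 * i + 1];
      return true;
    }
    if (t.slots[2 * i] == kEmpty) return false;
  }
  return false;
}
```

The point of the layout is that the table never allocates on its own: its capacity is fixed by whoever hands it the buffer, so the memory can be owned and released by the framework.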

Answer 2

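The best practice is to use the OpKernelContext::allocate_persistent() method to allocate memory, in the form of a tensorflow::Tensor, that outlives a single call to OpKernel::Compute(). It uses the appropriate Allocator* for the device: if the kernel runs on a GPU device, it allocates GPU memory for that particular device, and if it runs on a CPU device, it allocates CPU memory.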
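As a toy model of why going through the context also answers the multi-GPU question, consider the sketch below. None of these classes are TensorFlow's: ToyAllocator, HostAllocator, FakeDeviceAllocator, ToyContext, ScratchBuffer and KernelBody are hypothetical names, and the fake device allocator uses malloc where a real one would call into the GPU runtime. The only idea it illustrates is that the kernel body asks the context for memory and never names a device itself.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical minimal allocator interface (not TensorFlow's Allocator class).
class ToyAllocator {
 public:
  virtual ~ToyAllocator() = default;
  virtual std::string Name() const = 0;
  virtual void* Allocate(std::size_t num_bytes) = 0;
  virtual void Deallocate(void* ptr) = 0;
};

class HostAllocator : public ToyAllocator {
 public:
  std::string Name() const override { return "cpu"; }
  void* Allocate(std::size_t num_bytes) override { return std::malloc(num_bytes); }
  void Deallocate(void* ptr) override { std::free(ptr); }
};

// Stands in for one allocator per GPU. A real one would allocate device
// memory; malloc is used here only so the sketch runs without a GPU.
class FakeDeviceAllocator : public ToyAllocator {
 public:
  explicit FakeDeviceAllocator(int device_id) : device_id_(device_id) {}
  std::string Name() const override { return "gpu:" + std::to_string(device_id_); }
  void* Allocate(std::size_t num_bytes) override {
    ++live_;
    return std::malloc(num_bytes);
  }
  void Deallocate(void* ptr) override {
    --live_;
    std::free(ptr);
  }
  int live() const { return live_; }

 private:
  int device_id_;
  int live_ = 0;  // number of allocations not yet released
};

// Toy context: carries the allocator of the device the kernel was placed on.
class ToyContext {
 public:
  explicit ToyContext(ToyAllocator* allocator) : allocator_(allocator) {}
  ToyAllocator* allocator() const { return allocator_; }

 private:
  ToyAllocator* allocator_;
};

// RAII scratch memory: released when the kernel body returns.
class ScratchBuffer {
 public:
  ScratchBuffer(ToyContext* ctx, std::size_t num_bytes)
      : allocator_(ctx->allocator()), ptr_(allocator_->Allocate(num_bytes)) {}
  ~ScratchBuffer() { allocator_->Deallocate(ptr_); }
  ScratchBuffer(const ScratchBuffer&) = delete;
  ScratchBuffer& operator=(const ScratchBuffer&) = delete;
  void* data() const { return ptr_; }

 private:
  ToyAllocator* allocator_;
  void* ptr_;
};

// Written once; which device it allocates on is decided by the context.
inline std::string KernelBody(ToyContext* ctx) {
  ScratchBuffer scratch(ctx, 256);
  return ctx->allocator()->Name();
}
```

Running KernelBody with contexts built from different allocators allocates on different "devices" without any change to the kernel body, which is the property that hard-coded cudaMalloc calls do not give you.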

TensorFlow new Op CUDA kernel memory management

2 answers: