Question

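I am converting a C++11 program that computes the contact forces between particle pairs into a CUDA program. All particle pairs are independent of each other. I use a functor to compute the contact force. The functor performs many calculations and contains many member variables, so I am trying to reuse functors instead of creating a new one for every particle pair.

Because the functor contains virtual functions, it is cloned on the device rather than on the host.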

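I am thinking of a scheme like this:

1) Clone M functors

2) Start computing M particle pairs

3) Particle pair M + 1 waits until one particle pair has finished and then reuses its functor

Other ideas are welcome, too.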

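I have made a heavily simplified version of the program. In this toy program the variable F does not have to be a member variable, but in the real program it does. The real program also has much more member data and many more particle pairs (N). N is typically a few million.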

#include <stdio.h>

#define TPB 4 // realistic value = 128
#define N 10  // realistic value = 5000000
#define M 5   // trade-off between copy time and parallel gain.
              // Realistic value somewhere around 1000 maybe

#define OPTION 1
// option 1: Make one functor per particle pair => works, but creates too many functor clones
// option 2: Only make one functor clone => no more thread independent member variables
// option 3: Make M clones which get reused => my suggestion, but I don't know how to program it

struct FtorBase
{
  __device__ virtual void execute(long i) = 0;

  __device__ virtual void show() = 0;
};

struct FtorA : public FtorBase
{

  __device__ void execute(long i) final
  {
    F = a*i;
  }

  __device__ void show() final
  {
    printf("F = %f\n", F);
  }

  double a;
  double F;
};

template <class T>
__global__ void cloneFtor(FtorBase** d_ftorBase, T ftor, long n_ftorClones)
{
  const long i = threadIdx.x + blockIdx.x * blockDim.x;

  if (i >= n_ftorClones) {
    return;
  }

  d_ftorBase[i] = new T(ftor);
}

struct ClassA
{
  typedef FtorA ftor_t;

  FtorBase** getFtor()
  {
    FtorBase** d_cmFtorBase;
    cudaMalloc(&d_cmFtorBase, N * sizeof(FtorBase*));

#if OPTION == 1 
    // option 1: Create one copy of the functor per particle pair
    printf("using option 1\n");
    cloneFtor<<<(N + TPB - 1) / TPB, TPB>>>(d_cmFtorBase, ftor_, N);
#elif OPTION == 2
    // option 2: Create just one copy of the functor
    printf("using option 2\n");
    cloneFtor<<<1, 1>>>(d_cmFtorBase, ftor_, 1);
#elif OPTION == 3
    // option 3: Create M functor clones
    printf("using option 3\n");
    printf("This option is not implemented. I don't know how to do this.\n");
    cloneFtor<<<(M + TPB - 1) / TPB, TPB>>>(d_cmFtorBase, ftor_, M);
#endif
    cudaDeviceSynchronize();

    return d_cmFtorBase;
  }

  ftor_t ftor_;
};


__global__ void cudaExecuteFtor(FtorBase** ftorBase)
{
  const long i = threadIdx.x + blockIdx.x * blockDim.x;

  if (i >= N) {
    return;
  }

#if OPTION == 1
  // option 1: One functor per particle was created
  ftorBase[i]->execute(i);
  ftorBase[i]->show();
#elif OPTION == 2
  // option 2: Only one single functor was created
  ftorBase[0]->execute(i);
  ftorBase[0]->show();
#elif OPTION == 3
  // option 3: Reuse the functors
  // I don't know how to do this
#endif
}

int main()
{
  ClassA* classA = new ClassA();
  classA->ftor_.a = .1;

  FtorBase** ftorBase = classA->getFtor();

  cudaExecuteFtor<<<(N + TPB - 1) / TPB, TPB>>>(ftorBase);
  cudaDeviceSynchronize();

  return 0;
}

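I inspect the printed values of F to check whether the member variable is independent in each call. As expected, when a separate functor is used for each particle pair (option 1), all F values are different; when a single functor is used for the whole program (option 2), all F values are the same.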

using option 1
F = 0.800000
F = 0.900000
F = 0.000000
F = 0.100000
F = 0.200000
F = 0.300000
F = 0.400000
F = 0.500000
F = 0.600000
F = 0.700000

using option 2
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000
F = 0.700000

I would like to know whether there is a way in this toy example to get all the different F values without making N copies (option 3).
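One direction I can imagine for option 3, as an untested sketch only: instead of making pair M + 1 wait for a free functor, launch just M threads and let thread t own functor t, looping over pairs t, t + M, t + 2M, and so on. This is a static assignment in the style of a grid-stride loop, not the dynamic waiting scheme from my plan. The names makeFtors, executeFtors, freeFtors and computeForces, the result() accessor (used here instead of show()) and the out array are made up for this sketch. It also assumes the functor can be built on the device from the single parameter a rather than copied from a host object.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

struct FtorBase
{
  __device__ virtual void execute(long i) = 0;
  __device__ virtual double result() const = 0;  // hypothetical accessor, stands in for show()
  __device__ virtual ~FtorBase() {}
};

struct FtorA : public FtorBase
{
  __device__ void execute(long i) final { F = a * i; }
  __device__ double result() const final { return F; }
  double a;
  double F;
};

// Build the clones on the device so their vtable pointers are valid there.
// Simplification (assumption): only the parameter a is passed in.
__global__ void makeFtors(FtorBase** ftors, double a, long nClones)
{
  const long t = threadIdx.x + blockIdx.x * (long)blockDim.x;
  if (t >= nClones) return;
  FtorA* f = new FtorA();
  f->a = a;
  f->F = 0.0;
  ftors[t] = f;
}

// Thread t owns ftors[t] and handles pairs t, t + nClones, t + 2 * nClones, ...
// No two threads ever touch the same functor, so F stays thread-private.
__global__ void executeFtors(FtorBase** ftors, double* out, long nPairs, long nClones)
{
  const long t = threadIdx.x + blockIdx.x * (long)blockDim.x;
  if (t >= nClones) return;
  FtorBase* f = ftors[t];
  for (long i = t; i < nPairs; i += nClones) {
    f->execute(i);
    out[i] = f->result();
  }
}

// Release the device-heap allocations made in makeFtors.
__global__ void freeFtors(FtorBase** ftors, long nClones)
{
  const long t = threadIdx.x + blockIdx.x * (long)blockDim.x;
  if (t < nClones) delete ftors[t];
}

std::vector<double> computeForces(long nPairs, long nClones, double a)
{
  const int tpb = 128;
  const int blocks = (int)((nClones + tpb - 1) / tpb);
  FtorBase** d_ftors = nullptr;
  double* d_out = nullptr;
  cudaMalloc(&d_ftors, nClones * sizeof(FtorBase*));  // only M pointers, not N
  cudaMalloc(&d_out, nPairs * sizeof(double));
  makeFtors<<<blocks, tpb>>>(d_ftors, a, nClones);
  executeFtors<<<blocks, tpb>>>(d_ftors, d_out, nPairs, nClones);
  std::vector<double> out(nPairs);
  // cudaMemcpy waits for the preceding kernels on the default stream.
  cudaMemcpy(out.data(), d_out, nPairs * sizeof(double), cudaMemcpyDeviceToHost);
  freeFtors<<<blocks, tpb>>>(d_ftors, nClones);
  cudaDeviceSynchronize();
  cudaFree(d_out);
  cudaFree(d_ftors);
  return out;
}

int main()
{
  // Toy values from the question: N = 10 pairs, M = 5 clones, a = 0.1
  const std::vector<double> F = computeForces(10, 5, 0.1);
  for (double f : F) printf("F = %f\n", f);
  return 0;
}
```

With this layout only M threads run at once, so M probably has to be large enough to keep the GPU busy. That would shift the trade-off between copy time and parallel gain compared with the M of around 1000 I guessed above.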

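PS: I am using Ubuntu 18.04, nvcc 9.1 and an NVIDIA GeForce GTX 1060 mobile graphics card (CUDA compute capability 6.1).

Update:

The code I originally posted only showed the problem in debug mode (compiled with the -G flag), not in a release build. My guess is that the compiler optimized printf("F = %f\n", F); into printf("F = %f\n", a*i);, which made the problem with thread-dependent member variables (the subject of this question) disappear.

I have updated the code so that the compiler can no longer make that substitution in the printf.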

How can functors with member data be reused across many kernel executions in CUDA to improve memory usage and reduce copy time?
