Question

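I am working on a task related to graph traversal (the Viterbi algorithm). At each step I have a compacted set of active states. Some work is done for each state, and the results are propagated along the outgoing arcs to each arc's destination state, which builds the new set of active states. The problem is that the number of outgoing arcs varies widely, from two or three to several thousand, so the compute threads are loaded very inefficiently.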

I tried to share the work through a queue in shared memory:

int tx = threadIdx.x;

extern __shared__ int smem[];

int *stateSet_s = smem;                     //new active set
int *arcSet_s = &(smem[Q_LEN]);             //local shared queue
float *scores_s = (float*)&(smem[2*Q_LEN]);

__shared__ int arcCnt;
__shared__ int stateCnt;

if ( tx == 0 )
{
   arcCnt = 0;
   stateCnt = 0;
}

__syncthreads();

//load state index from compacted list of state indexes
int stateId = activeSetIn_g[gtx];

float srcCost = scores_g[ stateId ];
int startId = outputArcStartIds_g[stateId];

int nArcs = outputArcCounts_g[stateId]; //number of outgoing arcs to be propagated (2-3 to thousands)

/////////////////////////////////////////////
/// prepare arc set
/// !!!! that is the troubled code I think !!!!
/// bank conflicts? uncoalesced access?

int myPos = atomicAdd ( &arcCnt, nArcs );

while ( ( nArcs > 0 ) && ( myPos < Q_LEN ) )
{
    scores_s[myPos] = srcCost;
    arcSet_s[myPos] = startId + nArcs - 1;

    myPos++;
    nArcs--;
}

__syncthreads();

//////////////////////////////////////
/// parallel propagate arc set

if ( arcSet_s[tx] > 0 )
{
   FstArc arc = arcs_g[ arcSet_s[tx] ];
   float srcCost_ = scores_s[tx];

   DoSomeJob ( &srcCost_ );

   int *dst = &(transitionData_g[arc.dst]);

   int old = atomicMax( dst, FloatToInt ( srcCost_ ) );

   ////////////////////////////////
   //// new active set

   if ( old == ILZERO )
   {
      int pos = atomicAdd ( &stateCnt, 1 );
      stateSet_s[ pos ] = arc.dst;
   }
}

/////////////////////////////////////////////
/// transfer new active set from smem to gmem

__syncthreads();

__shared__ int gPos;

if ( tx == 0 )
{
   gPos = atomicAdd ( activeSetOutSz_g, stateCnt );
}

__syncthreads();

if ( tx < stateCnt )
{
    activeSetOut_g[gPos + tx] = stateSet_s[tx];
}

__syncthreads();

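But it runs very slowly. In fact it is slower than the version that uses no active set at all (active set = all states), even though the active set holds only 10-15% of all states. Register pressure has gone up considerably and occupancy is low, but I don't think anything can be done about that.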

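Is there a more efficient way to share work between threads? I considered the warp-shuffle operations of compute capability 3.0, but I have to target 2.x devices.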

Answer 1

Uneven workloads and dynamically created work are usually handled with multiple CUDA kernel calls. This can be done with a CPU loop like the following:

//CPU pseudocode
while ( job not done) {
    doYourComputationKernel();
    loadBalanceKernel();
}

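doYourComputationKernel() must include a heuristic that tells it when to stop and hand control back to the CPU so the workload can be rebalanced. One way to do this is a global counter of idle blocks, incremented each time a block finishes its work or cannot create any more. Once the number of idle blocks exceeds a threshold, every block saves its remaining work to global memory and exits.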
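A minimal sketch of such a stopping heuristic is below. Everything in it is an assumption for illustration: the per-block queue layout (`queues`, `counts`, `cap`), the `percent` threshold, the `shouldYield` helper and the `processItem` placeholder are not from the question. The queues are kept in global memory, so "saving the work" reduces to writing back the remaining count. The macro guard only lets the helper compile as plain C++ for host-side checks.

```cpp
#ifndef __CUDACC__            // let the helper compile as plain C++ too
#define __host__
#define __device__
#endif

// Hypothetical heuristic: yield once `percent` percent of the blocks are idle.
__host__ __device__ inline bool shouldYield(int idleBlocks, int numBlocks, int percent) {
    return idleBlocks * 100 >= numBlocks * percent;
}

#ifdef __CUDACC__
// Placeholder for the real per-item work (assumed; real code would also push new work).
__device__ void processItem(int item) { (void)item; }

// queues:     numBlocks * cap ints, block b owns queues[b*cap .. b*cap+cap)
// counts:     items currently in each block's queue (updated on exit)
// idleBlocks: global counter, must be zeroed by the host before every launch
__global__ void doYourComputationKernel(int* queues, int* counts, int cap,
                                        int* idleBlocks, int percent) {
    __shared__ int n;       // items left in this block's queue
    __shared__ bool stop;   // block-wide decision made by thread 0
    int* queue = queues + blockIdx.x * cap;
    if (threadIdx.x == 0) n = counts[blockIdx.x];
    __syncthreads();

    while (true) {
        if (threadIdx.x == 0) {
            // atomicAdd(p, 0) reads through the atomic path, not a possibly stale cache line
            int idle = atomicAdd(idleBlocks, 0);
            stop = (n == 0) || shouldYield(idle, gridDim.x, percent);
        }
        __syncthreads();
        if (stop) break;

        int i = n - 1 - (int)threadIdx.x;        // take one batch from the tail
        if (i >= 0) processItem(queue[i]);
        __syncthreads();                         // every thread has read n
        if (threadIdx.x == 0) n = max(n - (int)blockDim.x, 0);
    }

    if (threadIdx.x == 0) {
        if (n == 0) atomicAdd(idleBlocks, 1);    // this block ran dry
        counts[blockIdx.x] = n;                  // leftover work stays in global memory
    }
}
#endif
```

Blocks that run dry exit right away instead of spinning, so blocks that are not yet resident on the device can still be scheduled.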

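loadBalanceKernel() should receive a global array holding all the saved work and a second global array of per-block work counters. A reduction over the counters then gives the total amount of work, from which the number of work items per block follows. Finally, the kernel copies the work so that each block receives the same number of elements.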

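The loop continues until all the computation is done. There is a good paper on this: http://gamma.cs.unc.edu/GPUCOL/. Its idea is to balance the load of continuous collision detection, which is highly unbalanced.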

Sharing highly irregular work among CUDA threads
