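I'm learning CUDA through the Udacity online course. In one of the assignments, I noticed that dynamically allocating memory for a local pointer inside a kernel slows my program down dramatically.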

__global__
void yourHisto(const unsigned int* const vals,
               unsigned int* const d_histo, int numVals, const int numBins, const int valsPerT, int blocks)
{
    extern __shared__ unsigned int s_bin[];
    int bId = blockIdx.x;
    int tId = threadIdx.x;
    int id = bId*blockDim.x + tId;
    // Number of shared bins each thread has to zero and later flush.
    int sBinPerT = (numBins - 1) / blockDim.x + 1;
    for (int i = 0; i < sBinPerT; i++)
    {
        if (i*blockDim.x + tId < numBins)
            s_bin[i*blockDim.x + tId] = 0;
    }
    // Every bin must be zeroed before any thread starts counting into it.
    __syncthreads();
    //unsigned int* lclVal = new unsigned int[valsPerT];
    for (int i = 0; i < valsPerT; i++)
    {
        if (i*blocks*blockDim.x + id < numVals)
        {
            //lclVal[i] = vals[i*blocks*blockDim.x + id];
            atomicAdd(&s_bin[vals[i*blocks*blockDim.x + id]], 1);
        }
    }
    __syncthreads();
    // Merge this block's shared histogram into the global one.
    for (int i = 0; i < sBinPerT; i++)
    {
        if (i*blockDim.x + tId < numBins)
            atomicAdd(&d_histo[i*blockDim.x + tId], s_bin[i*blockDim.x + tId]);
    }
    //delete[] lclVal; lclVal = NULL;
}

If I uncomment the parts of the code above that involve the local pointer lclVal, and then replace vals[i*blocks*blockDim.x + id] with lclVal[i] in atomicAdd(&s_bin[vals[i*blocks*blockDim.x + id]], 1), the code slows down by a factor of about 300 (~1.2 ms vs. ~370 ms). Any idea why this happens? Thanks a lot.
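
For concreteness, here is a minimal sketch of what the slow variant described above would look like with the lclVal lines enabled. The kernel name yourHistoLocal and the host helper histogramWithLocalCopy are hypothetical names introduced only for this illustration, and the launch configuration is an assumption, not the course's setup.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical name: the kernel from the post with the lclVal lines enabled.
__global__ void yourHistoLocal(const unsigned int* const vals,
                               unsigned int* const d_histo, int numVals,
                               const int numBins, const int valsPerT, int blocks)
{
    extern __shared__ unsigned int s_bin[];
    int tId = threadIdx.x;
    int id = blockIdx.x * blockDim.x + tId;
    int stride = blocks * blockDim.x;

    // Zero the per-block histogram, then wait until all bins are cleared.
    for (int b = tId; b < numBins; b += blockDim.x)
        s_bin[b] = 0;
    __syncthreads();

    // In-kernel new allocates from the device heap, once per thread.
    unsigned int* lclVal = new unsigned int[valsPerT];
    if (lclVal != NULL)  // assumption: a failed allocation yields a null pointer
    {
        for (int i = 0; i < valsPerT; i++)
        {
            int idx = i * stride + id;
            if (idx < numVals)
            {
                lclVal[i] = vals[idx];            // copy into the heap buffer
                atomicAdd(&s_bin[lclVal[i]], 1u); // count from the local copy
            }
        }
        delete[] lclVal;
    }
    __syncthreads();

    // Merge this block's shared histogram into the global one.
    for (int b = tId; b < numBins; b += blockDim.x)
        atomicAdd(&d_histo[b], s_bin[b]);
}

// Hypothetical host helper: runs the kernel and returns the histogram.
// Assumes h_vals is non-empty and every value is smaller than numBins.
std::vector<unsigned int> histogramWithLocalCopy(
    const std::vector<unsigned int>& h_vals, int numBins, int blocks, int threads)
{
    int numVals = (int)h_vals.size();
    int valsPerT = (numVals - 1) / (blocks * threads) + 1;
    size_t valBytes = numVals * sizeof(unsigned int);
    size_t binBytes = numBins * sizeof(unsigned int);

    unsigned int* d_vals = NULL;
    unsigned int* d_histo = NULL;
    cudaMalloc(&d_vals, valBytes);
    cudaMalloc(&d_histo, binBytes);
    cudaMemcpy(d_vals, h_vals.data(), valBytes, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, binBytes);

    // Third launch parameter: dynamic shared memory for s_bin.
    yourHistoLocal<<<blocks, threads, binBytes>>>(
        d_vals, d_histo, numVals, numBins, valsPerT, blocks);

    std::vector<unsigned int> h_histo(numBins);
    cudaMemcpy(h_histo.data(), d_histo, binBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_vals);
    cudaFree(d_histo);
    return h_histo;
}
```

The only difference from the fast version is the per-thread new/delete[] pair and the extra copy through lclVal. In-kernel new draws from the device heap in global memory, whose size is controlled by cudaLimitMallocHeapSize, so this sketch at least isolates that allocation as the change being timed. Whether it accounts for the whole slowdown is not established here.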