Question

I am trying to reduce execution time by porting some CPU functions from my project to the GPU using OpenCL.

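Using the VS profiler, I found that most of the execution time is spent in the InsertCandidate() function, so I am considering writing a kernel to run this function on the GPU. The most expensive operation in the function is the for loop. However, as can be seen, each iteration contains three if statements, which may cause branch divergence and therefore serialization, even when executed on the GPU.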

#include <list>

template <class Tp>
int InsertCandidate( std::list<Tp> &detected, const Tp &box, double &nProbThreshold, int nMaxCandidate, double nMinProb )
{
    if( box._prob < nMinProb && box._prob < nProbThreshold )
        return -1;
    // Only use detection score to select positives.
    if( nMaxCandidate == 0 )
    {
        if( box._prob > nMinProb )
            detected.push_back( box );
        return 0;
    }

    typename std::list<Tp>::iterator    iter;
    int nCandidate = 0;

    for( iter = detected.begin(); iter != detected.end(); iter++ )
    {
        if( nCandidate == nMaxCandidate-1 )
            nProbThreshold = iter->_prob;

        if( box._prob >= iter->_prob )
            break;
        if( nCandidate >= nMaxCandidate && box._prob <= nMinProb )
            break;
        nCandidate ++;

    }

    if( nCandidate < nMaxCandidate || box._prob > nMinProb )
        detected.insert( iter, box );
    return 0;
}
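To make the divergence concern concrete, here is a minimal sketch of one way the search part of the loop could be looked at. It rests on assumptions that are not stated above: that `detected` is kept sorted by descending `_prob`, and that the scores could be copied into a flat array (a `std::list` cannot be handed to a kernel directly). Under those assumptions, the position at which the first `break` fires equals the number of stored scores strictly greater than `box._prob`, and that count needs no early exit. The names `ScanPosition` and `CountPosition` are hypothetical, and the sketch ignores the `nMaxCandidate` cutoff and the `nProbThreshold` update.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sequential scan with an early exit, mirroring only the
// first `break` of the loop above (stop at the first score <= boxProb).
std::size_t ScanPosition(const std::vector<double>& probs, double boxProb) {
    std::size_t i = 0;
    while (i < probs.size() && boxProb < probs[i])
        ++i;
    return i;
}

// Hypothetical helper: the same position computed as a count.
// Assumes `probs` is sorted in descending order. Every comparison is
// independent of the others, so each one could be evaluated by a separate
// work-item and the results summed with a reduction.
std::size_t CountPosition(const std::vector<double>& probs, double boxProb) {
    std::size_t count = 0;
    for (double p : probs)
        count += (p > boxProb) ? 1u : 0u;
    return count;
}
```

For a descending input such as `{0.9, 0.7, 0.7, 0.4}` both helpers return the same index for any `boxProb`. Whether this reformulation would pay off on a GPU is unclear to me, since the list still has to be copied to the device for every insertion.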

In conclusion, can this routine be ported to OpenCL?

Can this function be parallelized with OpenCL?

0 answers: