Question

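I am trying to multithread a for loop in C++ so that the calculation is divided across multiple threads. However, it contains data that needs to be joined together in a specific order.

So the idea is to first join the small pieces on many cores (25,000+ loop iterations) and then join the combined data once more at the end.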

std::vector<int> ids;               // mappings
std::map<int, myData> combineData;  // data per id
myData outputData;                  // combined data based on the mappings
myData threadData;                  // data per thread

    #pragma omp parallel for default(none) private(threadData) shared(ids, combineData)
        for (int i=0; i<30000; i++)
        {
            threadData += combineData[ids[i]];
        }

    // Then here I would like to get all the separate thread data and combine them in a similar manner
    // I.e.: for each threadData:  outputData += threadData

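What would be an efficient way to approach this?

How can the OpenMP loop be scheduled so that the iterations are split into even, contiguous chunks?

For example, with 2 threads: [0,1,2,3,4,...,14999] & [15000,15001,15002,15003,15004,...,29999]

If there is a better way to join the data (which involves joining a lot of std::vectors together and some matrix math) while still preserving the order of additions, pointers to that would help as well.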

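Added information

The addition is associative, but not commutative.
myData is not an intrinsic type. It is a class containing data as multiple std::vectors (and some data related to the Autodesk Maya API).
Each cycle does a similar matrix multiplication on many points and adds these points to a vector (so in theory the calculation time should stay roughly the same per cycle).

Basically it adds mesh data (consisting of vectors of data) to each other (combining meshes), and the order of the whole thing determines the index values of the vertices. The vertex indices should be consistent and reconstructible.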

Answer 1

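This depends on a few properties of the addition operator of myData. If the operator is both associative, (A + B) + C = A + (B + C), and commutative, A + B = B + A, then you can use a critical section or, if the data is plain old data (e.g. float, int, ...), a reduction.

However, if it is not commutative, as you say (the order of operations matters), but still associative, you can fill an array, with as many elements as there are threads, with the combined data in parallel, and then merge those elements in order in serial (see the code below). Using schedule(static) will split the chunks more or less evenly and with increasing thread number, as you want.

If the operator is neither associative nor commutative, then I don't think you can parallelize it efficiently (for example, try parallelizing a Fibonacci series efficiently).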

std::vector<int> ids;               // mappings
std::map<int, myData> combineData;  // data per id
myData outputData;                  // combined data based on the mappings
myData *threadData;
int nthreads;
#pragma omp parallel
{
    #pragma omp single
    {
        nthreads = omp_get_num_threads();
        threadData = new myData[nthreads];
    }
    myData tmp;
    #pragma omp for schedule(static)
    for (int i=0; i<30000; i++) {
        tmp += combineData[ids[i]];
    }
    threadData[omp_get_thread_num()] = tmp;
}
for(int i=0; i<nthreads; i++) {
     outputData += threadData[i];
}
delete[] threadData;
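
To try the pattern above without the Maya types, here is a minimal sketch. It uses a hypothetical stand-in for myData: std::string, whose += (concatenation) is associative but not commutative, like the mesh merge in the question. The function name ordered_concat, the std::vector of per-thread slots, and the serial fallbacks for builds without -fopenmp are illustrative choices, not part of the original code. It relies on the same assumption as the code above, namely that schedule(static) gives each thread one contiguous chunk in order of increasing thread number.

```cpp
#include <cassert>
#include <string>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without -fopenmp.
inline int omp_get_thread_num() { return 0; }
inline int omp_get_num_threads() { return 1; }
#endif

// Hypothetical stand-in for myData: std::string concatenation is
// associative but not commutative.
std::string ordered_concat(const std::vector<std::string>& pieces) {
    const int n = static_cast<int>(pieces.size());
    std::vector<std::string> partial;   // one slot per thread
    #pragma omp parallel
    {
        #pragma omp single
        partial.resize(omp_get_num_threads());
        // The implicit barrier at the end of "single" ensures every thread
        // sees the resized vector before writing to it.

        std::string tmp;                // per-thread partial result
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++) {
            tmp += pieces[i];
        }
        assert(omp_get_thread_num() < static_cast<int>(partial.size()));
        partial[omp_get_thread_num()] = tmp;
    }
    // Serial merge in thread order (assumes chunks were handed out in order).
    std::string out;
    for (const std::string& p : partial) out += p;
    return out;
}
```

If the result matches a plain serial concatenation of the same pieces for several thread counts (for example by setting OMP_NUM_THREADS), the chunk ordering assumption holds on that compiler and runtime.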

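Edit: I am now not 100% sure that the chunks are assigned in order of increasing thread number with #pragma omp for schedule(static) (though I would be surprised if they are not). This question is still under discussion. Meanwhile, if you want to be 100% sure, then instead of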

#pragma omp for schedule(static)
for (int i=0; i<30000; i++) {
    tmp += combineData[ids[i]];
}

you can do

const int nthreads = omp_get_num_threads();
const int ithread = omp_get_thread_num();
const int start = ithread*30000/nthreads;
const int finish = (ithread+1)*30000/nthreads;
for(int i = start; i<finish; i++) {
     tmp += combineData[ids[i]];          
}
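
The index arithmetic above can be checked in isolation. The sketch below wraps it in a hypothetical helper, chunk_bounds, which is not part of the original answer. It uses the same formula, but with a 64-bit intermediate so that ithread * n cannot overflow for large n (an added precaution, not needed for 30000 iterations).

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: half-open range [start, finish) of iterations for
// thread `ithread` out of `nthreads`, when `n` iterations are split into
// contiguous chunks of nearly equal size.
std::pair<int, int> chunk_bounds(int ithread, int nthreads, int n) {
    assert(nthreads > 0 && ithread >= 0 && ithread < nthreads && n >= 0);
    const long long start  = static_cast<long long>(ithread) * n / nthreads;
    const long long finish = static_cast<long long>(ithread + 1) * n / nthreads;
    return {static_cast<int>(start), static_cast<int>(finish)};
}
```

With n = 30000 and 2 threads this yields [0, 15000) and [15000, 30000), the partition asked for in the question. Because the finish of thread k is computed with the same expression as the start of thread k + 1, the chunks have no gaps or overlaps, and their sizes differ by at most one.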

Edit:

I found a more elegant way to fill in parallel but merge in order

#pragma omp parallel
{
    myData tmp;
    #pragma omp for schedule(static) nowait 
    for (int i=0; i<30000; i++) {
        tmp += combineData[ids[i]];
    }
    #pragma omp for schedule(static) ordered 
    for(int i=0; i<omp_get_num_threads(); i++) {
        #pragma omp ordered
        outputData += tmp;
    }
}

This avoids allocating data for each thread (threadData) and merging outside of the parallel region.

Answer 2

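If you really want to preserve the same order as in the serial case, then there is no other way than doing it serially. In that case you can try to parallelize the operations done inside operator+=.

If the operations can be done in any order, but the reduction of the blocks has a specific order, then it may be worth having a look at TBB parallel_reduce. It will require you to write more code, but, if I remember correctly, you can define complex custom reduction operations.

If the order of the operations does not matter, then your snippet is almost complete. What it lacks is possibly a critical construct to aggregate the private data: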

std::vector<int> ids;               // mappings
std::map<int, myData> combineData;  // data per id
myData outputData;                  // combined data based on the mappings

#pragma omp parallel
{ 
    myData threadData;              // data per thread

    #pragma omp for nowait
    for (int ii =0; ii < total_iterations; ii++)
    {
        threadData += combineData[ids[ii]];
    }
    #pragma omp critical
    {
        outputData += threadData;
    }    
    #pragma omp barrier
    // From here on you are ensured that every thread sees 
    // the correct value of outputData 
 }

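In this case the scheduling of the for loop is not important for the semantics. If the overload of operator+= is a relatively stable operation (in terms of the time needed to compute it), then you can use schedule(static), which divides the iterations evenly among threads. Otherwise you can resort to another schedule to balance the computational load (e.g. schedule(guided)).

Finally, if myData is a typedef of an intrinsic type, then you can avoid the critical section and use a reduction clause: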
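
As a concrete illustration of the critical-section merge, here is a minimal sketch with a hypothetical commutative stand-in for myData: a Histogram struct whose += adds bin counts element-wise, so the order in which threads merge does not change the result. The names Histogram and count_values are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical commutative stand-in for myData: merging two histograms
// adds their bins element-wise, so the merge order does not matter.
struct Histogram {
    std::vector<long> bins;
    explicit Histogram(int nbins) : bins(nbins, 0) {}
    Histogram& operator+=(const Histogram& other) {
        assert(bins.size() == other.bins.size());
        for (std::size_t b = 0; b < bins.size(); b++) bins[b] += other.bins[b];
        return *this;
    }
};

// Counts values into nbins bins; assumes nbins > 0 and every value >= 0.
Histogram count_values(const std::vector<int>& values, int nbins) {
    Histogram output(nbins);
    const int n = static_cast<int>(values.size());
    #pragma omp parallel
    {
        Histogram local(nbins);                 // private to each thread
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++) {
            local.bins[values[i] % nbins] += 1;
        }
        #pragma omp critical
        output += local;                        // one thread at a time, in any order
    }
    // The implicit barrier at the end of the parallel region guarantees
    // that all merges have finished here.
    return output;
}
```

Each iteration costs about the same here, so schedule(static) is a reasonable choice; with uneven iteration costs, schedule(guided) could be swapped in without changing the result.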

    #pragma omp parallel for reduction(+:outputData)
    for (int ii =0; ii < total_iterations; ii++)
    {
        outputData += combineData[ids[ii]];
    }

In this case you do not need to declare anything explicitly as private.
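
A minimal self-contained sketch of the reduction variant, assuming the per-id data is a plain int (the function name sum_by_ids is hypothetical). One deliberate difference from the snippets above: it reads the map with .at() instead of operator[], because operator[] inserts missing keys and would therefore modify the shared map from several threads at once.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Assumes the per-id data is a plain int, so the built-in "+" reduction applies,
// and that every id in `ids` is present in `combineData`.
long sum_by_ids(const std::vector<int>& ids, const std::map<int, int>& combineData) {
    long total = 0;
    const int n = static_cast<int>(ids.size());
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        // Each thread adds into its own private copy of total;
        // OpenMP combines the copies when the loop ends.
        total += combineData.at(ids[i]);   // read-only lookup, never inserts
    }
    return total;
}
```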

C++ OpenMP: Split for loop in even chunks static and join data at the end
