Question

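I am parallelizing the following block in order to compute the number of lethal accidents per week on a dataset of variable size (0.5M, 1M, 2M rows, etc.) stored in localRows (a vector). The shared variables local_lethAccPerWeek, local_accAndPerc and local_boroughWeekAcc are contiguously stored arrays (e.g. int local_lethAccPerWeek[NUM_YEARS][NUM_WEEKS_PER_YEAR] = {};).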

    // [2] Data processing
    procBegin = MPI_Wtime();

    cout << "[Proc. " + to_string(myrank) + "] Started processing dataset..." << endl;
    omp_set_num_threads(num_omp_threads);
    int cfIndex, brghIndex;

// Every worker will compute in the final datastructure the num of lethal accidents for its sub-dataset and then Reduce it to allow the master to collect final results
#pragma omp parallel for default(shared) schedule(dynamic) private(cfIndex, brghIndex)
    for (int i = 0; i < my_num_rows; i++)
    {
        int lethal = (localRows[i].num_pers_killed > 0) ? 1 : 0;
        string borough = string(localRows[i].borough);
        int week, year, month = 0;

        if (lethal || !borough.empty())
        {
            week = getWeek(localRows[i].date);
            year = getYear(localRows[i].date);
            month = getMonth(localRows[i].date);

            // If I'm week = 1 and month = 12, this means I belong to the first week of the next year.
            // If I'm week = (52 or 53) and month = 01, this means I belong to the last week of the previous year.
            if (week == 1 && month == 12)
                year++;
            else if ((week == 52 || week == 53) && month == 1)
                year--;

            year = year - 2012;
            week = week - 1;
        }

        /* Query1 */
        if (lethal)
#pragma omp atomic
            local_lethAccPerWeek[year][week]++;

        /* Query2 */
        for (int k = 0; k < localRows[i].num_contributing_factors; k++)
        {
            cfIndex = cfDictionary.at(string(localRows[i].contributing_factors[k]));
#pragma omp critical(query2)
            {
                (local_accAndPerc[cfIndex].numAccidents)++;
                (local_accAndPerc[cfIndex].numLethalAccidents) += lethal;
            }
        }

        /* Query3 */
        if (!borough.empty()) // if borough is not specified we're not interested
        {
            brghIndex = brghDictionary.at(borough);
#pragma omp critical(query3)
            {
                local_boroughWeekAcc[brghIndex][year][week].numAccidents++;
                local_boroughWeekAcc[brghIndex][year][week].numLethalAccidents += lethal;
            }
        }
    }

    procDuration = MPI_Wtime() - procBegin;

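I'm seeing strange behaviour: increasing the number of OMP threads increases my execution time. I know that spawning threads adds overhead (context switching and so on), and that in some cases letting a single thread do the work may be faster, but I can't see how parallelizing this kind of operation (it is just increments inside atomic sections) could make things worse. Out of curiosity I also tried changing the scheduling, but of course that didn't help.

I'm asking because you may spot something I'm missing. Thanks in advance, and please leave a comment if you need more information.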

Answer 1

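A couple of things to note here:

You are using schedule(dynamic). With the default chunk size of 1, this means every single iteration of the loop is handed out to a thread on a first-come, first-served basis. That adds a lot of overhead, especially when my_num_rows is large. It is better to hand out iterations in large chunks of, say, N iterations each, so try changing the schedule clause to schedule(dynamic,N).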
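To make the chunking suggestion concrete, here is a minimal sketch. The function countLethal and its killed vector are hypothetical stand-ins for the loop over localRows, and the chunk size is a tuning parameter you would have to measure for your own data rather than a recommended value.

```cpp
#include <vector>

// Hypothetical stand-in for the question's loop: killed[i] plays the role of
// localRows[i].num_pers_killed. `chunk` must be a positive integer.
long countLethal(const std::vector<int>& killed, int chunk)
{
    long total = 0;
    const long n = static_cast<long>(killed.size());

    // Each idle thread grabs `chunk` consecutive iterations at a time instead
    // of a single one, so the scheduler is consulted roughly n / chunk times.
    // The reduction gives every thread its own private `total`, which keeps
    // the example free of atomics.
#pragma omp parallel for schedule(dynamic, chunk) reduction(+ : total)
    for (long i = 0; i < n; i++)
        total += (killed[i] > 0) ? 1 : 0;

    return total;
}
```

If the work per row is roughly uniform, schedule(static) is also worth trying, since it avoids run-time scheduling altogether.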

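You have many instances of both true and false sharing, which negate the benefit of having CPU caches, for the following two reasons:

Atomic updates of shared variables are much slower than the same updates in a single-threaded run, because the L1/L2 cache line holding the value is constantly invalidated and reloaded from further down the cache hierarchy. In a sequential program the cache line stays hot, and if it is a single value the compiler can even keep it in a register (the latter does not apply in your case, since you are incrementing array elements).
Similarly, false sharing occurs when you update two different array elements that happen to lie in the same cache line. This seems quite likely in Query2, for example, especially if the number of contributing factors is small.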
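One common way to sidestep both effects is to give each thread its own counter and to pad the counters so that no two of them share a cache line. The sketch below illustrates the idea only; PaddedCounter and countPositivePadded are hypothetical names, and the 64-byte line size is an assumption that holds on most x86-64 CPUs but is not universal.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Assumption: cache lines are 64 bytes. The padding makes each counter occupy
// a full 64-byte stride, so two threads never write to the same line.
struct PaddedCounter
{
    long value = 0;
    char pad[64 - sizeof(long)];
};

long countPositivePadded(const std::vector<int>& data)
{
    int maxThreads = 1;
#ifdef _OPENMP
    maxThreads = omp_get_max_threads();
#endif
    std::vector<PaddedCounter> perThread(maxThreads);
    const long n = static_cast<long>(data.size());

#pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        // Each thread only ever touches its own padded slot: no atomics, no
        // critical sections, and no cache line bouncing between cores.
#pragma omp for schedule(static)
        for (long i = 0; i < n; i++)
            if (data[i] > 0)
                perThread[tid].value++;
    }

    // Combine the per-thread partial counts sequentially.
    long total = 0;
    for (const PaddedCounter& c : perThread)
        total += c.value;
    return total;
}
```

The same pattern extends to the arrays in the question by giving each thread a private copy of the array and summing the copies after the loop.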

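What you could do is sort and group localRows by borough and date, then distribute the computation across the groups. This prevents both true and false sharing when updating the aggregates for Query1 and Query3. As for the contributing factors in Query2, if there aren't many of them, use an OpenMP reduction.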
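The reduction suggested for Query2 could look roughly like the sketch below. It assumes a compiler supporting OpenMP 4.5 or later (array-section reductions), and it simplifies the data: countPerFactor and factorIdx are hypothetical, with factorIdx[i] standing for an index already looked up in cfDictionary.

```cpp
#include <vector>

// Hypothetical simplification of Query2: one contributing-factor index per
// row, each in the range [0, numFactors). numFactors must be positive.
std::vector<int> countPerFactor(const std::vector<int>& factorIdx, int numFactors)
{
    std::vector<int> counts(numFactors, 0);
    int* c = counts.data();
    const long n = static_cast<long>(factorIdx.size());

    // Every thread gets a private, zero-initialized copy of c[0..numFactors)
    // and increments it without synchronization. OpenMP sums the private
    // copies into the original array once, at the end of the loop.
#pragma omp parallel for schedule(static) reduction(+ : c[:numFactors])
    for (long i = 0; i < n; i++)
        c[factorIdx[i]]++;

    return counts;
}
```

This trades a little memory (one small array per thread) for the removal of the critical section, which is why it is attractive only when the number of contributing factors is modest.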

Execution time gets longer as the number of OMP threads increases
