Question

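Using C++11 threads for multithreaded programming, I want to make sure that splitting an algorithm into data-independent parts and processing them in parallel reduces the overall running time.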

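Say the task is to find the maximum value in an integer array, which is very simple to parallelize: each thread finds the local maximum on its own chunk of the data, and once all local maxima have been found, we take the final maximum among them. So the running time should drop by a factor of 3-4 when using 4 hardware threads (my PC has 4).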

Code

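#include <chrono>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

using namespace std::chrono;

void max_el(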
    std::vector<int>& v,
    std::vector<int>::value_type& max, 
    const int& n_threads=1,
    const unsigned int& tid = 0)
{
    max = v[tid];
    for (size_t i = tid, end = v.size(); i < end; i += n_threads)
    {
        if (v[i] > max)
        {
            max = v[i];
        }
    }
}

void max_el_concurrent(std::vector<int>& v)
{
    int n_threads = std::thread::hardware_concurrency();
    std::cout << n_threads << " threads" << std::endl;
    std::vector<std::thread> workers(n_threads);
    std::vector<int> res(n_threads);

    for (size_t i = 0; i < n_threads; ++i)
    {
        workers[i] = std::thread(max_el, std::ref(v), std::ref(res[i]), n_threads, i);
    }

    for (auto& worker: workers)
    {
        worker.join();
    }


    std::vector<int>::value_type final_max;
    max_el(std::ref(res), std::ref(final_max));
    std::cout << final_max << std::endl;
}


void max_el_sequential(std::vector<int>& v)
{
    std::vector<int>::value_type max;
    std::cout << "sequential" << std::endl;
    max_el(v, max);
    std::cout << max << std::endl;
}


template< class Func, class Container >
void profile(Func func, Container cont)
{
    high_resolution_clock::time_point start, now;
    double runtime = 0.0f;

    start = high_resolution_clock::now();
    func(cont);
    now = high_resolution_clock::now();
    runtime = duration<double>(now - start).count();
    std::cout << "running time = " << runtime << " sec" << std::endl;
}


#define NUM_ELEMENTS 100000000

int main()
{
    std::vector<int> v;
    v.reserve(NUM_ELEMENTS + 100);
    //  filling
    std::cout << "data is ready, running ... " << std::endl;
    profile(max_el_sequential, v);  // 0.506731 sec

    profile(max_el_concurrent, v);  // 0.26108 sec why only ~2 times faster !?

    return 0;
}

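Although std::thread::hardware_concurrency returns 4, this code runs only about 2 times faster than the sequential algorithm.

Given that /proc/cpuinfo shows 2 CPUs with 2 cores each, and that the code has no locking/unlocking, I/O, or inter-thread communication overhead, I expected the theory to hold and the running time to drop by at least 3x to 4x, but that is not what happens in practice...

So why does it behave this way?

What is actually going on?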

Answer 1

On my system (Core i7-5820K), your application appears to be memory bound.

The speedup I get is 2.9 (with 12 threads).

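On my system, the maximum DRAM bandwidth is 45 GB/s.

A single-threaded run of your application uses about 16 GB/s.

With 12 threads it uses 45 GB/s.

(The result and the total execution time are the same with 3 to 11 threads.)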

The way you stride across contiguous memory in this loop is not very efficient:

    for (size_t i = tid, end = v.size(); i < end; i += n_threads)

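Memory is read into the L2 cache in contiguous blocks, so striding through it in parallel is wasteful: with 64-byte cache lines and 4-byte ints, every thread ends up loading the entire array, for up to 16 threads. It is also very wasteful of L2 cache, because only a fraction of each cache line is actually used (assuming the threads are not perfectly synchronized and the distance between their active regions quickly exceeds the L2 size).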
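One way to avoid the strided pattern is to give each thread one contiguous slice of the vector, so that each cache line is fetched by a single thread. The sketch below assumes a non-empty input; chunk_max and max_el_blocked are hypothetical names, not part of the question's code. Since the scan is memory bound, this is not guaranteed to push the speedup past the bandwidth limit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: maximum of the half-open index range [begin, end) of v.
// Requires begin < end <= v.size().
int chunk_max(const std::vector<int>& v, std::size_t begin, std::size_t end)
{
    const auto first = v.begin() + static_cast<std::ptrdiff_t>(begin);
    const auto last = v.begin() + static_cast<std::ptrdiff_t>(end);
    return *std::max_element(first, last);
}

// Hypothetical blocked variant of max_el_concurrent: thread t scans one
// contiguous slice instead of every n_threads-th element.
int max_el_blocked(const std::vector<int>& v, std::size_t n_threads)
{
    assert(!v.empty());
    // Clamp to [1, v.size()] so every thread gets at least one element
    // (std::thread::hardware_concurrency() is allowed to return 0).
    n_threads = std::max<std::size_t>(1, std::min<std::size_t>(n_threads, v.size()));

    std::vector<int> local(n_threads);
    std::vector<std::thread> workers;
    workers.reserve(n_threads);

    for (std::size_t t = 0; t < n_threads; ++t)
    {
        // Slice boundaries; the last slice ends exactly at v.size().
        const std::size_t begin = t * v.size() / n_threads;
        const std::size_t end = (t + 1) * v.size() / n_threads;
        workers.emplace_back([&v, &local, t, begin, end] {
            local[t] = chunk_max(v, begin, end);  // one write per thread
        });
    }
    for (auto& w : workers)
    {
        w.join();
    }
    // Reduce the per-thread maxima sequentially.
    return *std::max_element(local.begin(), local.end());
}
```

A call such as max_el_blocked(v, std::thread::hardware_concurrency()) would take the place of the strided worker loop in max_el_concurrent.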

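Additional notes:

Do not time I/O (including std::cout); it skews the results.
Try not to write to adjacent memory from different threads (as you do with the res vector), or your application will suffer from false sharing. You want at least 64 bytes between memory locations written by different threads. As a quick fix, accumulate the local maximum in a local variable and write it to max only once at the end.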

In this particular case, however, fixing both of these has no significant effect on overall performance.
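The two notes can be sketched as follows. The names PaddedMax, max_el_local, and timed_max are hypothetical, and the 64-byte line size is the assumption stated in this answer. The strided scan from the question is kept unchanged so that only the write pattern and the timing differ.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines. Explicit padding keeps the slots of two
// different threads at least 64 bytes apart, so they never share a line.
struct PaddedMax
{
    int value;
    char pad[64 - sizeof(int)];
};
static_assert(sizeof(PaddedMax) == 64, "one slot per assumed cache line");

// Same strided scan as the question's max_el, but the running maximum lives
// in a local variable and the shared slot is written exactly once.
// Requires tid < n_threads <= v.size().
void max_el_local(const std::vector<int>& v, PaddedMax& out,
                  std::size_t n_threads, std::size_t tid)
{
    int best = v[tid];
    for (std::size_t i = tid, end = v.size(); i < end; i += n_threads)
    {
        if (v[i] > best)
        {
            best = v[i];
        }
    }
    out.value = best;
}

// Returns the maximum and stores the elapsed seconds in `seconds`.
// Nothing is printed inside the timed region; the caller prints afterwards.
// Requires 1 <= n_threads <= v.size().
int timed_max(const std::vector<int>& v, std::size_t n_threads, double& seconds)
{
    using clock = std::chrono::steady_clock;
    std::vector<PaddedMax> res(n_threads);
    std::vector<std::thread> workers;
    workers.reserve(n_threads);

    const auto start = clock::now();
    for (std::size_t t = 0; t < n_threads; ++t)
    {
        workers.emplace_back(max_el_local, std::cref(v), std::ref(res[t]), n_threads, t);
    }
    for (auto& w : workers)
    {
        w.join();
    }
    int best = res[0].value;
    for (const auto& r : res)
    {
        if (r.value > best)
        {
            best = r.value;
        }
    }
    seconds = std::chrono::duration<double>(clock::now() - start).count();
    return best;
}
```

The caller would print the returned maximum and the elapsed time only after timed_max returns, which keeps std::cout out of the measurement.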

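Finally, your CPU (Core i5-5200) is a dual-core processor with hyper-threading. According to Intel, hyper-threading gives a speedup of 30% on average. This means you should expect a maximum speedup of 2.6 (2 + 2 * 0.3) rather than 4.0.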
