Question

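I have considered parallelizing a program so that in the first phase it sorts items into buckets modulo the number of parallel workers, which avoids collisions in the second phase. Each thread of the parallel program uses std::atomic::fetch_add to reserve a slot in the output array, and then uses std::atomic::compare_exchange_weak to update the current bucket head pointer. So it is lock-free. However, I had doubts about the performance of multiple threads contending for a single atomic (the one we call fetch_add on; the number of bucket heads equals the number of threads, so on average there is not much contention there), so I decided to measure it. Here is the code: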

#include <atomic>
#include <chrono>
#include <cstdint>  // int64_t, uint32_t
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

std::atomic<int64_t> gCounter(0);
const int64_t gnAtomicIterations = 10 * 1000 * 1000;

void CountingThread() {
  for (int64_t i = 0; i < gnAtomicIterations; i++) {
    gCounter.fetch_add(1, std::memory_order_acq_rel);
  }
}

void BenchmarkAtomic() {
  const uint32_t maxThreads = std::thread::hardware_concurrency();
  std::vector<std::thread> thrs;
  thrs.reserve(maxThreads + 1);

  for (uint32_t nThreads = 1; nThreads <= maxThreads; nThreads++) {
    auto start = std::chrono::high_resolution_clock::now();
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs.emplace_back(CountingThread);
    }
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs[i].join();
    }
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    printf("%d threads: %.3lf Ops/sec, counter=%lld\n", (int)nThreads, (nThreads * gnAtomicIterations) / nSec,
      (long long)gCounter.load(std::memory_order_acquire));

    thrs.clear();
    gCounter.store(0, std::memory_order_release);
  }
}

int main() {
  BenchmarkAtomic();
  return 0;
}

Here is the output:

1 threads: 150836387.770 Ops/sec, counter=10000000
2 threads: 91198022.827 Ops/sec, counter=20000000
3 threads: 78989357.501 Ops/sec, counter=30000000
4 threads: 66808858.187 Ops/sec, counter=40000000
5 threads: 68732962.817 Ops/sec, counter=50000000
6 threads: 64296828.452 Ops/sec, counter=60000000
7 threads: 66575046.721 Ops/sec, counter=70000000
8 threads: 64487317.763 Ops/sec, counter=80000000
9 threads: 63598622.030 Ops/sec, counter=90000000
10 threads: 62666457.778 Ops/sec, counter=100000000
11 threads: 62341701.668 Ops/sec, counter=110000000
12 threads: 62043591.828 Ops/sec, counter=120000000
13 threads: 61933752.800 Ops/sec, counter=130000000
14 threads: 62063367.585 Ops/sec, counter=140000000
15 threads: 61994384.135 Ops/sec, counter=150000000
16 threads: 61760299.784 Ops/sec, counter=160000000

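The CPU has 8 cores and 16 threads (Ryzen 1800X @ 3.9 GHz). The total number of operations per second across all threads drops sharply until 4 threads are used. After that it declines slowly and fluctuates a little.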

Is this behavior common to other CPUs and compilers? Is there any workaround (other than resorting to a single thread)?

Answer 1

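Lock-free multithreaded programs are not slower than single-threaded programs. What makes them slow is data contention. The example you provided is an artificial, highly contended program. In a real program you would do a lot of work between accesses to shared data, so there would be fewer cache invalidations and so on. Jeff Preshing's CppCon talk explains some of your questions better than I can.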

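Addendum: try modifying CountingThread to sleep once in a while, pretending the thread is busy with something other than incrementing the atomic variable gCounter. Then play with the value in the if statement to see how it influences the results of your program.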

void CountingThread() {
  for (int64_t i = 0; i < gnAtomicIterations; i++) {
    // take a nap every 10000th iteration to simulate work on something
    // unrelated to access to shared resource
    if (i % 10000 == 0) {
      std::chrono::milliseconds timespan(1);
      std::this_thread::sleep_for(timespan);
    }
    gCounter.fetch_add(1, std::memory_order_acq_rel);
  }
}

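In general, every call to gCounter.fetch_add marks that data as invalid in the other cores' caches. This forces them to fetch the data from a cache further away from the core. This effect is the major contributor to the slowdown in your program.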

local  L1 CACHE hit,                              ~4 cycles (   2.1 -  1.2 ns )
local  L2 CACHE hit,                             ~10 cycles (   5.3 -  3.0 ns )
local  L3 CACHE hit, line unshared               ~40 cycles (  21.4 - 12.0 ns )
local  L3 CACHE hit, shared line in another core ~65 cycles (  34.8 - 19.5 ns )
local  L3 CACHE hit, modified in another core    ~75 cycles (  40.2 - 22.5 ns )

remote L3 CACHE (Ref: Fig.1 [Pg. 5])        ~100-300 cycles ( 160.7 - 30.0 ns )

local  DRAM                                                   ~60 ns
remote DRAM                                                  ~100 ns

The table above is taken from "Approximate cost to access various caches and main memory?"

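Lock-free does not mean you can exchange data between threads at no cost. Lock-free means you do not have to wait for another thread to unlock a mutex before you can read shared data. In fact, even lock-free programs rely on locking mechanisms (at the hardware level) to prevent data corruption.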

Just follow a simple rule: access shared data as little as possible to get more out of multicore programming.

Answer 2

It depends on the concrete workload.

See Amdahl's law:

                     100 % (whole workload in percentage)
speedup =  -----------------------------------------------------------
            (sequential work load in %) + (parallel workload in %) / (count of workers)

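The parallel workload in your program is 0 %, so the speedup is 1, i.e. no speedup. (You are synchronizing in order to increment the same memory cell, so only one thread can increment the cell at any given time.)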
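To make the formula concrete, here is a minimal sketch that evaluates it with fractions instead of percentages. The function name AmdahlSpeedup and its parameter names are illustrative and do not come from the original answer.

```cpp
// Amdahl's law with fractions in [0, 1] instead of percentages:
//   speedup = 1 / (serial + parallel / workers), with serial + parallel == 1.
// AmdahlSpeedup and serialFraction are hypothetical names used for illustration.
double AmdahlSpeedup(double serialFraction, unsigned workers) {
  const double parallelFraction = 1.0 - serialFraction;
  return 1.0 / (serialFraction + parallelFraction / workers);
}
```

With a serial fraction of 1.0, which is how this answer characterizes the benchmark, AmdahlSpeedup(1.0, 16) evaluates to 1 regardless of the worker count.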

A rough explanation of why it performs even worse than speedup = 1:

With only one thread, the cache line containing gCounter stays in that CPU's cache.

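With multiple threads, which are scheduled onto different CPUs or cores, the cache line containing gCounter bounces around between the caches of those CPUs or cores.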

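So the difference is roughly comparable to that between a single thread incrementing a register and accessing memory on every increment. (It is sometimes faster than a memory access, because modern CPU architectures support cache-to-cache transfers.)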
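One possible way to stop the line from bouncing, sketched here under stated assumptions, is to give each thread its own counter on its own cache line and sum the counters after joining. The names PaddedCounter, CountWithShards and kCacheLineSize are hypothetical. The 64-byte line size is an assumption (it holds for the Ryzen 1800X in the question), and the over-aligned vector relies on C++17 aligned allocation.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines, as on most current x86 CPUs.
constexpr std::size_t kCacheLineSize = 64;

// One counter per cache line, so threads never write to the same line.
struct alignas(kCacheLineSize) PaddedCounter {
  std::atomic<int64_t> value{0};
};

// Each thread increments only its own shard; the total is computed after join().
int64_t CountWithShards(unsigned nThreads, int64_t iterations) {
  std::vector<PaddedCounter> shards(nThreads);
  std::vector<std::thread> threads;
  threads.reserve(nThreads);

  for (unsigned t = 0; t < nThreads; t++) {
    threads.emplace_back([&shards, t, iterations] {
      for (int64_t i = 0; i < iterations; i++) {
        // Relaxed is enough here: the value is only read after join().
        shards[t].value.fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  for (auto& th : threads) {
    th.join();
  }

  int64_t total = 0;
  for (const auto& shard : shards) {
    total += shard.value.load(std::memory_order_relaxed);
  }
  return total;
}
```

This only fits cases where the total is needed after the threads finish. It does not provide a single live value that all threads can observe while they run.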

Answer 3

As with most very broad "which is faster" questions, the only completely general answer is: it depends.

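A good mental model is that, when parallelizing an existing task, the runtime of the parallel version on N threads is made up of roughly three contributions: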

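A still-serial part common to both the serial and the parallel algorithm, i.e. work that was not parallelized, such as setup or teardown work, or work that did not run in parallel because the task was imperfectly partitioned¹.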

A parallel part, which is effectively parallelized among the N workers.

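An overhead component, which represents extra work done in the parallel algorithm that does not exist in the serial version. There is almost always some small overhead to partition the work, delegate it to worker threads and combine the results, but in some cases the overhead can outweigh the actual work.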

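So in general you have these three contributions; call them T1p, T2p and T3p respectively. The T1p component exists and takes the same time in both the serial and the parallel algorithm, so we can ignore it, since it cancels out when determining which is slower.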

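Of course, if you used coarser-grained synchronization, e.g. incrementing a local variable on each thread and updating the shared variable only periodically (perhaps only once at the very end), the situation would reverse.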
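The coarse-grained variant described above could look roughly like the following sketch. The names gBatchedCounter, BatchedCountingThread, RunBatched and the flushEvery parameter are hypothetical. Using memory_order_relaxed is an assumption that the counter is a pure tally and no other data is published through it.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

std::atomic<int64_t> gBatchedCounter(0);  // hypothetical counterpart of gCounter

// Counts in a thread-local variable and touches the shared atomic only once
// every flushEvery iterations, plus one final flush for the remainder.
void BatchedCountingThread(int64_t iterations, int64_t flushEvery) {
  int64_t local = 0;
  for (int64_t i = 0; i < iterations; i++) {
    local++;
    if (local == flushEvery) {
      gBatchedCounter.fetch_add(local, std::memory_order_relaxed);
      local = 0;
    }
  }
  if (local != 0) {
    gBatchedCounter.fetch_add(local, std::memory_order_relaxed);
  }
}

// Runs nThreads batched counters and returns the final shared value.
int64_t RunBatched(unsigned nThreads, int64_t iterations, int64_t flushEvery) {
  gBatchedCounter.store(0, std::memory_order_relaxed);
  std::vector<std::thread> threads;
  threads.reserve(nThreads);
  for (unsigned t = 0; t < nThreads; t++) {
    threads.emplace_back(BatchedCountingThread, iterations, flushEvery);
  }
  for (auto& th : threads) {
    th.join();
  }
  return gBatchedCounter.load(std::memory_order_relaxed);
}
```

The trade-off is that the shared value lags behind the real count by up to flushEvery per thread while the threads are running, so this only helps when intermediate values do not need to be exact.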

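¹ This also includes the case where the workload was well partitioned but some threads got more work done per unit of time, which is common on modern CPUs and modern operating systems.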

Is lock-free multithreading slower than a single-threaded program?
