Question

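The code below demonstrates a curiosity of multithreaded programming: specifically, the performance of a std::memory_order_relaxed increment versus a regular increment in a single thread. I don't understand why fetch_add (relaxed) in a single thread is twice as slow as a regular increment.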

static void BM_IncrementCounterLocal(benchmark::State& state) {
  volatile std::atomic_int val2{0};  // initialized: before C++20 a default-constructed atomic is indeterminate

  while (state.KeepRunning()) {
    for (int i = 0; i < 10; ++i) {
      DoNotOptimize(val2.fetch_add(1, std::memory_order_relaxed));
    }
  }
}
BENCHMARK(BM_IncrementCounterLocal)->ThreadRange(1, 8);

static void BM_IncrementCounterLocalInt(benchmark::State& state) {
  volatile int val3 = 0;

  while (state.KeepRunning()) {
    for (int i = 0; i < 10; ++i) {
      DoNotOptimize(++val3);
    }
  }
}
BENCHMARK(BM_IncrementCounterLocalInt)->ThreadRange(1, 8);

Output:

      Benchmark                               Time(ns)    CPU(ns) Iterations
      ----------------------------------------------------------------------
      BM_IncrementCounterLocal/threads:1            59         60   11402509                                 
      BM_IncrementCounterLocal/threads:2            30         61   11284498                                 
      BM_IncrementCounterLocal/threads:4            19         62   11373100                                 
      BM_IncrementCounterLocal/threads:8            17         62   10491608

      BM_IncrementCounterLocalInt/threads:1         31         31   22592452                                 
      BM_IncrementCounterLocalInt/threads:2         15         31   22170842                                 
      BM_IncrementCounterLocalInt/threads:4          8         31   22214640                                 
      BM_IncrementCounterLocalInt/threads:8          9         31   21889704

Answer 1

With volatile int, the compiler must ensure that it does not optimize away and/or reorder any reads or writes of the variable.

With fetch_add, the CPU must take precautions so that the read-modify-write operation is atomic.

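These are two completely different requirements: the atomicity requirement means the CPU must communicate with the other CPUs on the machine to ensure that they do not read or write the given memory location between its own read and write. If the compiler implements fetch_add with a compare-and-swap instruction, it will actually emit a short loop to catch the case where another CPU modified the value in between.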
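The compare-and-swap loop mentioned above can be sketched roughly as follows. This is only an illustration of the idea, not what any particular compiler emits; on x86, compilers typically use a single locked instruction for fetch_add instead of a loop. The name fetch_add_via_cas is a hypothetical helper introduced here.

```cpp
#include <atomic>

// Hypothetical helper: emulates fetch_add with a compare-and-swap retry loop.
// Returns the value the counter held just before the addition.
int fetch_add_via_cas(std::atomic<int>& target, int delta) {
    int expected = target.load(std::memory_order_relaxed);
    // compare_exchange_weak stores expected + delta only if target still
    // holds `expected`. If another thread changed the value in between (or
    // the weak form fails spuriously), it reloads `expected` with the
    // current value and the loop retries.
    while (!target.compare_exchange_weak(expected, expected + delta,
                                         std::memory_order_relaxed)) {
        // Nothing to do here: `expected` has already been refreshed.
    }
    return expected;
}
```

Calling fetch_add_via_cas(counter, 1) on a std::atomic<int> counter behaves like counter.fetch_add(1, std::memory_order_relaxed), but it makes the retry-on-interference step explicit.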

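For volatile int, no such communication is needed. Instead, volatile requires that the compiler not invent any reads: volatile was designed for single-threaded communication with hardware registers, where the mere act of reading a value can have side effects.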

Answer 2

The local version does not use atomics. (The fact that it uses volatile is a red herring: volatile has essentially no meaning in multithreaded code.)

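The atomic version does use atomics (!). The fact that only one thread will ever actually access the variable is invisible to the CPU, and I'm not surprised that the compiler doesn't spot it either. (There is no point in spending developer effort on working out whether it is safe to convert a std::atomic_int to an int when it almost never will be. Nobody is going to write atomic_int unless they need to access it from multiple threads.)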

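So the atomic version goes to the trouble of making sure the increment really is atomic, and frankly, I'm surprised it is only 2x slower; I would have expected something more like 10x.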
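To illustrate what the atomic version is paying for, here is a minimal sketch in which several threads increment one shared counter with relaxed fetch_add. The helper name count_with_relaxed_atomic and its parameters are assumptions made for this example. Because each increment is an atomic read-modify-write, no update is lost; doing the same with a plain or volatile int would be a data race.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: num_threads threads each add 1 to a shared counter
// per_thread times; the final total is returned.
int count_with_relaxed_atomic(int num_threads, int per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // Relaxed ordering still guarantees atomicity of the
                // increment itself; it only drops ordering guarantees
                // relative to other memory operations.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // join() synchronizes with each thread's completion, so the load below
    // observes every increment.
    for (auto& w : workers) {
        w.join();
    }
    return counter.load(std::memory_order_relaxed);
}
```

With 4 threads and 10000 increments each, the function returns exactly 40000. That guarantee is what the extra cost in the single-threaded benchmark buys, even though the benchmark never needs it.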

atomic fetch_add vs add performance
