Question

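I have a program whose performance depends on the rotate-left instruction.

Under MSVC it runs quite well: it is enough to define the rotate-left operation in terms of the _rotl() intrinsic.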

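Under GCC for Linux it also runs well. There it is enough to define the equivalent software construct rotl32(x,r) = ((x << r) | (x >> (32 - r))); the compiler is smart enough to recognize this as a 32-bit left rotate and automatically replaces it with the intrinsic equivalent (to be fair, MSVC is capable of the same detection).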
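One caveat about the software construct above: written literally, it shifts by 32 when r is 0, which is undefined behavior in C. A minimal sketch of a variant that masks the counts is shown below. It keeps the rotl32 name from the question; the unsigned count type and the masking are my assumptions, not something the original code does. Whether a given compiler still lowers this form to a single rotate instruction is worth confirming in the generated assembly.

```c
#include <stdint.h>

/* Rotate x left by r bits. The count is reduced modulo 32, so r == 0
   and r >= 32 are well defined (assumption: this is the desired
   behavior for out-of-range counts). */
static inline uint32_t rotl32(uint32_t x, unsigned r) {
    r &= 31u;
    /* (32 - r) & 31 is 0 when r is 0, so no shift ever reaches 32. */
    return (x << r) | (x >> ((32u - r) & 31u));
}
```

For example, rotl32(0x12345678u, 16) swaps the two 16-bit halves, and rotl32(x, 0) returns x unchanged.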

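Under MinGW, not so much. This is all the more intriguing because MinGW uses GCC. MinGW can compile the Windows intrinsic _rotl, but apparently without triggering the corresponding intrinsic. The software version does not seem to be detected either, although, to be fair, it is still faster than _rotl. The end result is a 10x drop in performance, so it definitely matters.

Note: the GCC version of the MinGW I tested is 4.6.2.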

Answer 1

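In case you run into trouble with the intrinsics on Windows, here is a way to do it on x86 with inline assembler:

#include <stdint.h>

/* AT&T syntax: rotate x left by the count held in CL (the "c" constraint). */
uint32_t rotl32_2(uint32_t x, uint8_t r) {
  asm("roll %1,%0" : "+r" (x) : "c" (r));
  return x;
}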

Tested with gcc on Ubuntu, but it should work fine on MinGW.
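Since the asm above only makes sense on x86 with a GCC-compatible compiler, a guarded variant might look like the sketch below. The name rotl32_asm, the architecture check, and the C fallback are all my additions for illustration. The "cI" constraint is also an assumption on my part: "I" is the x86 machine constraint for a constant in the range 0..31, so the compiler may emit an immediate count instead of loading CL when the count is known at compile time.

```c
#include <stdint.h>

/* Hypothetical wrapper: inline asm on x86 with GCC-compatible compilers,
   plain C everywhere else. */
static inline uint32_t rotl32_asm(uint32_t x, uint8_t r) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "c": count in CL; "I": alternatively a constant 0..31. */
    __asm__("roll %1,%0" : "+r"(x) : "cI"(r));
    return x;
#else
    /* Portable fallback with masked counts (no shift by 32). */
    r &= 31;
    return (x << r) | (x >> ((32 - r) & 31));
#endif
}
```

Both paths reduce the count modulo 32: the hardware does so for the CL operand, and the fallback masks explicitly.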

Answer 2

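Just add the intrin.h header.

This is a Windows-specific header, so if you are developing cross-platform software, don't forget to wrap it in a condition like this: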

#ifdef _WIN32
# include <intrin.h>
#endif
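Building on that guard, one way to keep call sites identical across platforms is a small dispatch macro. This is only a sketch: ROTL32 and rotl32_fallback are hypothetical names, and the non-Windows branch uses a masked shift-and-or idiom that I am assuming is acceptable as a fallback.

```c
#include <stdint.h>

#ifdef _WIN32
# include <intrin.h>                 /* makes _rotl available as an intrinsic */
# define ROTL32(x, r) _rotl((x), (r))
#else
/* Fallback for non-Windows targets: shift-and-or with masked counts. */
static inline uint32_t rotl32_fallback(uint32_t x, unsigned r) {
    r &= 31u;
    return (x << r) | (x >> ((32u - r) & 31u));
}
# define ROTL32(x, r) rotl32_fallback((x), (r))
#endif
```

Code then calls ROTL32(value, 16) everywhere, and only this header decides which implementation is used.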

Benchmark

Run on (4 X 3310 MHz CPU s)
09/07/16 23:29:35
Benchmark                    Time           CPU Iterations
----------------------------------------------------------
BM_rotl/8                   19 ns         18 ns   37392923
BM_rotl/64                 156 ns        149 ns    4487151
BM_rotl/512               1148 ns       1144 ns     641022
BM_rotl/4k                9286 ns       9178 ns      74786
BM_rotl/32k              71575 ns      69535 ns       8974
BM_rotl/256k            583148 ns     577204 ns       1000
BM_rotl/2M             4769689 ns    4830999 ns        155
BM_rotl/8M            19997537 ns   18720120 ns         35
BM_rotl_intrin/8             6 ns          6 ns  112178768
BM_rotl_intrin/64           55 ns         53 ns   14022346
BM_rotl_intrin/512         431 ns        407 ns    1725827
BM_rotl_intrin/4k         3327 ns       3338 ns     224358
BM_rotl_intrin/32k       27093 ns      26596 ns      26395
BM_rotl_intrin/256k     217633 ns     214167 ns       3205
BM_rotl_intrin/2M      1885492 ns    1853925 ns        345
BM_rotl_intrin/8M      8015337 ns    7626716 ns         90

Benchmark code

#include <cstdint>
#include <benchmark/benchmark.h>

#define MAKE_ROTL_BENCHMARK(name) \
  static void name(benchmark::State& state) { \
    auto arr = new uint32_t[state.range(0)]; \
    while (state.KeepRunning()) { \
      for (int i = 0; i < state.range(0); ++i) { \
        arr[i] = _rotl(arr[i], 16); \
      } \
    } \
    delete [] arr; \
  } \
  /**/

// Compiled before intrin.h is included.
MAKE_ROTL_BENCHMARK(BM_rotl)
#include <intrin.h>
// Compiled after intrin.h is included.
MAKE_ROTL_BENCHMARK(BM_rotl_intrin)

#undef MAKE_ROTL_BENCHMARK

BENCHMARK(BM_rotl)->Range(8, 8<<20);
BENCHMARK(BM_rotl_intrin)->Range(8, 8<<20);

BENCHMARK_MAIN();

Poor _rotl performance under MinGW
