Question

I have been comparing CPU AES-NI against GPU AES for some time now. I recently updated my g++ compiler (from 4.6 to 4.8), and CPU AES-NI performance improved significantly (~2x).

I have a simplified piece of C code that "simulates" AES encryption using the AES-NI instructions (shown below).

__m128i cipher_128i;
_ALIGNED(16) unsigned char in_alligned[16];
_ALIGNED(16) unsigned char out_alligned[16];

// store plaintext in cipher variable, then encrypt
memcpy(in_alligned, buf_in, 16);
cipher_128i = _mm_load_si128((__m128i *) in_alligned);

cipher_128i = _mm_xor_si128(cipher_128i, key_exp_128i);
/* then do 9 rounds of aesenc, using the associated key parts */
cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
/* then 1 aesenclast round */
cipher_128i = _mm_aesenclast_si128(cipher_128i, key_exp_128i);

// store back from register & copy to destination
_mm_store_si128((__m128i *) out_alligned, cipher_128i);
memcpy(buf_out, out_alligned, 16);

Running this code on an AMD 5400K (serial execution) over 1 GB of buf_in data gives the following results:

g++-4.6 | real 0m2.982s, user 0m2.466s, sys 0m0.433s
g++-4.7 | real 0m1.453s, user 0m0.877s, sys 0m0.512s
g++-4.8 | real 0m1.157s, user 0m0.592s, sys 0m0.468s

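I generated the assembly for each version of g++ (4.6, 4.7, 4.8) and found that the compiler is replacing the movdqa / movq type of instructions with movdqu (see the image linked here). http://postimg.org/image/q6j8qwyol/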

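Is it safe to assume that this is the source of the improvement? Does it make sense? Why didn't g++ 4.6 consider this instruction in the first place?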
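To make the instruction change concrete, here is a minimal sketch of the two copy strategies in intrinsics form. It is not the code from the benchmark: the AES rounds are replaced by a single XOR so that it needs only SSE2, and the function names xor_block_staged and xor_block_direct are hypothetical. The first function stages the data through aligned buffers as in the snippet above; the second loads and stores through the unaligned intrinsics, which is roughly the pattern the newer compilers appear to emit.

```c
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_loadu_si128, _mm_xor_si128 */
#include <string.h>

/* Staged path, as in the question: copy into an aligned buffer,
 * then use the aligned load/store intrinsics (movdqa). */
static void xor_block_staged(unsigned char *buf_out,
                             const unsigned char *buf_in, __m128i key)
{
    _Alignas(16) unsigned char in_aligned[16];
    _Alignas(16) unsigned char out_aligned[16];

    memcpy(in_aligned, buf_in, 16);
    __m128i block = _mm_load_si128((const __m128i *) in_aligned);
    block = _mm_xor_si128(block, key);   /* stand-in for the AES rounds */
    _mm_store_si128((__m128i *) out_aligned, block);
    memcpy(buf_out, out_aligned, 16);
}

/* Direct path: unaligned load/store (movdqu), no staging buffers. */
static void xor_block_direct(unsigned char *buf_out,
                             const unsigned char *buf_in, __m128i key)
{
    __m128i block = _mm_loadu_si128((const __m128i *) buf_in);
    block = _mm_xor_si128(block, key);   /* stand-in for the AES rounds */
    _mm_storeu_si128((__m128i *) buf_out, block);
}
```

Both functions produce the same bytes for any source and destination address; they differ only in how the 16 bytes reach and leave the register.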

Answer 1

I noticed 3 things that affect the performance across the 3 compiler versions:

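1) Better copying of the data. The old GCC seems to break the 16B copy into two 8B loads/stores. This is probably because unaligned instructions used to perform badly on older architectures (they were microcoded). Since Intel's Nehalem processors, unaligned instructions are as fast as aligned ones, assuming no cache-line split. So the compiler tries to take advantage of this by using unaligned instructions more liberally.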

2) It looks like GCC optimized away the buffer overflow check, which was causing some overhead. I haven't looked carefully into why.

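3) It looks like they also optimized away the need to dynamically align the stack pointer to 32B (needed in the first case so that movdqa can be used; not needed in the second case, so probably a perf bug there; and optimized away in the third case).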
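The "no cache-line split" condition in point 1 can be checked with simple address arithmetic. The sketch below assumes a 64-byte cache line, which is typical of current x86 parts but is an assumption here, and the helper name splits_cache_line is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if an access of `width` bytes starting at
 * `addr` spans two cache lines of `line_size` bytes, 0 otherwise. */
static int splits_cache_line(uintptr_t addr, size_t width, size_t line_size)
{
    size_t offset = (size_t)(addr % line_size);  /* position within the line */
    return offset + width > line_size;
}
```

For a 16-byte load with 64-byte lines, offsets 0 through 48 stay within one line, while offsets 49 through 63 straddle two lines and may pay the split penalty. A load from a 16-byte-aligned address never splits.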

Huge performance improvement in AES-NI: g++ 4.6 vs 4.7 vs 4.8
