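My new PC has a Core i7 CPU, and I am running my benchmarks on it, including newer versions that use AVX instructions. I installed Visual Studio 2013 to get a more recent compiler, because my previous one could not generate full SSE SIMD code. Below is some of the code from one of my benchmarks (MPMFLOPS), along with the compile and link commands used. The tests were run using the first cl command, which produces SSE instructions. With xtra at 16 or less, the benchmark produces 24.4 GFLOPS. The CPU runs at 3.9 GHz, so that is about 6.25 floating-point results per clock cycle, where the maximum is 4 multiplies and 4 adds. Increasing xtra above 16 produces 2.6 GFLOPS. Reducing words to lower values makes the speed worse.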
/*
Visual Studio 2013
C/C++ Optimizing Compiler Version 18.00.21005.1 for x64
cl /O2 /Oi /MD /W4 /TP /EHsc /Zi /Fa /c mflops.c
cl /O2 /Oi /MD /W4 /TP /EHsc /Zi /Fa /arch:AVX /c mflops.c
link /LARGEADDRESSAWARE mflops.obj CPUasm.obj asmtimeavx.obj BUFFEROVERFLOWU.LIB
The link includes CPUID information, with identification of AVX, and the timer
*/
#include <stdio.h>
#include <stdlib.h>
#include "asmtimeavx.h"
#include <windows.h>
#include <time.h>
#include <malloc.h>
int main()
{
    float *x;
    float a = 0.000020f;
    float b = 0.999980f;
    float c = 0.000011f;
    float d = 1.000011f;
    float e = 0.000012f;
    float f = 0.999992f;
    float mflops;
    int i, j;
    int xtra = 16; // 24447 MFLOPS, > 16 around 2600 MFLOPS
    int words = 1000000;

    x = (float *)_aligned_malloc(words * 4, 16);
    for (i = 0; i < words; i++) x[i] = 0.999999f;

    start_time();
    for (j = 0; j < xtra; j++)
    {
        for (i = 0; i < words; i++)
        {
            x[i] = (x[i] + a)*b - (x[i] + c)*d + (x[i] + e)*f;
        }
    }
    end_time();

    mflops = (float)words * (float)xtra * 8.0f / 1000000.0f / (float)secs;
    printf("%18.8f, %18.8f, %10.7f secs, %8.2f mflops\n\n", x[0],
           x[words-1], secs, mflops);
    _aligned_free(x);
    return 0;
}
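As a cross-check on the figures quoted above, here is a minimal sketch of the two calculations involved. The function names `mflops_rate` and `flops_per_cycle` are hypothetical and are not part of the benchmark; the count of 8 floating-point operations per element is read off the loop expression and matches the `8.0f` in the benchmark's own formula.

```c
/* Operations per element in:
 *   x[i] = (x[i] + a)*b - (x[i] + c)*d + (x[i] + e)*f;
 * three adds inside the parentheses, three multiplies,
 * one subtract and one final add. */
#define FLOPS_PER_ELEMENT 8.0

/* Same formula as the benchmark's mflops line, in double precision.
 * words: array length, xtra: number of passes, secs: elapsed time. */
double mflops_rate(int words, int xtra, double secs)
{
    return (double)words * (double)xtra * FLOPS_PER_ELEMENT / 1.0e6 / secs;
}

/* Floating-point results per clock cycle, given a measured rate
 * in GFLOPS and a clock speed in GHz. */
double flops_per_cycle(double gflops, double clock_ghz)
{
    return gflops / clock_ghz;
}
```

With words = 1000000 and xtra = 16 the timed region performs 128 million operations, so at 24447 MFLOPS it lasts only about 5.2 ms. `flops_per_cycle(24.4, 3.9)` gives roughly 6.26, and `flops_per_cycle(2.6, 3.9)` gives roughly 0.67.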
The generated assembly code is shown below. mulps is full SIMD, operating on four values in a 128-bit register, whereas mulss operates on a single float (SISD).
Windows SSE
words = 1000000

xtra = 16                               xtra > 16

call start_time
npad 10
$LL6@main:
mov rcx, rsi
mov edx, 125000
npad 8
$LL3@main:
movups xmm1, XMMWORD PTR [rcx]
add rcx, 32                             movaps xmm1, xmm2
movaps xmm2, xmm1                       movaps xmm0, xmm2
movaps xmm0, xmm1                       addss xmm2, xmm8
addps xmm1, xmm10                       addss xmm0, xmm6
addps xmm2, xmm6                        addss xmm1, xmm4
addps xmm0, xmm8                        dec rax
mulps xmm1, xmm11                       mulss xmm2, xmm9
mulps xmm2, xmm7                        mulss xmm0, xmm7
mulps xmm0, xmm9                        mulss xmm1, xmm5
subps xmm2, xmm0                        subss xmm1, xmm0
addps xmm2, xmm1                        addss xmm1, xmm2
movups XMMWORD PTR [rcx-32], xmm2       movaps xmm2, xmm1
movups xmm1, XMMWORD PTR [rcx-16]       movaps xmm0, xmm1
movaps xmm2, xmm1                       addss xmm1, xmm8
movaps xmm0, xmm1                       addss xmm2, xmm4
addps xmm2, xmm6                        addss xmm0, xmm6
addps xmm0, xmm8                        mulss xmm1, xmm9
addps xmm1, xmm10                       mulss xmm0, xmm7
mulps xmm2, xmm7                        mulss xmm2, xmm5
mulps xmm0, xmm9                        subss xmm2, xmm0
mulps xmm1, xmm11                       addss xmm2, xmm1
subps xmm2, xmm0                        movaps xmm3, xmm2
addps xmm2, xmm1                        movaps xmm0, xmm2
movups XMMWORD PTR [rcx-16], xmm2       addss xmm2, xmm8
dec rdx
jne SHORT $LL3@main                     More of the same
dec rbx                                 Loop is 82 lines
jne SHORT $LL6@main
call end_time
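To make the mulps/mulss distinction in the listing above concrete, here is a small sketch using the standard SSE intrinsics from `<xmmintrin.h>`. The wrapper functions `mul_packed` and `mul_scalar` are hypothetical names introduced only for illustration. This is not the benchmark code, and it is not offered as a fix for the compiler's behaviour.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_mul_ps, _mm_mul_ss */

/* Packed multiply: all four float lanes are multiplied.
 * _mm_mul_ps corresponds to the mulps instruction. */
void mul_packed(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);   /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));
}

/* Scalar multiply: only lane 0 is multiplied; lanes 1..3 of the
 * result are copied from the first operand.
 * _mm_mul_ss corresponds to the mulss instruction. */
void mul_scalar(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ss(va, vb));
}
```

For inputs {1, 2, 3, 4} and {2, 2, 2, 2}, `mul_packed` writes {2, 4, 6, 8} while `mul_scalar` writes {2, 2, 3, 4}. Each scalar instruction does a quarter of the useful work of its packed counterpart, which fits the large drop in speed seen when xtra exceeds 16.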
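Next, I compiled the program to use AVX instructions, but it ran at the same speed as the SSE version. The generated assembly code is below. Also shown is part of the code generated under Linux (GCC on Ubuntu 14.04), which is much faster. The Linux speeds are somewhat reduced, but the parameters shown are ones that produce SISD-type results under Windows.
Note that the Windows code uses 128-bit xmm registers, whereas the Linux code uses 256-bit ymm registers. Can anyone explain what is happening with this sample program, or suggest how to improve its performance?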
Windows AVX                             Linux AVX
Only uses xmm registers                 Parts can use ymm registers
                                        Other parts include xmm
                                        xtra = 1000000; words = 10000; 44271 MFLOPS
call start_time                         xtra = 10000000; words = 1000; 45653 MFLOPS
npad 6                                  SSE
$LL6@main:                              xtra = 1000000; words = 10000; 24492 MFLOPS
mov rcx, rsi
mov edx, 125000
npad 8
$LL3@main:                              .L24:
vmovups xmm0, XMMWORD PTR [rcx]         vmovaps (%rcx,%rax), %ymm6
lea rcx, QWORD PTR [rcx+32]             addl $1, %edx
vmovups xmm5, xmm0                      vaddps %ymm5, %ymm6, %ymm13
vaddps xmm1, xmm0, xmm6                 vaddps %ymm3, %ymm6, %ymm7
vaddps xmm0, xmm0, xmm8                 vaddps %ymm1, %ymm6, %ymm6
vmulps xmm2, xmm0, xmm9                 vmulps %ymm4, %ymm13, %ymm13
vmulps xmm3, xmm1, xmm7                 vmulps %ymm2, %ymm7, %ymm7
vsubps xmm4, xmm3, xmm2                 vmulps %ymm0, %ymm6, %ymm6
vaddps xmm1, xmm5, xmm10                vsubps %ymm7, %ymm13, %ymm7
vmulps xmm0, xmm1, xmm11                vaddps %ymm6, %ymm7, %ymm6
vaddps xmm2, xmm4, xmm0                 vmovaps %ymm6, (%rcx,%rax)
vmovups XMMWORD PTR [rcx-32], xmm2      addq $32, %rax
vmovups xmm0, XMMWORD PTR [rcx-16]      cmpl %edx, %esi
vaddps xmm1, xmm0, xmm6                 ja .L24
vmovups xmm5, xmm0
vaddps xmm0, xmm0, xmm8
vmulps xmm2, xmm0, xmm9
vmulps xmm3, xmm1, xmm7
vaddps xmm1, xmm5, xmm10
vsubps xmm4, xmm3, xmm2
vmulps xmm0, xmm1, xmm11
vaddps xmm2, xmm4, xmm0
vmovups XMMWORD PTR [rcx-16], xmm2
dec rdx
jne SHORT $LL3@main
dec rbx
jne $LL6@main
call end_time
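The loop counts in the listings can be checked against the register widths with a little arithmetic. The sketch below uses a hypothetical helper, `inner_iterations`, and assumes that words is an exact multiple of the number of floats handled per iteration.

```c
/* Inner-loop trip count for a vectorized, unrolled loop over `words` floats.
 * vector_bits: register width in bits (128 for xmm, 256 for ymm).
 * unroll:      number of vectors processed per loop iteration.
 * Assumes words divides evenly; any remainder is ignored here. */
int inner_iterations(int words, int vector_bits, int unroll)
{
    int floats_per_vector = vector_bits / 32;   /* 32-bit floats */
    return words / (floats_per_vector * unroll);
}
```

For the Windows code, `inner_iterations(1000000, 128, 2)` gives 125000, matching `mov edx, 125000`: each pass through `$LL3@main` handles two 128-bit vectors, or 8 floats. The Linux loop advances by 32 bytes (`addq $32, %rax`), so it also handles 8 floats per iteration, but with a single 256-bit vector and about half as many arithmetic instructions.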