Question

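I am working on an application that frequently needs to convert 6 to 8 signed 32-bit integers to 32-bit real numbers. I replaced the Delphi code with custom assembler code, and to my surprise the FPU conversion is always at least as fast as the SSE conversion, and on some computers a good deal faster. Here is some code that illustrates this: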

program Project1;

{$R *.res}

uses
 windows,dialogs,sysutils;

type
 piiii=^tiiii;
 tiiii=record i1,i2,i3,i4:longint; end;
 pssss=^tssss;
 tssss=record s1,s2,s3,s4:single; end;

var
 convert_value:single=13579.02468;

function convert_x87(adata:longint):single;
asm
 mov [esp-4],eax
 fild longint([esp-4])
 fmul [convert_value]
end;

procedure convert_sse(afrom,ato,aconv:pointer);
asm
 CVTDQ2PS xmm0,[eax]
 mulps xmm0,[ecx]
 movaps [edx],xmm0
end;

procedure get_mem(var p1,p2:pointer);
begin
 getmem(p1,31);
 p2:=pointer((longint(p1)+15) and (not 15));
end;

var
 a,b,c,d:cardinal;
 z:single;
 i:piiii;
 s1,s2:pssss;
 w1,w2,w3:pointer;
begin
 b:=gettickcount;
 a:=0;
 repeat
  z:=convert_x87(a);

  inc(a);
 until a=0;
 c:=gettickcount-b;

 get_mem(pointer(w1),pointer(i));
 get_mem(pointer(w2),pointer(s1));
 get_mem(pointer(w3),pointer(s2));

 s1.s1:=convert_value;
 s1.s2:=convert_value;
 s1.s3:=convert_value;
 s1.s4:=convert_value;

 b:=gettickcount;
 i.i1:=0;
 i.i2:=1;
 i.i3:=2;
 i.i4:=3;
 repeat
  convert_sse(i,s2,s1);

  inc(i.i1,4);
  inc(i.i2,4);
  inc(i.i3,4);
  inc(i.i4,4);
 until i.i1=0;
 d:=gettickcount-b;

 freemem(w1);
 freemem(w2);
 freemem(w3);

 showmessage('FPU:'+inttostr(c)+'/SSE:'+inttostr(d));
end.

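A rescale (which is a multiplication) is needed during the conversion, which is why there is one in the code. The value used is just an arbitrary one I picked, but the result is the same no matter what value I use. There is also a very small difference in rounding between the FPU and SSE, but it doesn't matter in this case.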
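The rounding difference comes from where the rounding happens: fild loads the 32-bit integer exactly and the x87 result is rounded to single only when it is stored, whereas cvtdq2ps rounds the integer to single before the multiply. Below is a minimal sketch in C (not the Delphi code from this question) that imitates both paths. The function names are hypothetical, and double is only an approximation of the x87's wider intermediate format.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the x87 path: the integer is converted exactly and the
   product is rounded to single only at the end. double approximates the
   x87's wider intermediate format (an assumption of this sketch). */
static float scale_wide(int32_t v, float scale) {
    return (float)((double)v * (double)scale);
}

/* Stand-in for the SSE path: the integer is rounded to single first
   (as cvtdq2ps does), then multiplied in single precision. */
static float scale_single(int32_t v, float scale) {
    volatile float f = (float)v;   /* volatile forces the rounding to single */
    volatile float r = f * scale;
    return r;
}

/* Counts the inputs in [first, first + count) for which the two paths
   disagree and reports the largest relative difference through max_rel.
   The inputs are assumed to be non-zero. */
static unsigned count_mismatches(int32_t first, unsigned count,
                                 float scale, double *max_rel) {
    assert(max_rel != NULL);
    unsigned mismatches = 0;
    *max_rel = 0.0;
    for (unsigned k = 0; k < count; ++k) {
        int32_t v = first + (int32_t)k;
        float a = scale_wide(v, scale);
        float b = scale_single(v, scale);
        if (a != b) {
            double rel = fabs((double)a - (double)b) / fabs((double)a);
            if (rel > *max_rel) *max_rel = rel;
            ++mismatches;
        }
    }
    return mismatches;
}
```

Calling count_mismatches(100000000, 1000, 13579.02468f, &max_rel) reports how many of 1000 consecutive inputs differ between the two paths. Any difference should be on the order of one unit in the last place, which is why it does not matter here.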

But if you run that code, you will see that the FPU path is never slower than the SSE path, which makes no sense. Does anyone know what is going on?

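Edit: here is different source code with the loop written in assembler. The results are quite interesting. If the increment instructions are commented out, the SSE version is noticeably faster than the FPU version, but if the increment instructions are included, the two run at about the same speed: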

program Project1;

{$R *.res}

uses
 windows,dialogs,sysutils;

type
 piiii=^tiiii;
 tiiii=record i1,i2,i3,i4:longint; end;
 pssss=^tssss;
 tssss=record s1,s2,s3,s4:single; end;

var
 convert_value:single=13579.02468;

procedure test_convert_x87;
asm
 // init test data
 push ebx
 xor ebx,ebx

 mov [esp-4],$98765432

 // convert and multiply 1 int32 to 1 single
@next_loop:
// inc [esp-4]
 fild longint([esp-4])
 fmul [convert_value]
 fstp single([esp-8])

 // loop
 dec ebx
 jnz @next_loop

 pop ebx
end;

procedure test_convert_sse(afrom,ato,aconv:pointer);
asm
 // init test data
 push ebx
 xor ebx,ebx

 mov [eax+0],$98765432
 mov [eax+4],$98765432
 mov [eax+8],$98765432
 mov [eax+12],$98765432

 // convert and multiply 4 int32 to 4 single
@next_loop:
// inc [eax+0]
// inc [eax+4]
// inc [eax+8]
// inc [eax+12]
 cvtdq2ps xmm0,[eax]
 mulps xmm0,[ecx]
 movaps [edx],xmm0

 // loop
 sub ebx,4
 jnz @next_loop

 pop ebx
end;

procedure get_mem(var p1,p2:pointer);
begin
 getmem(p1,31);
 p2:=pointer((longint(p1)+15) and (not 15));
end;

var
 b,c,d:cardinal;
 i:piiii;
 s1,s2:pssss;
 w1,w2,w3:pointer;
begin
 b:=gettickcount;
 test_convert_x87;
 c:=gettickcount-b;

 get_mem(pointer(w1),pointer(i));
 get_mem(pointer(w2),pointer(s1));
 get_mem(pointer(w3),pointer(s2));

 s1.s1:=convert_value;
 s1.s2:=convert_value;
 s1.s3:=convert_value;
 s1.s4:=convert_value;

 b:=gettickcount;
 test_convert_sse(i,s2,s1);
 d:=gettickcount-b;

 freemem(w1);
 freemem(w2);
 freemem(w3);

 showmessage('FPU:'+inttostr(c)+'/SSE:'+inttostr(d));
end.

Answer 1

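The main reason your asm looks slow is that it doesn't keep things in registers. Four inc instructions on four consecutive memory locations is insane; no wonder it's slow, especially if you are just going to read them back from memory on the next iteration. Set up your loop-counter vector outside the loop, then increment it by adding a vector of { 1, 1, 1, 1 } to it.

Your question also doesn't include any reminder of what the 32-bit Windows calling convention is (which arg goes in which register), so I had to work that out by comparing your function's argument names with how you use them (afrom in eax, ato in edx, aconv in ecx).

So your inner loop could be something like: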

; *untested*
    movdqa xmm1, [ vector_of_ones ]   ; or pcmpeqd xmm1,xmm1 -> all-ones bits, then psrld xmm1,31 -> { 1, 1, 1, 1 }
    xor ebx, ebx  ; loop counter
;  also broadcast the scale value to xmm4, e.g. movss xmm4,[ecx] then shufps xmm4,xmm4,0
    movdqa   xmm2, [eax]   ; values to be incremented and converted
next_loop:
    cvtdq2ps xmm0, xmm2
    mulps    xmm0, xmm4  ; scale
    movaps   [edx], xmm0
    paddd    xmm2, xmm1  ; increment counters
    sub      ebx, 4
    jne      next_loop  ; 2^30 iterations of 4 conversions each = 2^32 conversions

    ; movdqa    [eax], xmm2   ; store the incremented loop counter?
    ;  Not sure if this was desired, or a side effect of using mem instead of regs.
    ; If you want this to work on an array, put this store in the loop
    ; and use an indexed addressing mode for eax and edx (or increment pointers)
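For readers who would rather experiment from a compiler than from hand-written asm, here is a rough sketch of the same loop using C SSE2 intrinsics. It assumes an x86 target with SSE2. The function name convert_counting and its parameters are hypothetical, chosen for illustration. The counters stay in a vector register and are bumped with paddd, so the only memory traffic per iteration is the store of the result.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Converts `iterations` successive groups of four counters to float,
   scales each group, and stores it to dst[0..3]; the last group written
   is what remains in dst. dst must be 16-byte aligned, as with movaps. */
static void convert_counting(const int32_t start[4], float scale,
                             uint32_t iterations, float *dst) {
    assert(((uintptr_t)dst & 15) == 0);

    __m128i counters = _mm_loadu_si128((const __m128i *)start);
    const __m128i ones = _mm_set1_epi32(1);    /* { 1, 1, 1, 1 } */
    const __m128 vscale = _mm_set1_ps(scale);  /* scale in all four lanes */

    for (uint32_t n = 0; n < iterations; ++n) {
        __m128 f = _mm_cvtepi32_ps(counters);      /* cvtdq2ps */
        _mm_store_ps(dst, _mm_mul_ps(f, vscale));  /* mulps + movaps */
        counters = _mm_add_epi32(counters, ones);  /* paddd, register only */
    }
}
```

An optimizing compiler is free to drop the stores that are overwritten on the next iteration, so treat this as an illustration of the data flow rather than a ready-made benchmark.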

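If this is for a function that won't be called in a loop, then setting up the scale vector for mulps is different. Ideally, the scale arg would be passed in the low element of a vector register, and you would broadcast it from there with shufps or something similar. If Delphi forces it to be in memory pointed to by a GP register, then I guess movss first. If it's a compile-time constant, using a 16B vector constant as a memory operand to mulps is probably the way to go. Core2 and later take only a single cycle for 128b loads. (It does need to be aligned, though, for non-AVX vector code on older CPUs.)

Anyway, I think the main factor making your benchmark slow was the memory accesses, especially the writes. Only one store per cycle is possible. If Delphi can't pass float args in registers, that sucks.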
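As a sketch of the non-looping case, the following C function (SSE2 intrinsics, x86 assumed) broadcasts a scale value that arrives through a pointer, using the movss-then-shufps sequence, and converts a short array such as the 6 to 8 integers in the question. The names broadcast_scale and convert_scaled are hypothetical, and unaligned loads and stores are used so the caller needs no special allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Broadcasts a scalar read through a pointer: movss, then shufps with an
   immediate of 0 to copy lane 0 into all four lanes. */
static __m128 broadcast_scale(const float *scale) {
    __m128 s = _mm_load_ss(scale);    /* { *scale, 0, 0, 0 } */
    return _mm_shuffle_ps(s, s, 0);   /* { *scale, *scale, *scale, *scale } */
}

/* Converts count integers to scaled floats: four at a time with SSE,
   then a scalar tail for the remaining 0 to 3 elements. */
static void convert_scaled(const int32_t *src, float *dst, size_t count,
                           const float *scale) {
    assert(src != NULL && dst != NULL && scale != NULL);
    const __m128 vscale = broadcast_scale(scale);
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_ps(dst + i, _mm_mul_ps(_mm_cvtepi32_ps(v), vscale));
    }
    for (; i < count; ++i) {
        dst[i] = (float)src[i] * *scale;  /* scalar tail */
    }
}
```

With 6 inputs this does one vector step and two scalar steps; with 8 inputs it is two vector steps and no tail.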
