Question

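In the Delphi math.pas unit there is a procedure DivMod that I want to convert to inline and optimize for a divisor that is always 10. But I don't know the specifics of Pentium-class x86 assembly. What would the conversion of the procedure below look like?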

 procedure DivMod(Dividend: Integer; Divisor: Word;
  var Result, Remainder: Word);
asm
        PUSH    EBX
        MOV     EBX,EDX
        MOV     EDX,EAX
        SHR     EDX,16
        DIV     BX
        MOV     EBX,Remainder
        MOV     [ECX],AX
        MOV     [EBX],DX
        POP     EBX
end;

Answer 1

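By far the most important optimization you can make is to use a fixed-point multiplicative inverse for division by a compile-time constant: Why does GCC use multiplication by a strange number in implementing integer division?.

Any decent C compiler will do this for you, but apparently Delphi doesn't, so doing it in asm makes sense.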
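As a minimal sketch of the idea in portable C (the function name is made up for illustration, and a 64-bit integer type is assumed to be available): 0xCCCCCCCD is ceil(2^35 / 10), so multiplying by it and keeping the bits above position 35 gives the quotient.

```c
#include <stdint.h>

/* Hypothetical helper name, for illustration only.
   0xCCCCCCCD == ceil(2^35 / 10), so (x * 0xCCCCCCCD) >> 35 == x / 10
   for 32-bit unsigned x. */
uint32_t div10_mulinv(uint32_t x)
{
    uint64_t product = (uint64_t)x * 0xCCCCCCCDu;  /* full 32x32 -> 64-bit multiply */
    return (uint32_t)(product >> 35);              /* high half, then 3 more bits */
}
```

This is the same computation as the `mul` / `shr 3` sequence in the compiler output further down: taking EDX is a shift by 32, and the extra `shr` by 3 makes 35 in total.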

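Can you return a value in EAX instead of storing both the quotient and the remainder to memory? Passing 2 pointer args seems wasteful, and it forces the caller to reload the values from memory. (Update: yes, I think you can by making it a function instead of a procedure; I'm just blindly modifying Delphi code from other answers, though.)

Anyway, fortunately we can get a C compiler to do the hard work of figuring out the multiplicative inverse and shift count for us. We can even get it to use the same "calling convention" Delphi uses for inline asm. GCC's regparm=3 32-bit calling convention passes args in EAX, EDX, and ECX, in that order.

You might want a separate version for cases where you only need the quotient, because (unlike with the slow div instruction) you have to compute the remainder separately as x - (x/y)*y when you use a fast multiplicative inverse. But yes, even with the remainder it's still about 2x to 4x faster on modern x86.

Or you could leave the remainder calculation to pure Delphi, unless the compiler is just bad at optimizing in general.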
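A rough C sketch of the remainder step, assuming the same 0xCCCCCCCD inverse (the function name and signature are illustrative, not taken from any library):

```c
#include <stdint.h>

/* Sketch: quotient from the multiplicative inverse, remainder from one
   extra multiply-and-subtract. Returns the quotient, stores the remainder. */
uint32_t divmod10_sketch(uint32_t x, uint32_t *rem)
{
    uint32_t q = (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
    *rem = x - q * 10u;   /* x - (x/10)*10, no second division needed */
    return q;
}
```

The `q * 10u` is what the compiler turns into `lea` (times 5) plus `add` (times 2) in the asm below.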

#ifdef _MSC_VER
#define CONVENTION  _fastcall   // not the same, but 2 register args are better than none.
#else
#define CONVENTION __attribute__((regparm(3)))
#endif

// use gcc -Os to get it to emit code with actual div.

unsigned CONVENTION divmod10(unsigned x, unsigned *quot, unsigned *rem) {
    unsigned tmp = x/10;
    // *quot = tmp;
    *rem = x%10;
    return tmp;
}

From the Godbolt compiler explorer:

# gcc8.2  -O3 -Wall -m32
div10:    # simplified version without the remainder, returns in EAX
        mov     edx, -858993459     # 0xCCCCCCCD
        mul     edx                 # EDX:EAX = dividend * 0xCCCCCCCD
        mov     eax, edx
        shr     eax, 3
        ret
      # quotient in EAX

# returns quotient in EAX, stores remainder to [ECX]
# quotient pointer in EDX is unused (and destroyed).
divmod10:
        mov     edx, -858993459
        push    ebx
        mov     ebx, eax
        mul     edx                      # EDX:EAX = dividend * 0xCCCCCCCD
        mov     eax, edx
        shr     eax, 3
        # quotient in EAX = high_half(product) >> 3 = product >> (32+3)
        lea     edx, [eax+eax*4]         # EDX = quotient*5
        add     edx, edx                 # EDX = quot * 10
        sub     ebx, edx                 # remainder = dividend - quot*10
        mov     DWORD PTR [ecx], ebx     # store remainder
        pop     ebx
        ret
        # quotient in EAX

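This is C compiler output. Adapt it to Delphi inline asm as necessary; I think the inputs are in the right registers for Delphi.

If Delphi inline asm doesn't let you clobber EDX, you can save/restore it. Or, if you want to remove the unused quotient pointer input, you can tweak the asm, or tweak the C on Godbolt and look at the new compiler output.

This is more instructions than div, but div is very slow (10 uops, and 26 cycles of latency even on Skylake).

If you have a 64-bit integer type in Delphi, you can do this in Delphi source and avoid inline asm. Or, as MBo shows, you can use $CCCD as a multiplicative inverse for inputs in the range 0..2^16-1 using only 32-bit integer types.

For the remainder, a store/reload round trip (4 to 5 cycles) has latency similar to the actual calculation on recent Intel CPUs with mov-elimination (3 + 1 for the quotient, + 3 more for lea/add/sub = 7), so having to use inline asm for this is pretty unfortunate. But it's still better than a div instruction for both latency and throughput. See https://agner.org/optimize/ and other performance links in the x86 tag wiki.

Delphi version you can copy/paste

(If I got this right. I don't know Delphi; I just copied and modified examples on SO and this site, based on what I infer about the calling convention and syntax.)

I'm not sure I got the arg-passing right for inline asm. This RADStudio documentation says "Except for ESP and EBP, an asm statement can assume nothing about register contents on entry to the statement." But I'm assuming the args are in EAX and EDX.

Using asm for 64-bit code may be silly, because in 64-bit mode you can efficiently use pure Pascal for 64-bit multiplies. See How do I implement an efficient 32 bit DivMod in 64 bit code. So in the {$IFDEF CPUX64} blocks, the best choice may be pure Pascal using UInt64(3435973837)*num;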

function Div10(Num: Cardinal): Cardinal;
{$IFDEF PUREPASCAL}
begin
  Result := Num div 10;
end;
{$ELSE !PUREPASCAL}
{$IFDEF CPUX86}
asm
        MOV     EDX, $CCCCCCCD
        MUL     EDX                   // EDX:EAX = Num * fixed-point inverse
        MOV     EAX,EDX               // mov then overwrite is ideal for Intel mov-elimination
        SHR     EAX,3
end;
{$ENDIF CPUX86}
{$IFDEF CPUX64}
asm
         // TODO: use pure pascal for this; Uint64 is efficient on x86-64
        // Num in ECX, upper bits of RCX possibly contain garbage?
        mov     eax, ecx              // zero extend Num into RAX
        mov     ecx, $CCCCCCCD        // doesn't quite fit in a sign-extended 32-bit immediate for imul
        imul    rax, rcx              // RAX = Num * fixed-point inverse
        shr     rax, 35               // quotient = eax
end;
{$ENDIF CPUX64}
{$ENDIF}

 {Remainder is the function return value}
function DivMod10(Num: Cardinal; var Quotient: Cardinal): Cardinal;
{$IFDEF PUREPASCAL}
begin
  Quotient := Num div 10;
  Result := Num mod 10;
end;
{$ELSE !PUREPASCAL}
{$IFDEF CPUX86}
asm
    // Num in EAX,  @Quotient in EDX
    push    esi
    mov     ecx, edx           // save @quotient
    mov     edx, $CCCCCCCD
    mov     esi, eax           // save dividend for use in remainder calc
    mul     edx                // EDX:EAX = dividend * 0xCCCCCCCD
    shr     edx, 3             // EDX = quotient
    mov     [ecx], edx         // store quotient into @quotient

    lea     edx, [edx + 4*edx] // EDX = quot * 5
    add     edx, edx           // EDX = quot * 10
    mov     eax, esi                  // off the critical path
    sub     eax, edx           // Num - (Num/10)*10
    pop     esi
    // Remainder in EAX = return value
end;
{$ENDIF CPUX86}
{$IFDEF CPUX64}
asm
        // TODO: use pure pascal for this?  Uint64 is efficient on x86-64
    // Num in ECX,   @Quotient in RDX
    mov     r8d, ecx          // zero-extend Num into R8
    mov     eax, $CCCCCCCD
    imul    rax, r8
    shr     rax, 35           // quotient in eax

    lea     ecx, [rax + 4*rax]
    add     ecx, ecx          // ecx = 10*(Num/10)
    mov     [rdx], eax        // store quotient

    mov     eax, r8d          // copy Num again
    sub     eax, ecx          // remainder = Num - 10*(Num/10)
    // we could have saved 1 mov instruction by returning the quotient
    // and storing the remainder.  But this balances latency better.
end;
{$ENDIF CPUX64}
{$ENDIF}

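Storing the quotient and returning the remainder means both may be ready at about the same time in the caller, because the extra latency of calculating the remainder from the quotient overlaps with the store-forwarding. IDK if that's good, or if letting out-of-order execution get started on some work based on the quotient would more often be better. I'm assuming that if you call DivMod10, you might only want the remainder.

But in a decimal-digit splitting loop that repeatedly divides by 10, it's the quotient that forms the critical path, so a version that returns the quotient and stores the remainder would be a much better choice there.

In that case you'd make the quotient the return value in EAX, and rename the function arg to remainder.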
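To illustrate the digit-splitting case, here is a hedged C sketch of such a loop. The names `divmod10_q` and `u32_to_dec` are hypothetical; the point is only that each iteration's input is the previous quotient, while the remainder just gets stored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant: returns the quotient, stores the remainder. */
uint32_t divmod10_q(uint32_t x, uint32_t *rem)
{
    uint32_t q = (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
    *rem = x - q * 10u;
    return q;
}

/* Writes the decimal digits of x into buf (needs at least 11 bytes,
   including the terminating NUL) and returns the number of digits. */
size_t u32_to_dec(uint32_t x, char *buf)
{
    char tmp[10];                 /* a 32-bit value has at most 10 digits */
    size_t n = 0;
    do {
        uint32_t r;
        x = divmod10_q(x, &r);    /* next iteration depends only on the quotient */
        tmp[n++] = (char)('0' + r);
    } while (x != 0);
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];  /* digits were produced least-significant first */
    buf[n] = '\0';
    return n;
}
```

The loop-carried dependency chain goes through the multiply and shift only; the lea/add/sub for the remainder hangs off the side of it.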

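The x86-64 asm is based on clang output for this version of the C function (https://godbolt.org/z/qu2kvV), targeting the Windows x64 calling convention, but with some tweaks to make it more efficient: taking a mov off the critical path, using different registers to avoid REX prefixes, and replacing one LEA with a simple ADD.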

unsigned divmod10(unsigned x, unsigned *quot) {
    unsigned qtmp = x/10;
    unsigned rtmp = x%10;
     *quot = qtmp;
     //*rem = rtmp;
    return rtmp;
}

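I used clang's version instead of gcc's because imul r64,r64 is faster on Intel CPUs and Ryzen (3 cycle latency / 1 uop). mul r32 is 3 uops, with a throughput of only 1 per 2 clocks on Sandybridge-family. I think the multiply hardware naturally produces a 128-bit result, and splitting the low 64 bits of that into edx:eax takes an extra uop, or something like that.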

Answer 2

Building on this answer, you can get better performance in 32-bit compiles by using SSE to reach the hardware 32x32 -> 64-bit multiply:

program Project1;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

procedure DivMod10(num : Cardinal; var q, r : Cardinal);
const
  m : cardinal = 3435973837;
asm
  movd xmm0, m         {move magic number to xmm0}
  movd xmm1, eax       {move num to xmm1}
  pmuludq xmm0, xmm1   {xmm0[0:32] * xmm1[0:32] -> xmm0[0:64] unsigned}
  psrlq xmm0, 35       {right shift xmm0}
  movss [edx], xmm0    {store quotient to q}
  movd edx, xmm0       {recycle edx, store q}
  imul edx, -$A        {edx = q * (-10)}
  add edx, eax         {edx = r}
  mov [ecx], edx       {store r}
end;

var
  q, r, t0, i : cardinal;
begin
  t0 := GetTickCount;
  for I := 1 to 999999999 do DivMod10(i, q, r);
  WriteLn('SSE ASM : ' + IntToStr(GetTickCount - t0));

  t0 := GetTickCount;
  for I := 1 to 999999999 do q := i div 10;
  WriteLn('div : ' + IntToStr(GetTickCount - t0));

  WriteLn('Test correctness...');
  for I := 1 to High(Cardinal) do begin
    DivMod10(i,q,r);
    if (q <> (i div 10)) or (r <> (i mod 10)) then
      WriteLn('Incorrect Result : ' + IntToStr(i));
  end;

  WriteLn('Test complete.');
  Readln;
end.

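This produces:

SSE ASM : 2449
div : 3401
Test correctness...
Test complete.

This is not safe in general, because you should check at runtime that the CPU supports the required SSE instructions (and fall back to a purepascal alternative if it doesn't), but it is increasingly hard to find working CPUs old enough not to support at least SSE2.

For systems that support it, this can be faster than div (for example, I get about a 25% performance benefit using DivMod10 on Haswell), and you get the remainder as well. It's not as fast as a native 64-bit IMUL, but still quite useful.

To address Peter's comments, consider a pure x86 version: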

procedure DivMod10(num : Cardinal; var q, r : Cardinal);
const
  m : cardinal = 3435973837;
asm
  push eax
  push edx
  mul m
  mov eax, edx
  shr eax, 3
  pop edx
  mov [edx], eax
  pop eax
  imul edx, [edx], -$A
  add edx, eax
  mov [ecx], edx
end;

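This produces (for me, on a Haswell i7):

x86 ASM : 2948
div : 3401
Test correctness...
Test complete.

About 18% slower than the SSE version.

With some good ideas from Peter, we can optimize the pure x86 version further by converting it to a function, saving a register, and replacing the imul with lea and add: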

function DivMod10(Num: Cardinal; var Quotient: Cardinal): Cardinal;
const
  m : cardinal = 3435973837;
asm
  mov ecx, eax           {save num to ecx}
  push edx               {save quotient pointer}
  mul m                  {edx:eax = m*Num}
  shr edx, 3             {edx = quotient}
  pop eax                {restore quotient pointer}
  mov [eax], edx         {store quotient}
  mov eax, ecx           {restore num to eax}
  lea ecx, [edx +4*edx]  {ecx = q*5}
  add ecx, ecx           {ecx = q*10}
  sub eax, ecx           {return remainder in eax}
end;

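This brings the execution time (same conditions as above) down to 2637 ms, still not quite as fast as the SSE version. The change from imul to lea is a minor optimization that favors latency over throughput; it can be applied to any of the algorithms, depending on the final use environment.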

Answer 3

The gain does exist, but is it significant for real tasks? (Note that I've changed the argument types.)

procedure DivMod10(Dividend: DWord; var Result, Remainder: DWord);
asm
        PUSH    EDI
        PUSH    ESI
        MOV     EDI, EDX
        MOV     ESI, 10
        XOR     EDX, EDX
        DIV     ESI
        MOV     [ECX], EDX
        MOV     [EDI], EAX
        POP     ESI
        POP     EDI
end;

  1 000 000 000 iterations
  divmod10: 4539
  math.divmod: 7145

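Using multiplication as @Peter Cordes advised, over a limited range, plain Delphi code is the fastest approach. The assembler version is slower (1777), perhaps because of the function call and my weak assembler experience.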

  b := a * $CCCD;
  b := b shr 19;   //result
  c := a - b * 10;  //remainder
  1 000 000 000 iterations: 1200 ms  (560 ms without remainder)
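For reference, the same limited-range trick as a C sketch (the function name is illustrative). 0xCCCD is ceil(2^19 / 10), and for inputs up to 65535 the product still fits in 32 bits, so no 64-bit type is needed:

```c
#include <stdint.h>

/* Valid for 0 <= a <= 65535 only: 65535 * 0xCCCD still fits in 32 bits.
   Returns the quotient and stores the remainder. */
uint32_t divmod10_u16(uint32_t a, uint32_t *rem)
{
    uint32_t q = (a * 0xCCCDu) >> 19;   /* 0xCCCD == ceil(2^19 / 10) */
    *rem = a - q * 10u;
    return q;
}
```

The range is small enough to check exhaustively against the built-in division.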

Using the constant from this SO answer makes it possible to get rid of the shift, but the timing is worse than Peter's and J...'s versions:

function DM10(Dividend: DWord;  var Remainder: DWord): DWord;
asm
   push ebx
   mov ebx, eax
   mov ecx, edx
   mov edx, 1999999Ah
   mul eax, edx
   push edx
   lea eax, [edx+edx*4]
   add eax, eax
   sub ebx, eax
   mov [ecx], ebx
   pop eax
   pop ebx
end;

Timings for my machine (10^9 iterations, haswell i5-4670):
this  DM10               2013
Peter Cordes DivMod10    1755
J... SSE version         1685
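One caveat worth checking before relying on the shift-free constant: $1999999A is ceil(2^32 / 10), and working through the arithmetic, the high half of the product only matches `div 10` for inputs up to about 2^30. The 10^9-iteration benchmark stays below that, so it would not show a mismatch. A small C sketch (hypothetical names) to probe individual values:

```c
#include <stdint.h>

/* High 32 bits of x * 0x1999999A, i.e. what EDX holds after the MUL. */
uint32_t div10_noshift(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0x1999999Au) >> 32);
}

/* Returns 1 if the shift-free result equals x / 10, 0 otherwise. */
int div10_noshift_ok(uint32_t x)
{
    return div10_noshift(x) == x / 10u;
}
```

By this calculation, `div10_noshift_ok(999999999)` holds, while 1073741829 is the first input where the result comes out one too high. The $CCCCCCCD constant with the extra shift by 3 covers the full 32-bit range.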

Answer 4

OK, here is my attempt:

procedure DivMod10(Num: Cardinal; var Quotient, Remainder: Cardinal);
asm
        PUSH    ESI
        PUSH    EDI
        MOV     EDI,EAX          // Num
        MOV     ESI,EDX          // @Quotient
        MOV     EDX,$CCCCCCCD    
        MUL     EDX              // EDX:EAX = EAX * magic_number
        SHR     EDX,3
        MOV     [ESI],EDX        // --> @Quotient
        LEA     EDX,[EDX+4*EDX]
        ADD     EDX,EDX          // Quotient * 10 
        SUB     EDI,EDX          // Num - Quotient*10
        MOV     [ECX],EDI        // --> @Remainder 
        POP     EDI
        POP     ESI
end;

If you don't need the remainder:

function Div10(Num: Cardinal): Cardinal;
asm
        MOV     ECX,$CCCCCCCD    
        MUL     ECX   
        SHR     EDX,3
        MOV     EAX,EDX
end;

How to optimize DivMod for a constant divisor of 10
