Question

Recently I was trying to write a program to compute (a * b) % m, where 0 <= a, b, m <= 2^63 - 1. Luckily, I know that GCC supports __int128_t, so I ended up with the following program.

#include <stdint.h>

int64_t multimod(int64_t a, int64_t b, int64_t m)
{
  __int128_t ab = (__int128_t)a * b;
  ab %= m;
  return ab;
}

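But I want to do this without __int128_t, both to challenge myself and to make the function more efficient. I decided to start by tracing what the compiled code for this function does, so I used objdump and got the following disassembly for multimod.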

int64_t multimod(int64_t a, int64_t b, int64_t m)
{
 720:   55                      push   %rbp
 721:   49 89 d1                mov    %rdx,%r9 
 724:   49 89 f8                mov    %rdi,%r8
 727:   49 c1 f8 3f             sar    $0x3f,%r8
 72b:   48 89 f0                mov    %rsi,%rax
 72e:   48 c1 f8 3f             sar    $0x3f,%rax
 732:   4c 89 c2                mov    %r8,%rdx
 735:   48 0f af d6             imul   %rsi,%rdx
 739:   48 0f af c7             imul   %rdi,%rax
 73d:   49 89 c0                mov    %rax,%r8 
 740:   49 01 d0                add    %rdx,%r8 
 743:   48 89 f8                mov    %rdi,%rax
 746:   48 f7 e6                mul    %rsi
 749:   48 89 c7                mov    %rax,%rdi
 74c:   49 8d 34 10             lea    (%r8,%rdx,1),%rsi
 750:   4c 89 c9                mov    %r9,%rcx
 753:   48 c1 f9 3f             sar    $0x3f,%rcx
 757:   4c 89 ca                mov    %r9,%rdx
 75a:   e8 61 00 00 00          callq  7c0 <__modti3>
 75f:   5d                      pop    %rbp
 760:   c3                      retq

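I analyzed the whole listing and think it can be divided into two parts: 1. getting the correct 128-bit product of the 64-bit variables a and b; 2. the call to __modti3.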

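I searched the web and found that the prototype of __modti3 is long long __modti3(long long a, long long b). But the assembly code doesn't match that. When __modti3 is called, the first argument register %rdi contains the low 64 bits of the product of a and b, the second argument register %rsi contains the high 64 bits of the product of a and b, and the third argument register %rdx contains m. So what does __modti3 do to get the correct answer?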

Answer 1

No, long long is 64 bits. You can see gcc passing the __modti3 args in rdi, rsi, rdx, and rcx (i.e. the first 4 arg-passing slots in the x86-64 SysV ABI).

So that's two 128-bit operands, passed by value in register pairs: rsi:rdi and rcx:rdx.
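To make the register-pair layout concrete, here is a minimal sketch that rebuilds the two 128-bit operands from four 64-bit halves in the order the disassembly sets them up. The helper names join_halves and mod_from_regs are hypothetical, and the sketch assumes GCC or Clang on x86-64, where __int128 is available.

```c
#include <stdint.h>

/* Hypothetical helper: rebuild a signed 128-bit value from the two
   64-bit halves of a register pair (low half first, high half second). */
static __int128 join_halves(uint64_t lo, uint64_t hi)
{
    unsigned __int128 u = ((unsigned __int128)hi << 64) | lo;
    return (__int128)u;   /* GCC and Clang reduce modulo 2^128 here */
}

/* Mirrors the argument registers at the call site in the question:
   rdi = low half of a*b,  rsi = high half of a*b,
   rdx = low half of m,    rcx = high half of m (the sign fill).
   The % on __int128 operands is the operation that GCC implements
   with a call to __modti3 on x86-64. */
int64_t mod_from_regs(uint64_t rdi, uint64_t rsi, uint64_t rdx, uint64_t rcx)
{
    __int128 first  = join_halves(rdi, rsi);
    __int128 second = join_halves(rdx, rcx);
    return (int64_t)(first % second);
}
```

For example, mod_from_regs(0, 1, 10, 0) treats rsi:rdi as 2^64 and rcx:rdx as 10, so it computes 2^64 % 10.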

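It's actually __int128 __modti3(__int128 dividend, __int128 divisor); which is the whole point and the reason it exists: x86-64 already has long long % long long remainder in hardware, with idiv r64, which gcc uses for runtime-variable divisors/moduli.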
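As a rough illustration of that split, the two functions below differ only in operand width. This is a sketch assuming GCC targeting x86-64; the names rem64 and rem128 are made up for the example. Compiling it and looking at the asm should show an idiv for the first function and a call to __modti3 for the second.

```c
#include <stdint.h>

/* 64-bit remainder: x86-64 has a hardware instruction for this
   (idiv r64), so no library call is needed. */
int64_t rem64(int64_t a, int64_t b)
{
    return a % b;
}

/* 128-bit remainder: there is no single instruction for it, so the
   compiler falls back to a helper function (__modti3 with GCC on
   x86-64). The result follows C's % rules: it takes the sign of the
   first operand. */
__int128 rem128(__int128 a, __int128 b)
{
    return a % b;
}
```

Both follow the usual C semantics for %, so rem64(-7, 3) and rem128(-7, 3) both give -1.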

Note that your function sign-extends m from rdx into rcx:rdx with

mov    %r9, %rcx        # originally from RDX on entry; you didn't enable full optimization
sar    $63, %rcx        # copy sign bit to all bit positions.

This is exactly what cqo (AT&T cqto) does to sign-extend RAX into RDX:RAX.
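The same sign extension can be written out in C. This is only a sketch with hypothetical names (sign_fill, sign_extend_matches), again assuming a compiler that provides __int128: it builds the high half by hand and checks that the resulting pair equals what a plain cast to __int128 gives.

```c
#include <stdint.h>

/* The value that sar $63 leaves in the high register: all ones if m
   is negative, zero otherwise. Written as a comparison because >> on
   a negative signed value is implementation-defined in C. */
uint64_t sign_fill(int64_t m)
{
    return m < 0 ? UINT64_MAX : 0;
}

/* Returns 1 if the pair (sign_fill(m) : m) has the same 128-bit
   pattern as (__int128)m, i.e. the manual sign extension is right. */
int sign_extend_matches(int64_t m)
{
    unsigned __int128 pair =
        ((unsigned __int128)sign_fill(m) << 64) | (uint64_t)m;
    return pair == (unsigned __int128)(__int128)m;
}
```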

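BTW, the code is easier to read if you enable full optimization with -O3. Then you get just a single multiply instruction, taking 64-bit inputs and producing a 128-bit output. https://gcc.godbolt.org/z/0gKc5d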

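Compiling with -O1 or -Og is sometimes more helpful if you want asm that looks more like the source, but since C doesn't have a widening multiply operator, you don't actually want that here. You want the compiler to optimize away the widening of the inputs and turn the multiply into a single widening multiply, instead of sign-extending the inputs into register pairs and doing a 128x128 => 128-bit multiply (which is what happens in the code you showed).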
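To see why the two forms agree, here is a sketch that follows the steps of the unoptimized listing in C and compares the result with a plain widening multiply. The pair128 type and the function names are hypothetical, and the unsigned __int128 multiply in the first function only stands in for the one-operand mul instruction.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } pair128;   /* hypothetical type */

/* Follows the unoptimized asm: sign-extend both inputs to 128 bits,
   then keep the low 128 bits of the 128x128 product. */
pair128 mul_like_unoptimized_asm(int64_t a, int64_t b)
{
    uint64_t a_lo = (uint64_t)a, b_lo = (uint64_t)b;
    uint64_t a_hi = a < 0 ? UINT64_MAX : 0;   /* sar $63 on a */
    uint64_t b_hi = b < 0 ? UINT64_MAX : 0;   /* sar $63 on b */

    /* Two imul plus an add; wraps modulo 2^64 like the registers do. */
    uint64_t cross = a_hi * b_lo + b_hi * a_lo;

    /* mul: unsigned 64x64 => 128-bit product of the low halves. */
    unsigned __int128 low = (unsigned __int128)a_lo * b_lo;

    pair128 r;
    r.lo = (uint64_t)low;
    r.hi = (uint64_t)(low >> 64) + cross;     /* the lea */
    return r;
}

/* What the source asks for: one signed 64x64 => 128-bit multiply. */
pair128 mul_widening(int64_t a, int64_t b)
{
    unsigned __int128 p = (unsigned __int128)((__int128)a * b);
    pair128 r = { (uint64_t)p, (uint64_t)(p >> 64) };
    return r;
}

/* Returns 1 when both methods give the same 128-bit pattern. */
int same_product(int64_t a, int64_t b)
{
    pair128 x = mul_like_unoptimized_asm(a, b);
    pair128 y = mul_widening(a, b);
    return x.lo == y.lo && x.hi == y.hi;
}
```

The cross terms only affect the high half, which is why a single widening multiply instruction can replace the whole sequence when the inputs are known to be sign-extended 64-bit values.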

What does __modti3 do?
