Question

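I want my C function to efficiently compute the high 64 bits of the product of two 64-bit signed integers. I know how to do this in x86-64 assembly, using imulq and pulling the result out of %rdx. But I'm at a loss as to how to write this in C at all, let alone coax the compiler into doing it efficiently.

Does anyone have suggestions for writing this in C? This is performance sensitive, so "manual methods" (like Russian peasant multiplication or bignum libraries) are out.

This dorky inline assembly function I wrote works, and is roughly the codegen I'm after: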

static long mull_hi(long inp1, long inp2) {
    long output = -1;
    __asm__("movq %[inp1], %%rax;"
            "imulq %[inp2];"
            "movq %%rdx, %[output];"
            : [output] "=r" (output)
            : [inp1] "r" (inp1), [inp2] "r" (inp2)
            :"%rax", "%rdx");
    return output;
}

Answer 1

If you're using a relatively recent GCC on x86_64:

#include <stdint.h>

int64_t mulHi(int64_t x, int64_t y) {
    return (int64_t)((__int128_t)x*y >> 64);
}

At -O1 and higher, this compiles to what you want:

_mulHi:
0000000000000000    movq    %rsi,%rax
0000000000000003    imulq   %rdi
0000000000000006    movq    %rdx,%rax
0000000000000009    ret

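I believe Clang also supports the __int128_t type, so this should work there too. MSVC does not provide a 128-bit integer type, but on x64 its __mulh intrinsic (from <intrin.h>) returns the same high half. The usual caveat about trying it yourself applies.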
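To make the compiler dependence explicit, here is a minimal sketch of a guarded version. The name mul_hi_guarded is made up for illustration. It assumes that GCC and Clang define __SIZEOF_INT128__ whenever __int128 is usable, and that right-shifting a negative __int128 is an arithmetic shift (which GCC documents, but the C standard leaves implementation-defined).

```c
#include <stdint.h>

#if defined(__SIZEOF_INT128__)
/* GCC and Clang define __SIZEOF_INT128__ when __int128 is available. */
static inline int64_t mul_hi_guarded(int64_t x, int64_t y) {
    /* Widen first so the multiply is 64x64 -> 128, then keep the top half. */
    return (int64_t)(((__int128)x * y) >> 64);
}
#elif defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
/* MSVC has no __int128; __mulh returns the high 64 bits of the signed product. */
static inline int64_t mul_hi_guarded(int64_t x, int64_t y) {
    return __mulh(x, y);
}
#else
#error "no 128-bit multiply available; use a portable fallback instead"
#endif
```

On other compilers the #error makes the missing support visible at build time instead of silently producing something slow or wrong.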

Answer 2

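The general answer is that x * y can be broken down into (a + b) * (c + d), where a and c are the high-order parts.

First, expand to ac + ad + bc + bd

Now, you multiply the terms as 32-bit numbers stored as long long (or better yet, uint64_t), and you just remember that when you multiplied a higher-order part, you need to scale by 32 bits. Then you do the adds, remembering to detect carry. Keep track of the sign. Naturally, you need to do the adds in pieces.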
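Spelled out with the scaling made explicit, the decomposition looks roughly like this. It is written for unsigned magnitudes, with a, c the high 32-bit halves and b, d the low halves; the sign is handled separately.

```latex
\begin{aligned}
x &= a\,2^{32} + b, \qquad y = c\,2^{32} + d, \qquad 0 \le a, b, c, d < 2^{32} \\
xy &= ac\,2^{64} + (ad + bc)\,2^{32} + bd \\
\left\lfloor \frac{xy}{2^{64}} \right\rfloor
   &= ac + \left\lfloor \frac{ad}{2^{32}} \right\rfloor
         + \left\lfloor \frac{bc}{2^{32}} \right\rfloor
         + \left\lfloor \frac{(ad \bmod 2^{32}) + (bc \bmod 2^{32})
             + \lfloor bd / 2^{32} \rfloor}{2^{32}} \right\rfloor
\end{aligned}
```

The last term is the carry out of the middle column, which is why the adds have to be done in pieces.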

For code implementing the above, please see my other answer.

Answer 3

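With regard to your assembly solution, don't hard-code the mov instructions! Let the compiler do it for you. Here's a modified version of your code: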

static long mull_hi(long inp1, long inp2) {
    long output;
    /* imulq also writes the low half to %rax, so inp1 is an in/out ("+a") operand */
    __asm__("imulq %2"
            : "=d" (output), "+a" (inp1)
            : "r" (inp2)
            : "cc");
    return output;
}

Helpful reference: Machine Constraints

Answer 4

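Since you did a pretty good job solving your own problem with the machine code, I figured you deserved some help with the portable version. I would leave an ifdef in so that you just use the assembly when building with GNU tools on x86.

Anyway, here is an implementation based on my general answer. I'm pretty sure this is correct, but no guarantees; I just banged it out last night. You should probably get rid of the statics positive_result[] and result_negative; those are just artifacts of my unit test.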

#include <stdlib.h>
#include <stdio.h>

// fixed-width integer types; redefining int64_t by hand can clash with
// the definitions that system headers already provide

#include <stdint.h>

#define B32 0xffffffffUL

static uint64_t positive_result[2]; // used for testing
static int result_negative;         // used for testing

static void mixed(uint64_t *result, uint64_t innerTerm)
{
  // the high part of innerTerm is actually the easy part

    result[1] += innerTerm >> 32;

  // the low order a*d might carry out of the low order result

    uint64_t was = result[0];

    result[0] += (innerTerm & B32) << 32;

    if (result[0] < was) // carry!
      ++result[1];
}


static uint64_t negate(uint64_t *result)
{
  uint64_t t = result[0] = ~result[0];
  result[1] = ~result[1];
  if (++result[0] < t)
    ++result[1];
  return result[1];
}

uint64_t higherMul(int64_t sx, int64_t sy)
{
    uint64_t x, y, result[2] = { 0 }, a, b, c, d;

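  // magnitudes in unsigned arithmetic; llabs(INT64_MIN) would be undefined

    x = sx < 0 ? 0 - (uint64_t)sx : (uint64_t)sx;
    y = sy < 0 ? 0 - (uint64_t)sy : (uint64_t)sy;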

    a = x >> 32;
    b = x & B32;
    c = y >> 32;
    d = y & B32;

  // the highest and lowest order terms are easy

    result[1] = a * c;
    result[0] = b * d;

  // now have the mixed terms ad + bc to worry about

    mixed(result, a * d);
    mixed(result, b * c);

  // now deal with the sign

    positive_result[0] = result[0];
    positive_result[1] = result[1];
    result_negative = (sx < 0) ^ (sy < 0);
    return result_negative ? negate(result) : result[1];
}
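For comparison, here is a sketch of a different way to handle the sign. It is not the method used in the code above: instead of taking absolute values and negating a 128-bit result, it computes the unsigned high half and then applies a two's-complement correction. The names umul_hi and smul_hi are made up for illustration, and the final cast assumes the usual wrapping conversion from uint64_t to int64_t (implementation-defined before C23, but what mainstream compilers do).

```c
#include <stdint.h>

/* High 64 bits of the unsigned 64x64 product, built from 32-bit halves. */
static uint64_t umul_hi(uint64_t x, uint64_t y) {
    uint64_t a = x >> 32, b = x & 0xffffffffu;
    uint64_t c = y >> 32, d = y & 0xffffffffu;

    uint64_t ac = a * c;
    uint64_t ad = a * d;
    uint64_t bc = b * c;
    uint64_t bd = b * d;

    /* Middle column: three terms, each below 2^32, so the sum cannot overflow. */
    uint64_t mid = (bd >> 32) + (ad & 0xffffffffu) + (bc & 0xffffffffu);

    /* High half = ac + high parts of the cross terms + carry out of mid. */
    return ac + (ad >> 32) + (bc >> 32) + (mid >> 32);
}

/* Signed high half via the identity (all arithmetic mod 2^64):
   hi_signed = hi_unsigned - (x < 0 ? y : 0) - (y < 0 ? x : 0) */
static int64_t smul_hi(int64_t x, int64_t y) {
    uint64_t ux = (uint64_t)x, uy = (uint64_t)y;
    uint64_t hi = umul_hi(ux, uy);
    if (x < 0) hi -= uy;
    if (y < 0) hi -= ux;
    return (int64_t)hi;  /* assumes wrapping conversion, see note above */
}
```

This avoids the INT64_MIN corner case entirely, since nothing is ever negated in signed arithmetic.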

Answer 5

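Wait, you have a perfectly good, optimized assembly solution already working for this, and you want to back it out and try to write it in an environment that doesn't support 128-bit math? I'm not following.

As you're obviously aware, this operation is a single instruction on x86-64. Obviously, nothing you do is going to make it work any better. If you really want portable C, you'll need to do something like DigitalRoss's code above and hope that your optimizer figures out what you're doing.

If you need architecture portability but are willing to limit yourself to gcc platforms, there are __int128_t (and __uint128_t) types built into the compiler that will do what you want.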
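For the unsigned type mentioned here, a minimal sketch would look like the following. The name umul_hi128 is made up, and the code assumes a GCC-compatible compiler on a 64-bit target where __uint128_t exists.

```c
#include <stdint.h>

/* Assumes a GCC-compatible compiler on a 64-bit target, where the
   nonstandard __uint128_t type is available. */
static inline uint64_t umul_hi128(uint64_t x, uint64_t y) {
    /* Widen one operand so the multiply is done in 128 bits. */
    return (uint64_t)(((__uint128_t)x * y) >> 64);
}
```

On x86-64 I would expect this to become a single mulq at -O1 and up, but check the generated assembly on your own compiler.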

Computing the high 64 bits of a 64x64 int product in C
