Question

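In Visual C++, _umul128 is not defined when targeting 32-bit Windows. How can two unsigned 64-bit integers be multiplied when targeting Win32? The solution only needs to work with Visual C++ 2017 targeting 32-bit Windows.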

Answer 1

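This answer has a version of the xmrrig function from the other answer, optimized for MSVC 32-bit mode. The original version is fine with other compilers, especially clang.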

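I looked at MSVC's output for @Augusto's function, and it's really bad. Using __emulu for the 32x32 => 64b multiplications improved it significantly (because MSVC is dumb and doesn't optimize uint64_t * uint64_t = uint64_t for the case where the inputs are known to actually be only 32 bits wide with the upper half zero; other compilers (gcc and clang) generate a single mul instruction instead of calling a helper function). There are other problems with MSVC's code-gen that I don't know how to fix by tweaking the source, though. I think if you want good performance on that compiler, you'll have to use inline asm (or a separately compiled asm function).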

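If you need more flexible arbitrary precision (larger numbers), see GMPlib's low-level functions, which have asm implementations, instead of trying to build a 256b multiply out of this __umul128. But if you need exactly this, then it's worth trying. Sticking with C++ enables constant-propagation and CSE optimizations that you wouldn't get with asm.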

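clang compiles this with no major problems, actually using adc for all the add-with-carry steps (except one, which it saves with a setc instruction). MSVC branches on the carry checks and just makes nasty code. GCC doesn't do a very good job either, with some branching. (Because gcc doesn't know how to turn carry = sum < a into an adc, gcc bug 79173.)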
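The carry = sum < a idiom mentioned above can be sketched as follows. This is only an illustration: the helper names add_carry_u32 and add64_via_halves are made up for this example and are not part of the answer's code. Whether a compiler turns this into adc depends on the compiler, as described above.

```cpp
#include <cstdint>

// Hypothetical helper (name is an assumption for this sketch):
// computes a + b + carry_in, stores the 32-bit sum, returns the carry-out (0 or 1).
static inline unsigned add_carry_u32(uint32_t a, uint32_t b,
                                     unsigned carry_in, uint32_t *sum)
{
    uint32_t s = a + b;
    unsigned c1 = (s < a);        // unsigned add wrapped iff the sum is below an operand
    uint32_t t = s + carry_in;
    unsigned c2 = (t < s);        // adding the incoming carry may wrap as well
    *sum = t;
    return c1 | c2;               // both cannot be set at the same time
}

// Hypothetical 64-bit add built from two 32-bit halves, the way 32-bit code has to do it.
static inline uint64_t add64_via_halves(uint64_t x, uint64_t y, unsigned *carry_out)
{
    uint32_t lo, hi;
    unsigned c = add_carry_u32((uint32_t)x, (uint32_t)y, 0, &lo);
    c = add_carry_u32((uint32_t)(x >> 32), (uint32_t)(y >> 32), c, &hi);
    *carry_out = c;
    return ((uint64_t)hi << 32) | lo;
}
```

Ideally the second half compiles to an add followed by an adc; a compiler that doesn't recognize the pattern will materialize the comparison results or branch on them instead.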

IDK if MSVC or gcc support any intrinsics for add-with-carry on 64-bit integers in 32-bit mode. _addcarry_u64 generates poor code with gcc anyway (in 64-bit mode), but ICC might do OK. IDK about MSVC.
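For reference, the 32-bit sibling _addcarry_u32 is an x86 intrinsic that can be chained to add 64-bit values in 32-bit code. The sketch below assumes an x86 target (with a plain C++ fallback elsewhere); the wrapper name add64_with_intrinsic and the macro SKETCH_HAVE_ADDCARRY_U32 are made up for this example. It says nothing about how good the generated code is on any particular compiler.

```cpp
#include <cstdint>

// SKETCH_HAVE_ADDCARRY_U32 is a hypothetical macro for this example only.
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#  include <intrin.h>
#  define SKETCH_HAVE_ADDCARRY_U32 1
#elif defined(__i386__) || defined(__x86_64__)
#  include <immintrin.h>
#  define SKETCH_HAVE_ADDCARRY_U32 1
#else
#  define SKETCH_HAVE_ADDCARRY_U32 0
#endif

// Hypothetical wrapper: 64-bit add done as two chained 32-bit add-with-carry steps.
static inline uint64_t add64_with_intrinsic(uint64_t x, uint64_t y,
                                            unsigned char *carry_out)
{
#if SKETCH_HAVE_ADDCARRY_U32
    unsigned int lo, hi;
    // The low halves start with no incoming carry.
    unsigned char c = _addcarry_u32(0, (unsigned int)x, (unsigned int)y, &lo);
    // The high halves consume the carry produced by the low halves.
    c = _addcarry_u32(c, (unsigned int)(x >> 32), (unsigned int)(y >> 32), &hi);
    *carry_out = c;
    return ((uint64_t)hi << 32) | lo;
#else
    // Fallback for non-x86 targets: the compare-based carry idiom.
    uint64_t s = x + y;
    *carry_out = (unsigned char)(s < x);
    return s;
#endif
}
```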

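If you want an asm implementation, I'd suggest using clang 5.0's output for this function. You could probably find some optimizations by hand, but it's certainly better than MSVC's. But of course most of the arguments in https://gcc.gnu.org/wiki/DontUseInlineAsm apply: blocking constant-propagation is a major problem if you ever multiply by something that inlining turns into a constant, or by a number whose upper half is known to be zero.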

Source + asm output for MSVC 32-bit and clang5.0 32-bit on Godbolt

Nice code with clang. Kinda bad code with MSVC, but better than before. Kinda bad with gcc also (no change vs. the other answer).

#include <stdint.h>

#ifdef _MSC_VER
#  include <intrin.h>
#else
// MSVC doesn't optimize 32x32 => 64b multiplication without its intrinsic
// But good compilers can just use this to get a single mul instruction
static inline
uint64_t __emulu(uint32_t x, uint32_t y) {
     return x * (uint64_t)y;
}
#endif

// This is still pretty ugly with MSVC, branching on the carry
//  and using XMM store / integer reload to zero a register!
// But at least it inlines 4 mul instructions
//  instead of calling a generic 64x64 => 64b multiply helper function
uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{
    // multiplier   = ab = a * 2^32 + b
    // multiplicand = cd = c * 2^32 + d
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d
    uint64_t a = multiplier >> 32;
    uint64_t b = (uint32_t)multiplier; // & 0xFFFFFFFF;
    uint64_t c = multiplicand >> 32;
    uint64_t d = (uint32_t)multiplicand; // & 0xFFFFFFFF;

    //uint64_t ac = __emulu(a, c);
    uint64_t ad = __emulu(a, d);
    //uint64_t bc = __emulu(b, c);
    uint64_t bd = __emulu(b, d);

    uint64_t adbc = ad + __emulu(b , c);
    uint64_t adbc_carry = (adbc < ad); // ? 1 : 0;
    // MSVC gets confused by the ternary and makes worse code than using a boolean in an integer context for 1 : 0

    // multiplier * multiplicand = product_hi * 2^64 + product_lo
    uint64_t product_lo = bd + (adbc << 32);
    uint64_t product_lo_carry = (product_lo < bd); // ? 1 : 0;
    *product_hi = __emulu(a , c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry;

    return product_lo;
}
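One way to sanity-check a function like the one above is to compare it against an independently written reference on inputs whose products are known. The sketch below is not the answer's code: ref_umul128 and check_known_products are hypothetical names, and the reference uses a different carry scheme (it sums the middle column in a 64-bit temporary, so no carry comparisons are needed).

```cpp
#include <cstdint>

// Hypothetical reference implementation (not from the answer): schoolbook
// multiply on 32-bit halves, with the middle column accumulated in 64 bits.
static uint64_t ref_umul128(uint64_t x, uint64_t y, uint64_t *hi)
{
    uint64_t x_lo = (uint32_t)x, x_hi = x >> 32;
    uint64_t y_lo = (uint32_t)y, y_hi = y >> 32;

    uint64_t ll = x_lo * y_lo;   // contributes at bit 0
    uint64_t lh = x_lo * y_hi;   // contributes at bit 32
    uint64_t hl = x_hi * y_lo;   // contributes at bit 32
    uint64_t hh = x_hi * y_hi;   // contributes at bit 64

    // Three values below 2^32 are summed, so this cannot overflow 64 bits.
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

    *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return (mid << 32) | (uint32_t)ll;
}

// Checks a few products that are easy to verify by hand.
static bool check_known_products()
{
    uint64_t hi = 0;

    // (2^64 - 1)^2 = 2^128 - 2^65 + 1  ->  hi = 2^64 - 2, lo = 1
    uint64_t lo = ref_umul128(UINT64_MAX, UINT64_MAX, &hi);
    if (lo != 1 || hi != 0xFFFFFFFFFFFFFFFEull) return false;

    // 2^32 * 2^32 = 2^64  ->  hi = 1, lo = 0
    lo = ref_umul128(0x100000000ull, 0x100000000ull, &hi);
    if (lo != 0 || hi != 1) return false;

    // (2^32 - 1)^2 fits in 64 bits  ->  hi = 0
    lo = ref_umul128(0xFFFFFFFFull, 0xFFFFFFFFull, &hi);
    if (lo != 0xFFFFFFFE00000001ull || hi != 0) return false;

    return true;
}
```

The same inputs can be fed to __umul128 and the two results compared; the all-ones case is useful because it exercises the carry out of the low 64 bits.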

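Make sure you only use this in 32-bit code. In 64-bit code, it fails to optimize to a single 64-bit mul instruction (which produces both 64-bit halves of the full result). Compilers that implement the GNU C++ extensions (clang, gcc, ICC) can use unsigned __int128 and get good code, e.g. a * (unsigned __int128)b produces a 128b result. (Example on Godbolt.)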
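A minimal sketch of the unsigned __int128 approach, assuming a 64-bit target and a compiler that provides the GNU extension (detected here via the predefined __SIZEOF_INT128__ macro). The function name umul128_int128 is made up for this example.

```cpp
#include <cstdint>

#ifdef __SIZEOF_INT128__
// Hypothetical wrapper: full 64x64 => 128-bit multiply via the GNU extension type.
// Only one operand needs the cast; the other is converted to match.
static inline uint64_t umul128_int128(uint64_t a, uint64_t b, uint64_t *product_hi)
{
    unsigned __int128 product = a * (unsigned __int128)b;
    *product_hi = (uint64_t)(product >> 64);   // upper 64 bits
    return (uint64_t)product;                  // lower 64 bits
}
#endif
```

On 32-bit targets these compilers generally do not provide __int128, so the guard leaves the function undefined there and the split-into-halves version remains the fallback.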

Answer 2

I found the following code (from xmrrig), which seems to do the job just fine:

static inline uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{
    // multiplier   = ab = a * 2^32 + b
    // multiplicand = cd = c * 2^32 + d
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d
    uint64_t a = multiplier >> 32;
    uint64_t b = multiplier & 0xFFFFFFFF;
    uint64_t c = multiplicand >> 32;
    uint64_t d = multiplicand & 0xFFFFFFFF;

    //uint64_t ac = a * c;
    uint64_t ad = a * d;
    //uint64_t bc = b * c;
    uint64_t bd = b * d;

    uint64_t adbc = ad + (b * c);
    uint64_t adbc_carry = adbc < ad ? 1 : 0;

    // multiplier * multiplicand = product_hi * 2^64 + product_lo
    uint64_t product_lo = bd + (adbc << 32);
    uint64_t product_lo_carry = product_lo < bd ? 1 : 0;
    *product_hi = (a * c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry;

    return product_lo;
}
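The carry handling in the last few lines of the function can be justified with a short derivation. The symbols x, y, s, k, l, lo and hi are introduced here for illustration: x and y are the two inputs, s corresponds to adbc, k to adbc_carry, l to product_lo_carry, lo to product_lo and hi to *product_hi.

```latex
\begin{align*}
x &= a\,2^{32} + b, \qquad y = c\,2^{32} + d, \qquad 0 \le a, b, c, d < 2^{32} \\
xy &= ac\,2^{64} + (ad + bc)\,2^{32} + bd \\
ad + bc &= k\,2^{64} + s, \qquad k \in \{0, 1\},\quad 0 \le s < 2^{64} \\
(ad + bc)\,2^{32} &= k\,2^{96} + \lfloor s / 2^{32} \rfloor\,2^{64} + (s \bmod 2^{32})\,2^{32} \\
bd + (s \bmod 2^{32})\,2^{32} &= l\,2^{64} + \mathit{lo}, \qquad l \in \{0, 1\} \\
\mathit{hi} &= ac + \lfloor s / 2^{32} \rfloor + k\,2^{32} + l
\end{align*}
```

Since ad + bc is at most 2(2^32 - 1)^2, which is below 2^65, a single carry bit k is enough. That bit has weight 2^96 in the full product, which is 2^32 relative to the upper 64-bit word, which is why the code shifts adbc_carry left by 32 before adding it.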

Windows 32-bit _umul128
