Question

I'm trying to compile the following code sample (NASM syntax) from this article on x86 assembly floating point:

;; c^2 = a^2 + b^2 - cos(C)*2*a*b
;; C is stored in ang

global _start

section .data
    a: dq 4.56   ;length of side a
    b: dq 7.89   ;length of side b
    ang: dq 1.5  ;opposite angle to side c (around 85.94 degrees)

section .bss
    c: resq 1    ;the result ‒ length of side c

section .text
    _start:

    fld qword [a]   ;load a into st0
    fmul st0, st0   ;st0 = a * a = a^2

    fld qword [b]   ;load b into st1
    fmul st1, st1   ;st1 = b * b = b^2

    fadd st1, st0   ;st1 = a^2 + b^2

    fld qword [ang] ;load angle into st0
    fcos            ;st0 = cos(ang)

    fmul qword [a]  ;st0 = cos(ang) * a
    fmul qword [b]  ;st0 = cos(ang) * a * b
    fadd st0, st0   ;st0 = cos(ang) * a * b + cos(ang) * a * b = 2(cos(ang) * a * b)

    fsubp st1, st0  ;st1 = st1 - st0 = (a^2 + b^2) - (2 * a * b * cos(ang))
                    ;and pop st0

    fsqrt           ;take square root of st0 = c

    fst qword [c]   ;store st0 in c ‒ and we're done!

When I run the following command:

nasm -f elf32 cosineSample.s -o cosineSample.o

I get the following error on the line fmul st1, st1:

error: invalid combination of opcode and operands

What do I need to do to resolve this? Do I need to pass special arguments to nasm? Is the code sample wrong?

Answer 1

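Unfortunately, that code is broken. fmul cannot operate on st1, st1, but even if it could, it would not do what the author wanted. According to the comment, he wanted to calculate b*b, but b is in st0 at that point. The comment load b into st1 is wrong: fld always loads into st0 (the top of the stack). You need to change fmul st1, st1 to fmul st0, st0. Furthermore, to get the correct result, the following fadd st1, st0 has to be reversed too. The code also leaves the FPU stack dirty.

Also note that the program has no ending, so it will segfault unless you add an explicit exit system call.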

Here is the fixed code, converted to GNU assembler syntax:

[code listing not preserved]

Answer 2

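I fixed the code on Wikibooks and added some extra comments (Jester's answer is good), so it now assembles and runs correctly (tested with GDB, single-stepping with layout ret / tui reg float). This is the diff between revisions. The revision that introduced the invalid fmul st1,st1 instruction is here, but even before that, the code failed to clear the x87 stack when it was done.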

Just for fun, I wanted to write a more efficient version that loads a and b only once each.

It also allows more instruction-level parallelism by doing everything that does not involve the cos result first, i.e. it prepares 2*a*b before multiplying that by cos(ang), so those calculations can run in parallel with fcos. Assuming fcos is the critical path, my version has only one fmul and one fsubp of latency from the fcos result to the fsqrt input.

default rel   ; in case we assemble this in 64-bit mode, use RIP-relative addressing

... declare stuff, omitted.

    fld qword [a]   ;load a into st0
    fld st0         ;   st1 = a  because we'll need it again later.
    fmul st0, st0   ;st0 = a * a = a^2

    fld qword [b]   ;load b into st0   (pushing the a^2 result up to st1)
    fmul st2, st0   ;   st2 = a*b
    fmul st0, st0   ;st0 = b^2,   st1 = a^2,  st2 = a*b

    faddp           ;st0 = a^2 + b^2   st1 = a*b;  st2 empty
    fxch st1        ;st0 = a*b         st1 = a^2 + b^2
    ; could avoid this, but only by using cos(ang) earlier, worse for critical path latency
    fadd st0,st0    ;st0 = 2*a*b       st1 = a^2 + b^2

    fld qword [ang]
    fcos            ;st0 = cos(ang)    st1 = 2*a*b   st2 = a^2+b^2
    fmulp           ;st0=cos(ang)*2*a*b   st1 = a^2+b^2

    fsubp st1, st0  ;st0 = (a^2 + b^2) - (2 * a * b * cos(ang))
    fsqrt           ;take square root of st0 = c
    fstp qword [c]  ;store st0 in c and pop, leaving the x87 stack empty again ‒ and we're done!

Of course, x87 is pretty much obsolete. On modern x86, you would normally use SSE2 scalar (or packed!) for anything floating point.

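x87 has two things going for it on modern x86: 80-bit precision in hardware (vs. 64-bit double), and small code size (machine-code bytes, not number of instructions or source size). Good instruction caches usually mean that code size is not an important enough factor to make x87 worth it for FP performance, because it is generally slower than SSE2 due to the extra instructions needed to deal with the clunky x87 stack.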

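For beginners, or for code-size reasons, x87 has transcendental functions like fcos and fsin, and log / exp, built in as single instructions. They are microcoded with many uops and are probably not faster than a scalar library function, but on some CPUs you might be fine with the speed / precision tradeoff they make, and with their absolute speed. That holds at least if you are using x87 in the first place; otherwise you have to bounce results to and from XMM registers with a store / reload.

Range reduction for sin / cos does not do any extended-precision work to avoid huge relative errors very near a multiple of Pi; it just uses the internal 80-bit (64-bit significand) value of Pi. (A library implementation might or might not do that, depending on the desired speed vs. precision tradeoff.) See Intel Underestimates Error Bounds by 1.3 quintillion.

(And of course, x87 in 32-bit code gives you compatibility with Pentium III and other CPUs that lack SSE2 for double, having only SSE1 for float or no XMM registers at all. x86-64 has SSE2 as a baseline, so this advantage does not exist on x86-64.)

For beginners, the huge downside of x87 is keeping track of the x87 stack registers and not letting stuff accumulate. You can easily end up with code that works once but gives NaN when you put it in a loop, because you did not balance your x87 stack operations.

extern cos

global cosine_law_sse2_scalar
cosine_law_sse2_scalar:
    ; note: the x86-64 System V ABI wants RSP 16-byte aligned at a call;
    ; real code would wrap the call in sub rsp,8 / add rsp,8
    movsd  xmm0, [ang]
    call   cos           ; xmm0 = cos(ang).  Avoid using this right away so OoO exec can do the rest of the work in parallel
    movsd  xmm1, [a]
    movsd  xmm2, [b]
    movaps xmm3, xmm1    ; copying registers should always copy the full reg, not movsd merging into the old value.
    mulsd  xmm3, xmm2    ; xmm3 = a*b
    mulsd  xmm1, xmm1    ; a^2
    mulsd  xmm2, xmm2    ; b^2
    addsd  xmm3, xmm3    ; 2*a*b
    addsd  xmm1, xmm2    ; a^2 + b^2
    mulsd  xmm3, xmm0    ; 2*a*b*cos(ang)
    subsd  xmm1, xmm3    ; (a^2 + b^2) - 2*a*b*cos(ang)
    sqrtsd xmm0, xmm1    ; sqrt(that), in xmm0 as a return value
    ret
;; This has the work interleaved more than necessary for most CPUs to find the parallelism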

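This version has only 11 uops after the call cos returns (https://agner.org/optimize/). It is pretty compact and pretty simple, with no x87 stack to keep track of. And it has the same nice dependency chains as the x87 version, not using the cos result until we already have 2*a*b.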

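We can even play around with loading a and b together as one 128-bit vector. But then unpacking it to do different things with the two halves, or to get b^2 from the top element as a scalar, is clunky. If SSE3 haddpd were only 1 uop, that would be great (it would let us do a*b + a*b and a^2 + b^2 with one instruction, given the right inputs), but on all CPUs that have it, it is 3 uops.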

(PS vs. PD only matters for actual math instructions like MULSS / SD. For FP shuffles and register copies, just use whichever FP instruction gets your data where you want it, with a preference for PS / SS because they have shorter machine-code encodings. That is why I used movaps; movapd is always a missed optimization that wastes 1 byte, unless you are making instructions longer on purpose for alignment.)

;; I didn't actually end up using SSE3 for movddup or haddpd, it turned out I couldn't save uops that way.

global cosine_law_sse3_less_shuffle
cosine_law_sse3_less_shuffle:
   ;; 10 uops after the call cos, if both extract_high_half operations use pshufd or let movhlps have a false dependency
   ;; or if we had AVX for  vunpckhpd  xmm3, xmm1,xmm1
   ;; and those 10 are a mix of shuffle and MUL/ADD.
    movsd  xmm0, [ang]
    call   cos           ; xmm0 = cos(ang).  Avoid using this right away so OoO exec can do the rest of the work in parallel
    movups xmm1, [a]     ; {a, b}  (they were in contiguous memory in this order.  low element = a)
    movaps xmm3, xmm1

    ; xorps  xmm3, xmm3   ; break false dependency by zeroing.  (xorps+movhlps is maybe better than movaps + unpckhpd, at least on SnB but maybe not Bulldozer / Ryzen)
    ; movhlps xmm3, xmm1  ; xmm3 = b
    ; pshufd xmm3, xmm1, 0b01001110 ; xmm3 = {b, a}  ; bypass delay on Nehalem, but fine on most others

    mulsd  xmm3, [b]     ; xmm3 = a*b   ; reloading b is maybe cheaper than shufling it out of the high half of xmm1
    addsd  xmm3, xmm3    ; 2*b*a
    mulsd  xmm3, xmm0    ; 2*b*a*cos(ang)

    mulpd  xmm1, xmm1    ; {a^2, b^2}

    ;xorps  xmm2, xmm2   ; we don't want to just use xmm0 here; that would couple this dependency chain to the slow cos(ang) critical path sooner.
    movhlps xmm2, xmm1
    addsd  xmm1, xmm2    ; a^2 + b^2

    subsd  xmm1, xmm3    ; (a^2 + b^2) - 2*a*b*cos(ang)
    sqrtsd xmm0, xmm1    ; sqrt(that), in xmm0 as a return value
    ret

And we can do even better with AVX, saving a MOVAPS register copy, because the 3-operand non-destructive VEX versions of the instructions let us put the result in a new register without destroying either input. This is really great for FP shuffles, because SSE does not have any copy-and-shuffle for FP operands, only pshufd, which can cause extra bypass latency on some CPUs. So it saves the MOVAPS and the (commented-out) XORPS that breaks the dependency on whatever produced the old value of XMM2 for MOVHLPS. (MOVHLPS replaces the low 64 bits of the destination with the high 64 bits of the source, so it has an input dependency on both registers.)

global cosine_law_avx
cosine_law_avx:
   ;; 9 uops after the call cos.  Reloading [b] is good here instead of shuffling it, saving total uops / instructions
    vmovsd    xmm0, [ang]
    call      cos              ; xmm0 = cos(ang).  Avoid using this right away so OoO exec can do the rest of the work in parallel
    vmovups   xmm1, [a]        ; {a, b}  (they were in contiguous memory in this order.  low element = a)

    vmulsd    xmm3, xmm1, [b]  ; xmm3 = a*b

    vaddsd    xmm3, xmm3       ; 2*b*a.   (really vaddsd xmm3,xmm3,xmm3  but NASM lets us shorten when dst=src1)
    vmulsd    xmm3, xmm0       ; 2*b*a*cos(ang)

    vmulpd    xmm1, xmm1       ; {a^2, b^2}

    vunpckhpd xmm2, xmm1,xmm1  ; xmm2 = { b^2, b^2 }
    vaddsd    xmm1, xmm2       ; a^2 + b^2

    vsubsd    xmm1, xmm3       ; (a^2 + b^2) - 2*a*b*cos(ang)
    vsqrtsd   xmm0, xmm1,xmm1  ; sqrt(that), in xmm0 as a return value.  (Avoiding an output dependency on xmm0, even though it was an ancestor in the dep chain.  Maybe lets the CPU free that physical reg sooner)
    ret

I only tested the first x87 version, so it is possible I missed a step in one of the others.

NASM floating point - invalid combination of opcode and operands
