Sometimes gcc uses 32-bit registers where I would expect it to use 64-bit registers. For example, the following C code:
unsigned long long
div(unsigned long long a, unsigned long long b){
return a/b;
}
compiled with -O2 (omitting some boilerplate):
div:
movq %rdi, %rax
xorl %edx, %edx
divq %rsi
ret
For unsigned division, the register %rdx must be 0. That could be done with xorq %rdx, %rdx, but xorl %edx, %edx seems to have the same effect. At least on my machine, there was no performance gain (i.e. speedup) from xorl over xorq.
I actually have more than one question:
Why does gcc stop at xorl and not use xorw?
Are there machines for which xorl is faster than xorq?
Answer 0 (score: 14):
Writing to a 32-bit register in 64-bit mode zeroes the upper 32 bits, so xorl %edx, %edx zeroes the upper half of rdx "for free".
On the other hand, xor %rdx, %rdx takes an extra byte to encode because it needs a REX prefix: 48 31 d2 (3 bytes) versus 31 d2 (2 bytes) for the 32-bit form.
When the goal is to zero a 64-bit register, addressing it as a 32-bit register is a clear win.
Answer 1 (score: 8):
Why does gcc prefer the 32bit version?
Code size: no REX prefix needed.
Why does gcc stop at xorl and not use xorw?
Writing a 16bit partial register doesn't zero-extend to the rest of the register. Besides, xorw
requires an operand-size prefix to encode, so it's larger than xorl
. (See also Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register? for historical background)
See also Why doesn't GCC use partial registers? 32-bit registers are not considered partial registers, because writing them always writes the whole 64-bit register. (And it's writing partial regs that's the main problem, not reading them after a full-width write.)
Are there machines for which xorl is faster than xorq?
Yes, Silvermont / KNL only recognize xor
-zeroing as a zeroing idiom (dependency breaking, and other good stuff) with 32-bit operand size. Thus, even though code-size is the same, xor %r10d, %r10d
is much better than xor %r10, %r10
. (xor
needs a REX prefix for r10
regardless of operand-size).
On all CPUs, code size always potentially matters for decode and I-cache footprint (except when a later .p2align
directive would just emit more padding because the preceding code is smaller; see footnote 1). There's no downside to using 32-bit operand size for xor-zeroing (or to implicit zero-extension in general instead of explicit; see footnote 2), including using AVX vpxor xmm0,xmm0,xmm0
to zero AVX512 zmm0.
Most instructions are the same speed for all operand-sizes, because modern x86 CPUs can afford the transistor budget for wide ALUs. Exceptions: imul r64,r64 is slower than imul r32,r32 on AMD CPUs before Ryzen and on Intel Atom, and 64-bit div is significantly slower than 32-bit div on all CPUs. AMD pre-Ryzen also has slower popcnt r64, and Atom/Silvermont have slow shld/shrd r64 vs. r32.
Should one always prefer 32bit register/operations if possible rather than 64bit register/operations?
Yes, prefer 32-bit ops for code-size reasons at least, but note that using r8..r15 anywhere in an instruction (including an addressing mode) will also require a REX prefix. So if you have some data you can use 32-bit operand-size with (or pointers to 8/16/32-bit data), prefer to keep it in the low 8 named registers (e/rax..) rather than high 8 numbered registers.
But don't spend extra instructions to make this happen; saving a few bytes of code-size is usually the least important consideration. e.g. just use r8d
instead of saving/restoring rbx
so you can use ebx
if you need an extra register that doesn't have to be call-preserved. Using 32-bit r8d
instead of 64-bit r8
won't help with code-size, but it can be faster for some operations on some CPUs (see above).
This also applies to cases where you only care about the low 16 bits of a register: it can still be more efficient to use a 32-bit add instead of a 16-bit one.
See also http://agner.org/optimize/ and the x86 tag wiki.
Footnote 1: There are rare use-cases for making instructions longer than necessary (What methods can be used to efficiently extend instruction length on modern x86?)
To align a later branch target without needing a NOP.
Tuning for the front-end of a specific microarchitecture (i.e. optimizing decode by controlling where instructions boundaries are). Inserting NOPs would cost extra front-end bandwidth and completely defeat the whole purpose.
Assemblers won't do this for you, and doing it by hand is time consuming to re-do every time you change anything (and you may have to use .byte
directives to manually encode the instruction).
Footnote 2: I've found one exception to the rule that implicit zero-extension is at least as cheap as a wider operation: Haswell/Skylake AVX 128-bit loads being read by a 256-bit instruction have an extra 1c of store-forwarding latency vs. being consumed by a 128-bit instruction. (Details in a thread on Agner Fog's blog forum.)