I have a very small loop program that prints the numbers from 5000000 down to 1. I want to make it run as fast as possible.
I'm learning Linux x86-64 assembly with NASM.
global main
extern printf
main:
push rbx
mov rax,5000000d
print:
push rax
push rcx
mov rdi, format
mov rsi, rax
call printf
pop rcx
pop rax
dec rax
jnz print
pop rbx
ret
format:
db "%ld", 10, 0
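Answer 0 (score: 3)
The calls to printf completely dominate the run time of even a very inefficient loop. (Did you notice that you push/pop rcx even though you never use it anywhere? Maybe that's left over from using the slow LOOP instruction.)
To learn more about writing efficient x86 asm, see Agner Fog's Optimizing Assembly guide. (And his microarchitecture guide, too, if you want to get into the details of specific CPUs and how they differ: what's optimal on one uarch might not be on another. For example, IMUL r64 has better throughput and latency on Intel CPUs than on AMD, but CMOV and ADC are 2 uops with 2-cycle latency on Intel pre-Broadwell, vs. 1 on AMD, since 3-input ALU m-ops (FLAGS + both registers) aren't a problem for AMD.) Also see the other links in the x86 tag wiki.
Purely optimizing the loop without changing the 5M calls to printf is only useful as an example of how to write a loop properly, not as a way to actually speed up this code. But let's start with that: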
; trivial fixes to loop efficiently while calling the same slow function
global main
extern printf
main:
push rbx
mov ebx, 5000000 ; don't waste a REX prefix for constants that fit in 32 bits
.print:
;; removed the push/pops from inside the loop.
; Use call-preserved regs instead of saving/restoring stuff inside a loop yourself.
mov edi, format ; static data / code has a 32-bit address in a non-PIE executable (link with -no-pie; in a PIE use lea rdi, [rel format] instead)
mov esi, ebx
xor eax, eax ; The x86-64 SysV ABI requires al = number of FP args passed in FP registers for variadic functions
call printf
dec ebx
jnz .print
pop rbx ; restore rbx, the one call-preserved reg we actually used.
xor eax,eax ; successful exit status.
ret
section .rodata ; it's usually best to put constant data in a separate section of the text segment, not right next to code.
format:
db "%ld", 10, 0
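To speed this up, we should take advantage of the redundancy in converting consecutive integers to strings. Since "5000000\n" is only 8 bytes long (including the newline), the string representation fits in a 64-bit register.
We can store that string into a buffer and increment a pointer by the string length. (Since the string gets shorter for smaller numbers, just keep the current string length in a register, which you can update in the special-case branch that handles the length change.)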
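As a rough illustration of the "8-byte string in a register, pointer bumped by the string length" idea, here is a minimal C sketch. The helper names `pack8` and `store_advance` are made up for this example, and it assumes a little-endian host (as x86-64 is), so the first character ends up in the low byte.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pack up to 8 characters into a 64-bit integer.
 * On a little-endian host the first character lands in the low byte. */
static uint64_t pack8(const char *s, size_t len)
{
    uint64_t v = 0;
    assert(len <= 8);
    memcpy(&v, s, len);
    return v;
}

/* Hypothetical helper: always store the full 8 bytes at p, but advance by
 * len only. When len < 8 the next store overwrites the unused tail, so the
 * buffer needs a few bytes of slack after the last string. */
static char *store_advance(char *p, uint64_t v, size_t len)
{
    memcpy(p, &v, sizeof v);
    return p + len;
}
```

Calling `store_advance` three times with "100\n", "99\n" and "98\n" leaves "100\n99\n98\n" at the start of the buffer, even though every store wrote 8 bytes.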
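We can decrement the string representation in place, so we never have to (re)do the divide-by-10 process that turns an integer into a decimal string.
Since carry/borrow doesn't naturally propagate between digits inside a register, and the AAS instruction isn't available in 64-bit mode (and only worked on AX, not even EAX, and is slow), we have to do it ourselves. We're decrementing by 1 every time, so we know exactly what's going to happen. We can handle the least-significant digit by unrolling 10 times, so no branching is needed for it.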
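To make the manual borrow handling concrete, here is a byte-array sketch in C. It is not the register-based version described above, just the same logic spelled out; `dec_ascii` and `emit_ten` are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: decrement the ASCII decimal number in
 * digits[0..len-1] (most-significant digit first, no leading zeros,
 * value >= 1) by one. Returns the new length, which shrinks when
 * e.g. "1000" becomes "999". */
static size_t dec_ascii(char *digits, size_t len)
{
    size_t i = len;
    assert(len > 0);
    /* Propagate the borrow by hand: every trailing '0' becomes '9'. */
    while (i > 0 && digits[i - 1] == '0')
        digits[--i] = '9';
    assert(i > 0);                 /* the value was zero: caller error */
    digits[i - 1]--;               /* first nonzero digit absorbs the borrow */
    if (digits[0] == '0' && len > 1) {
        memmove(digits, digits + 1, len - 1);   /* drop the leading zero */
        len--;
    }
    return len;
}

/* Hypothetical helper: the "unroll by 10" step. Given a number ending in
 * '9', emit the ten lines ending in 9, 8, ..., 0 with no borrow checks.
 * On return, digits ends in '0'; call dec_ascii() to get the next block. */
static char *emit_ten(char *out, char *digits, size_t len)
{
    assert(len > 0 && digits[len - 1] == '9');
    for (int d = 0; d < 10; d++) {
        memcpy(out, digits, len);
        out[len] = '\n';
        out += len + 1;
        if (d < 9)
            digits[len - 1]--;     /* only the last digit changes */
    }
    return out;
}
```

Only `dec_ascii` contains data-dependent branches, and it runs once per ten numbers, which is the point of the unrolling.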
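Also note that since we want the digits in printing order, carry goes in the wrong direction anyway, because x86 is little-endian. If there was a good way to take advantage of having our string in the other byte order, we could use BSWAP or MOVBE. (But note that MOVBE r64 is 3 fused-domain uops on Skylake, 2 of them ALU uops. BSWAP r64 is also 2 uops.)
Perhaps we should be doing odd/even counters in parallel, in the two halves of an XMM vector register. But that stops working well once the string is shorter than 8B. When storing one number-string at a time, we can easily overlap the stores. Still, we could do the carry propagation in a vector register and store the two halves separately with MOVQ and MOVHPS. Or, since 4/5ths of the numbers from 0 to 5M have 7 digits, it might be worth having special-case code for when we can store a whole 16B vector of two numbers.
A better way to handle shorter strings: use SSSE3 PSHUFB to shuffle the two strings so they're left-packed in a vector register, then use a single MOVUPS to store both at once. The shuffle mask only needs to be updated when the string length (number of digits) changes, so the infrequently-executed carry-handling special-case code can do that, too.
Vectorizing the hot part of the loop should be very straightforward and cheap, and should just about double performance.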
;;; Optimized version: keep the string data in a register and modify it
;;; instead of doing the whole int->string conversion every time.
section .bss
printbuf: resb 1024*128 + 4096 ; Buffer size ~= half L2 cache size on Intel SnB-family. Or use a giant buffer that we write() once. Or maybe vmsplice to give it away to the kernel, since we only run once.
section .text ; switch back to the code section; otherwise main would be assembled into .bss
global main
extern printf
main:
push rbx
; use some REX-only regs for values that we're always going to use a REX prefix with anyway for 64-bit operand size.
mov rdx, `5000000\n` ; (NASM string constants as integers work like little-endian, so DL = '5' = 0x35 and the high byte holds '\n' = 10). Note that YASM doesn't support back-ticks for C-style backslash processing.
mov r9, 1<<48 ; decrement by 1 in the 2nd-last byte (byte 6; byte 7 holds the newline): LSB of the decimal string
;xor r9d, r9d
;bts r9, 48 ; IDK if this code-size optimization outside the loop would help or not.
mov eax, 8 ; string length.
mov edi, printbuf
.storeloop:
;; rdx = "????x9\n". We compute the start value for the next iteration, i.e. counter -= 10 in rdx.
mov r8, rdx
;; r8 = rdx. We modify it to have each last digit from 9 down to 0 in sequence, and store those strings in the buffer.
;; The string could be any length, always with the first ASCII digit in the low byte; our other constants are adjusted correctly for it
;; narrower than 8B means that our stores overlap, but that's fine.
;; Starting from here to compute the next unrolled iteration's starting value takes the `sub r8, r9` instructions off the critical path, vs. if we started from r8 at the bottom of the loop. This gives out-of-order execution more to play with.
;; It means each loop iteration's sequence of subs and stores are a separate dependency chain (except for the store addresses, but OOO can get ahead on those because we only pointer-increment every 2 stores).
mov [rdi], r8
sub r8, r9 ; r8 = "xxx8\n"
mov [rdi + rax], r8 ; defer p += len by using a 2-reg addressing mode
sub r8, r9 ; r8 = "xxx7\n"
lea edi, [rdi + rax*2] ; if we had len*3 in another reg, we could defer this longer
;; our static buffer is guaranteed to be in the low 31 bits of address space so we can safely save a REX prefix on the LEA here. Normally you shouldn't truncate pointers to 32-bits, but you asked for the fastest possible. This won't hurt, and might help on some CPUs, especially with possible decode bottlenecks.
;; repeat that block 3 more times.
;; using a short inner loop for the 9..0 last digit might be a win on some CPUs (like maybe Core2), depending on their front-end loop-buffer capabilities if the frontend is a bottleneck at all here.
;; anyway, then for the last one:
mov [rdi], r8 ; r8 = "xxx1\n"
sub r8, r9
mov [rdi + rax], r8 ; r8 = "xxx0\n"
lea edi, [rdi + rax*2]
;; compute next iteration's RDX. It's probably a win to interleave some of this into the loop body, but out-of-order execution should do a reasonably good job here.
mov rcx, r9
shr rcx, 8 ; maybe hoist this constant out, too
; rcx = 1 in the second-lowest digit
sub rdx, rcx
; detect carry when '0' (0x30) - 1 = 0x2F by checking the low bit of the high nibble (bit 4) in that byte: it is set for '0'..'9' and clear for 0x2F.
shl rcx, 4
test rdx, rcx
jz .carry_second_digit
; .carry_second_digit is some complicated code to propagate carry as far as it needs to go, up to the most-significant digit.
; when it's done, it re-enters the loop at the top, with eax and r9 set appropriately.
; it only runs once per 100 digits, so it doesn't have to be super-fast
; maybe only do buffer-length checks in the carry-handling branch,
; in which case the jz .carry can be jnz .storeloop
cmp edi, esi ; } while(p < endp); esi = end-of-buffer pointer, which this sketch never sets up
jb .storeloop
; write() system call on the buffer.
; Maybe need a loop around this instead of doing all 5M integer-strings in one giant buffer.
pop rbx
xor eax,eax ; successful exit status.
ret
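This isn't fully fleshed out, but it should give an idea of what might work well.
If vectorizing with SSE2, probably use a scalar integer register to keep track of when you need to break out and handle carry, i.e. a down-counter from 10.
Even this scalar version probably comes close to sustaining one store per clock, which saturates the store port. They're only 8B stores (and when the string gets shorter, the useful part is shorter than that), so we're definitely leaving performance on the table if we aren't bottlenecked on cache misses. But with a 3GHz CPU and dual-channel DDR3-1600 (~25.6GB/s theoretical max bandwidth), 8B per clock (24GB/s) is about enough to saturate main memory with a single core.
We could parallelize this and break the 5M .. 1 range up into chunks. With some clever math, we can figure out which byte offset the first character of "2500000\n" should be written at, or we could have each thread call write() itself in the right order. (Or use the same clever math to have them call pwrite(2) independently with different file offsets, so the kernel takes care of all the synchronization for multiple writers to the same file.)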
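The "clever math" can be sketched in C as follows. This is only one way to do it: count the bytes per digit-length class, then take a difference. The function names are hypothetical, and the code assumes values far below the point where `lo * 10` would overflow a 64-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to print every integer 1..m, one per line ("%d\n"). */
static uint64_t bytes_upto(uint64_t m)
{
    uint64_t total = 0;
    uint64_t lo = 1;              /* smallest number with `digits` digits */
    unsigned digits = 1;
    while (lo <= m) {
        uint64_t hi = lo * 10 - 1;    /* largest number with `digits` digits */
        if (hi > m)
            hi = m;
        total += (hi - lo + 1) * (digits + 1);   /* +1 for the newline */
        lo *= 10;
        digits++;
    }
    return total;
}

/* Byte offset of the first character of n's line when the output is
 * top, top-1, ..., 1 (so top itself is at offset 0). */
static uint64_t offset_descending(uint64_t top, uint64_t n)
{
    assert(n >= 1 && n <= top);
    return bytes_upto(top) - bytes_upto(n);
}
```

For the range in the question, `offset_descending(5000000, 2500000)` is 20000000, because all 2.5M numbers above it take 8 bytes each. A thread responsible for 2500000 .. 1 could pass that value as the offset argument to pwrite(2).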
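Answer 1 (score: 1)
You're essentially printing a fixed string. I would pre-generate that string as one long constant.
The program then becomes a single call to write (or a short loop that deals with incomplete writes).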
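A minimal C sketch of that approach, under the assumption that the string is generated once at startup into a caller-supplied buffer rather than embedded as a compile-time constant. `fill_descending` and `write_all` are hypothetical helper names; only `snprintf` and `write(2)` are real library/system calls.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: fill buf with "top\n(top-1)\n...1\n" and return the
 * number of bytes used. The caller must supply a large enough buffer. */
static size_t fill_descending(char *buf, size_t cap, unsigned long top)
{
    size_t used = 0;
    for (unsigned long n = top; n >= 1; n--) {
        int len = snprintf(buf + used, cap - used, "%lu\n", n);
        assert(len > 0 && (size_t)len < cap - used);
        used += (size_t)len;
    }
    return used;
}

/* Hypothetical helper: keep calling write(2) until everything is written.
 * Returns 0 on success, -1 on error (errno is left as set by write). */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted before writing anything: retry */
            return -1;
        }
        buf += n;              /* short write: resume where it stopped */
        len -= (size_t)n;
    }
    return 0;
}
```

For 5000000 down to 1 the output is 38888896 bytes, so the buffer needs to be slightly larger than that (one spare byte for snprintf's terminating NUL); the program body is then `write_all(STDOUT_FILENO, buf, used)`.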