Question

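I want to print the value in %RCX directly to the console, say an ASCII value. I have searched through some books and tutorials, but they all use a buffer to pass anything. Is it possible to print something without creating a special buffer for that purpose?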

Say I am at this point (all the existing answers are too complicated for me and use a different syntax):


Console output:


For example, to print a buffer I use this code:

movq $5, %rax
...???(print %rax)

No C code or other assembly dialects allowed!

Answer 1

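To print a register (as hex or as a decimal number), the output routine (writing to stdout, stderr, etc.) expects ASCII characters. Writing the register as-is makes the routine display whatever characters the register's bytes happen to encode. You might get lucky if every byte in the register falls in the printable range.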
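The difference between a raw value and its ASCII character can be sketched in C, used here only as a compact model of the idea and not as a replacement for assembly. The names digit_to_ascii and put_byte are hypothetical. Note that write() takes a pointer, so even a single byte has to live in memory for a moment; here that memory is just the function's own stack slot.

```c
#include <unistd.h>

/* Map a single digit value 0..9 to its ASCII character ('0' is 0x30). */
char digit_to_ascii(unsigned digit)
{
    return (char)('0' + digit);
}

/* Write one byte to stdout. The parameter c lives in this function's
   stack frame (or is spilled there), and its address is what write() sees. */
ssize_t put_byte(char c)
{
    return write(STDOUT_FILENO, &c, 1);
}
```

In assembly the analogous trick would be to push the register and pass the stack pointer as the buffer address, so no separately declared buffer is needed.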

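You need a routine that converts the value to decimal or hexadecimal text. Here is an example that converts a 64-bit register to its hex representation (Intel syntax, NASM):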

section .rodata

hex_xlat:        db "0123456789abcdef"

section .text

; Called with RDI is the register to convert and
; RSI for the buffer to fill
; 
register_to_hex:
    push    rsi                 ; Save for return

    xor     eax,eax
    mov     ecx, 16             ; looper
    lea     rdx, [rel hex_xlat]  ; position-independent code can't index a static array directly

ALIGN 16
.loop:
    rol     rdi, 4              ; dil now has high bit nibble
    mov     al, dil             ; capture low nibble
    and     al, 0x0f
    mov     al, byte [rdx+rax]  ; look up the ASCII encoding for the hex digit
                                 ; rax is an 'index' with range 0x0 - 0xf.
                                 ; The upper bytes of rax are still zero from xor
    mov     byte [rsi], al      ; store in print buffer
    inc     rsi                 ; position next pointer
    dec     ecx
    jnz    .loop

.exit:
    pop     rax                 ; Get original buffer pointer
    ret
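For readers who want to check the logic of the loop above, here is a small C model of the same rotate, mask, and look-up steps. It is only a sketch for verification; register_to_hex_model is a hypothetical name, and like the assembly routine it does not NUL-terminate the buffer.

```c
#include <stdint.h>

/* Same table as hex_xlat in the assembly. */
static const char hex_xlat_model[] = "0123456789abcdef";

/* Fill buf[0..15] with the hex digits of value, most significant first. */
char *register_to_hex_model(uint64_t value, char *buf)
{
    for (int i = 0; i < 16; i++) {
        value = (value << 4) | (value >> 60);      /* rol value, 4 */
        buf[i] = hex_xlat_model[value & 0x0f];     /* low nibble indexes the table */
    }
    return buf;   /* like the asm, return the original buffer pointer */
}
```

After 16 rotations by 4 bits the value is back to its original state, which is why the assembly version can rotate RDI in place.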

Answer 2

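This answer is an addendum to the answer given by Frank and uses the conversion mechanism from there.

You mention the register %RCX in your question. This suggests that you are looking at 64-bit code and that your environment is likely GCC/GAS (GNU Assembler), since % is usually the AT&T-style prefix for registers.

With that in mind, I've created a quick and dirty macro that can be used inline anywhere you need to print a 64-bit register, a 64-bit memory operand, or a 32-bit immediate value in GNU assembly. This version is a proof of concept and could be amended to support 64-bit immediates. All registers that are used are preserved, and the code also accounts for the red zone of the Linux 64-bit System V ABI.

The code below is commented to point out what is occurring at each step.

printmac.inc: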

.macro memreg_to_hex src            # Macro takes one input
                                    #  src = memory operand, register,
                                    #        or 32 bit constant to print

    # Define the translation table only once for the current object
    .ifndef MEMREG_TO_HEX_NOT_FIRST
        .set MEMREG_TO_HEX_NOT_FIRST, 1
        .PushSection .rodata
            hex_xlat: .ascii "0123456789abcdef"
        .PopSection
    .endif

    add    $-128,%rsp               # Avoid 128 byte red zone
    push   %rsi                     # Save all registers that will be used
    push   %rdi
    push   %rdx
    push   %rcx
    push   %rbx
    push   %rax
    push   %r11                     # R11 is destroyed by SYSCALL

    mov  \src, %rdi                 # Move src value to RDI for processing

    # Output buffer on stack at RSP-16 to RSP-1
    lea    -16(%rsp),%rsi           # RSI = output buffer on stack
    lea    hex_xlat(%rip), %rdx     # RDX = translation buffer address
    xor    %eax,%eax                # RAX = Index into translation array
    mov    $16,%ecx                 # 16 nibbles to print

.align 16
1:
    rol    $4,%rdi                  # rotate high nibble to low nibble
    mov    %dil,%al                 # dil now has previous high nibble
    and    $0xf,%al                 # mask off all but low nibble
    mov    (%rdx,%rax,1),%al        # Lookup in translation table
    mov    %al,(%rsi)               # Store in output buffer
    inc    %rsi                     # Update output buffer address
    dec    %ecx
    jne    1b                       # Loop until counter is 0

    mov    $1,%eax                  # Syscall 1 = sys_write
    mov    %eax,%edi                # EDI = 1 = STDOUT
    mov    $16,%edx                 # EDX = Number of chars to print
    sub    %rdx,%rsi                # RSI = beginning of output buffer
    syscall

    pop    %r11                     # Restore all registers used
    pop    %rax
    pop    %rbx
    pop    %rcx
    pop    %rdx
    pop    %rdi
    pop    %rsi
    sub    $-128,%rsp               # Restore stack
.endm

printtest.s:

.include "printmac.inc"

.global main
.text
main:
    mov $0x123456789abcdef,%rcx
    memreg_to_hex %rcx               # Print the 64-bit value 0x123456789abcdef
    memreg_to_hex %rsp               # Print address containing ret pointer
    memreg_to_hex (%rsp)             # Print return pointer
    memreg_to_hex $0x402             # Doesn't support 64-bit immediates
                                     #  but can print anything that fits a DWORD
    retq

This can be assembled and linked with:

gcc -m64 printtest.s -o printtest

The macro doesn't print an end-of-line character, so the output of the test program looks like this:

0123456789abcdef00007fff5283d74000007f5c4a080a500000000000000402

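The memory addresses will be different.

Since the macro is inline, the entire code is emitted each time it is invoked, which is inefficient in code space. The bulk of the code could be moved to an object file that is included at link time; a stub macro could then wrap a CALL to the main print function.

The code doesn't use printf because at some point I thought I saw a comment that you couldn't use the C library. If that's not the case, this can be simplified greatly by calling printf to format the output as a 64-bit hexadecimal value.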
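If the C library is allowed, the formatting step might look like the following sketch. format_hex64 is a hypothetical helper, not part of the macro above; it uses snprintf with the PRIx64 macro from <inttypes.h> so the format matches uint64_t on any platform.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format value as 16 zero-padded lowercase hex digits plus a NUL.
   buf must hold at least 17 bytes. Returns the number of characters
   that were (or would have been) written, excluding the NUL. */
int format_hex64(char *buf, size_t size, uint64_t value)
{
    return snprintf(buf, size, "%016" PRIx64, value);
}
```

From assembly, the equivalent would be a call to printf with the format string in RDI, the value in RSI, and EAX zeroed, as the test main() in Answer 3 does.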

Answer 3

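Just for fun, here are a couple of other sequences for storing a hex string from a register. Printing the buffer is not the interesting part, IMO; copy that part from Michael's excellent answer if needed.

I tested some of these. I've included a main that calls one of these functions and then uses printf("%s\n%lx\n", result, test_value); to make it easy to spot problems.

Test main():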

extern printf

global main
main:
        push    rbx
        mov     rdi, 0x1230ff56dcba9911
        mov     rbx, rdi

        sub     rsp, 32
        mov     rsi, rsp
        mov     byte [rsi+16], 0
        call register_to_hex_ssse3

        mov     rdx, rbx
        mov     edi, fmt
        mov     rsi, rsp
        xor     eax,eax
        call    printf

        add     rsp, 32
        pop     rbx
        ret

section .rodata
fmt:    db `%s\n%lx\n`,  0      ; YASM doesn't support `string with escapes`, so this only assembles with NASM.
    ;  NASM needs 
    ; %use smartalign
    ; ALIGNMODE p6, 32
    ; or similar, to stop it using braindead repeated single-byte NOPs for ALIGN

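SSSE3 pshufb for the LUT

This version doesn't need a loop, but its code size is much larger than the rotate-loop versions because SSE instructions are longer.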

section .rodata
ALIGN 16
hex_digits:
hex_xlat:        db "0123456789abcdef"

section .text

    ;; rdi = val  rsi = buffer
ALIGN 16
global register_to_hex_ssse3
register_to_hex_ssse3:       ;;;; 0x39 bytes of code
    ;; use PSHUFB to do 16 nibble->ASCII LUT lookups in parallel
    movaps  xmm5, [rel hex_digits]
    ;; x86 is little-endian, but we want the hex digit for the high nibble to be the first character in the string
    ;; so reverse the bytes, and later unpack nibbles like [ LO HI ... LO HI ]
    bswap   rdi
    movq    xmm1, rdi

    ;; generate a constant on the fly, rather than loading
    ;; this is a bit silly: we already load the LUT, so we might as well load another 16B
    ;; from the same cache line as a memory operand for PAND, since we only use it once
    pcmpeqw xmm4,xmm4
    psrlw   xmm4, 12
    packuswb xmm4,xmm4  ; [ 0x0f 0x0f 0x0f ... ] mask for low-nibble of each byte

    movdqa  xmm0, xmm1  ; xmm0 = low  nibbles at the bottom of each byte
    psrlw   xmm1, 4     ; xmm1 = high nibbles at the bottom of each byte (with garbage from next byte)
    punpcklbw xmm1, xmm0    ; unpacked nibbles (with garbage in the high 4b of some bytes)

    pand    xmm1, xmm4  ; mask off the garbage bits because pshufb reacts to the MSB of each element.
                        ; Delaying until after interleaving the hi and lo nibbles means we only need one pand
    pshufb  xmm5, xmm1  ; xmm5 = the hex digit for the corresponding nibble in xmm1
    movups  [rsi], xmm5
    ret
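The data flow of the SSSE3 routine can be modelled in scalar C to make the byte ordering easier to follow. This is only an illustrative sketch with a hypothetical name (hex_by_unpack_model); it mirrors the bswap, interleave, mask, and table look-up steps one element at a time instead of using real vector instructions.

```c
#include <stdint.h>

/* Scalar model of the bswap / punpcklbw / pand / pshufb sequence. */
void hex_by_unpack_model(uint64_t value, char out[16])
{
    static const char digits[] = "0123456789abcdef";
    uint8_t bytes[8], idx[16];

    /* bswap + movq: most-significant byte ends up in element 0 */
    for (int i = 0; i < 8; i++)
        bytes[i] = (uint8_t)(value >> (56 - 8 * i));

    /* psrlw 4 + punpcklbw: interleave [hi0, lo0, hi1, lo1, ...].
       The "lo" elements still carry their high nibble as garbage.
       (In the asm the shifted copy also picks up bits from the
       neighbouring byte, because psrlw works on 16-bit words.) */
    for (int i = 0; i < 8; i++) {
        idx[2 * i]     = (uint8_t)(bytes[i] >> 4);
        idx[2 * i + 1] = bytes[i];
    }

    /* pand with 0x0f, then pshufb as a 16-entry table look-up */
    for (int i = 0; i < 16; i++)
        out[i] = digits[idx[i] & 0x0f];
}
```

The single mask after interleaving corresponds to the one pand in the assembly: both the shifted and unshifted copies are cleaned up at once.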

AVX2: you can do two integers at once, with something like this (untested sketch):

int64x2_to_hex_avx2:    ; (char buf[32], uint64_t first, uint64_t second)
    bswap      rsi                  ; We could replace the two bswaps with one 256b vpshufb, but that would require a mask
    vmovq      xmm1, rsi
    bswap      rdx
    vpinsrq    xmm1, xmm1, rdx, 1
    vpmovzxbw  ymm1, xmm1           ; upper lane = rdx, lower lane = rsi, with each byte zero-extended to a word element (0x00HL)
    vpsllw     ymm2, ymm1, 12       ; shift the high nibbles out, leaving the low nibbles at the top of each word (0xL000)
    vpsrlw     ymm2, ymm2, 4        ; low  nibbles in byte elems 1, 3, 5, ... (0x0L00)
    vpsrlw     ymm1, ymm1, 4        ; high nibbles in byte elems 0, 2, 4, ... (0x000H)
    vpor       ymm1, ymm1, ymm2     ; merged indices: 0x0L0H in every word
    vbroadcasti128 ymm0, [rel hex_digits]   ; LUT in both lanes, because vpshufb is in-lane
    vpshufb    ymm0, ymm0, ymm1
    vmovdqu    [rdi], ymm0
    vzeroupper
    ret

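Using pmovzx and shifts to avoid pand is a win compared with generating the constant on the fly, I think, but probably not otherwise. It requires 2 extra shifts and a por. It's an option for the 16B non-AVX version, but it needs SSE4.1.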

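Optimized for code size (fits in 32 (0x20) bytes)

(Derived from Frank's loop)

Using cmov instead of the LUT to handle 0-9 vs. a-f might take fewer than 16B of extra code size. That might be fun: edits welcome.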
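The table-free idea can be sketched in C as follows. This is an assumption about how a cmov-based variant might be structured, not code from the answer: both candidate characters are computed, and a conditional select (which a compiler may turn into cmov) picks one. The names nibble_to_hex and reg_to_hex_nolut are hypothetical.

```c
#include <stdint.h>

/* Map a nibble 0..15 to '0'..'9' or 'a'..'f' without a look-up table. */
static inline char nibble_to_hex(unsigned nibble)
{
    char digit = (char)('0' + nibble);        /* correct for 0..9   */
    char alpha = (char)('a' - 10 + nibble);   /* correct for 10..15 */
    return nibble > 9 ? alpha : digit;        /* select, no table access */
}

/* Fill buf[0..15] with the hex digits of val, most significant first. */
char *reg_to_hex_nolut(char buf[16], uint64_t val)
{
    for (int i = 15; i >= 0; i--) {
        buf[i] = nibble_to_hex((unsigned)(val & 0x0f));
        val >>= 4;
    }
    return buf;
}
```

Whether this ends up smaller than the 16-byte table plus the lea depends on the encoding of the compare and select sequence, so it would need to be measured in the assembled output.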

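Ways to get the nibble from the bottom of rsi into an otherwise-zeroed rax include:

mov al, sil (3B, because sil needs a REX prefix) / and al, 0x0f (2B special encoding for and al, imm8).
mov eax, esi (2B) / and eax, 0x0f (3B): same size, and it doesn't need an xor beforehand to zero the upper bytes of rax.

It would be smaller if the args were reversed, so that the dest buffer was already in rdi. stosb is a tiny instruction (but slower than mov [rdi], al / inc rdi), so it actually saved bytes overall to use xchg rdi, rsi to set up for it. Changing the function signature could save 5 bytes: void reg_to_hex(char buf[16], uint64_t val) would save two bytes from not having to return buf in rax, and 3 bytes from dropping the xchg. The caller will probably use 16B of stack anyway, and having the caller do mov rdx, rsp instead of mov rdx, rax before calling another function or syscall on the buffer doesn't cost anything.

The next function is probably going to ALIGN 16, though, so shrinking the function to even smaller than 32B isn't as useful as getting it to fit in half a cache line.

Absolute addressing for the LUT (hex_xlat) would save a few bytes (use mov al, byte [hex_xlat + rax] instead of needing the lea).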

global register_to_hex_size
register_to_hex_size:
    push    rsi             ; pushing/popping return value (instead of  mov rax, rsi) frees up rax for stosb
    xchg    rdi, rsi        ; allows stosb.  Better: remove this and change the function signature
    mov     cl, 16          ; 3B shorter than mov ecx, 16
    lea     rdx,  [rel hex_xlat]

;ALIGN 16
.loop:
    rol     rsi, 4
    mov     eax, esi          ; mov al, sil  to allow 2B AND AL,0xf  requires a 2B xor eax,eax
    and     eax, 0x0f
    mov     al, byte [rdx+rax]
    stosb
      ;; loop .loop  ; setting up ecx instead of cl takes more bytes than loop saves
    dec     cl
    jne    .loop
    pop     rax              ; get the return value back off the stack
    ret

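Using xlat costs 2B (to save/restore rbx) but saves 3B, for a net saving of 1B. It's a 3-uop instruction with 7c latency and a throughput of one per 2c (Intel Skylake). Latency and throughput aren't a problem here, since each iteration is a separate dependency chain, and there's too much overhead for this to run at one clock per iteration anyway. So the main problem is that it's 3 uops, which makes it less uop-cache-friendly. With xlat, the loop becomes 10 uops instead of 8 (using stosb), so that's bad.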

 112:   89 f0                   mov    eax,esi
 114:   24 0f                   and    al,0xf
 116:   d7                      xlat   BYTE PTR ds:[rbx]
 117:   aa                      stos   BYTE PTR es:[rdi],al

vs.

  f1:   89 f0                   mov    eax,esi
  f3:   83 e0 0f                and    eax,0xf
  f6:   8a 04 02                mov    al,BYTE PTR [rdx+rax*1]
  f9:   aa                      stos   BYTE PTR es:[rdi],al

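Interestingly, this still has no partial-register stalls, because we never read a wide register after writing only part of it. mov eax, esi is write-only, so it cleans up the partial-register state left by the load into al. So there would be no advantage to using movzx eax, byte [rdx+rax]. Even when we return to the caller, the pop rax doesn't leave the caller susceptible to partial-register problems.

(If we didn't bother returning the input pointer in rax, the caller could have a problem, except that in that case it shouldn't be reading rax at all. Usually it only matters if you call with call-preserved registers in a partial-register state, because the called function might push them. Or, more obviously, with arg-passing / return-value registers.)

Efficient version (uop-cache friendly)

Looping backwards didn't turn out to save any instructions or bytes, but I've included this version because it's more different from the version in Frank's answer.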

ALIGN 16
global register_to_hex_countdown
register_to_hex_countdown:
;;; work backwards in the buffer, starting with the least-significant nibble as the last char
    mov     rax, rsi             ; return value, and loop bound
    add     rsi, 15              ; last char of the buffer
    lea     rcx,  [rel hex_xlat] ; position-independent code

ALIGN 16
.loop:
    mov     edx, edi
    and     edx, 0x0f            ; isolate low nibble
    mov     dl, byte [rcx+rdx]   ; look up the ascii encoding for the hex digit
                                  ; rdx is an 'index' with range 0x0 - 0xf
                         ; non-PIC version:    mov     dl, [hex_digits + rdx]
    mov     byte [rsi], dl
    shr     rdi, 4
    dec     rsi
    cmp     rsi, rax
    jae    .loop                 ; rsi counts backwards down to its initial value

    ret

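The whole thing is only 12 insns (11 uops with macro-fusion, or 12 including the NOP for alignment). Some CPUs can fuse cmp/jcc but not dec/jcc (e.g. AMD, and Nehalem).

Another option for looping backwards was mov ecx, 15 and storing with mov [rsi+rcx], dl, but two-register addressing modes can't micro-fuse. Still, that would only bring the loop up to 8 uops, so it would be fine.

Instead of always storing 16 digits, this version can use rdi becoming zero as the loop condition, to avoid printing leading zeros, i.e.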

    add     rsi, 16
    ...
.loop:
    ...
    dec     rsi
    mov     byte [rsi], dl
    shr     rdi, 4
    jnz    .loop
        ; lea rax,  [rsi+1]    ; correction not needed because of adjustments to how rsi is managed
    mov     rax, rsi
    ret

Printing from rax to the end of the buffer gives just the significant digits of the integer.
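A C model of this variant makes the returned pointer easier to picture. It is a sketch under the same assumptions as the assembly (a 16-byte buffer filled from the end); hex_no_leading_zeros is a hypothetical name.

```c
#include <stdint.h>

/* Fill buf from the end and return a pointer to the first significant
   digit. The digits occupy [returned pointer, buf + 16). */
char *hex_no_leading_zeros(uint64_t value, char buf[16])
{
    static const char digits[] = "0123456789abcdef";
    char *p = buf + 16;               /* one past the last digit: add rsi, 16 */
    do {
        *--p = digits[value & 0x0f];  /* dec rsi / mov [rsi], dl */
        value >>= 4;                  /* shr rdi, 4 sets ZF when nothing is left */
    } while (value != 0);             /* jnz .loop */
    return p;
}
```

Because the loop body runs before the test, a value of 0 still produces the single digit "0". The length to pass to a write call would be (buf + 16) minus the returned pointer.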

Printing a register value to the console
