Question

I wrote a program to detect palindromes, but it prints "is not pallindrome" for every input. My program is as follows:

section .data
    a db "mommom",0
    b equ $-a

    msg1 db "is pallindrome",10,0
    msg2 db "is not pallindrome",10,0
    msg3 db "",10,0
section .text
    global main
    extern printf
main:
    nop
    xor eax,eax
    xor ebx,ebx
    mov eax,a       ;starting add
    mov ebx,b
    add eax,ebx
    dec eax         ;will use to indicate the last letter of a

    xor ebx,ebx
    xor edx,edx
    xor ecx,ecx

start:
    inc ecx
    cmp ecx,(b/2)       ;check will run for half of the word
    jle check
    jmp pal
check:  
    mov dl,byte[eax]    ;last letter
    cmp byte[a+ebx],dl  ;frst letter compares with last letter
debug:
    pusha           ;debugging purpose.Used to catch the first letter of a
    push byte[a+ebx]
    push msg3
    call printf
    add esp,8
    popa
checkContinue:
    inc ebx         ;use for check segment
    dec eax
    je start
    jne nonPal
pal:
    pusha
    push msg1
    call printf
    add esp,4
    popa
    jmp done
nonPal:
    pusha
    push msg2
    call printf
    add esp,4
    popa
    jmp done
done:
     nop

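Antoine Mathys has already given us a corrected version of the code above and pointed out the mistakes in it. His comments are very valuable to beginners like us. In the program above I also tried to print each character that the ebx register indexes, but I could not get it to work. I would be grateful if anyone could help with that part of the problem. It would help me learn how to fetch each character from a string.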

Answer 1

Here it is:

        BITS 32
        section .data
        string db "mommom"
        length equ $ - string

        msg1 db "is pallindrome",0
        msg2 db "is not pallindrome",0

        section .text
        global main
        extern puts

main:
        mov ebx, string                   ; start of word                                 
        mov eax, (string + length - 1)    ; end of word                                   

        mov ecx, (length / 2)             ; check will run for half of the word           
check:
        mov dl, [ebx]                     ; compare first and last letters                
        cmp [eax], dl
        jne failure
        inc ebx
        dec eax
        loop check

        ;; success                                                                  
        push msg1
        call puts
        add esp,4
        jmp done

failure:
        push msg2
        call puts
        add esp,4

done:
        ret

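Some comments:

Do not append a zero to the word being tested. With a db "mommom",0 the length $-a counts the terminator, so the last byte compared is the zero and the word can never test as a palindrome.
Your code is full of unnecessary instructions (pusha / popa, nop, clearing edx but only using dl, ...). Try to keep the code as simple as possible.
For clarity, do not use the xor trick.
Use meaningful symbol names.
Take advantage of assembler arithmetic expressions.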
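
As a rough cross-check of the loop in the listing above, here is a minimal C sketch of the same first/last pointer comparison. The function name is_palindrome_bytes and the guard for lengths below 2 are assumptions added for illustration and do not appear in the answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: C rendering of the first/last pointer loop.
   len must NOT count a terminating zero byte. */
bool is_palindrome_bytes(const char *str, size_t len)
{
    assert(str != NULL);
    if (len < 2)
        return true;                    /* empty and 1-char strings are palindromes */

    const char *first = str;            /* plays the role of ebx: start of word */
    const char *last  = str + len - 1;  /* plays the role of eax: end of word   */

    for (size_t i = len / 2; i > 0; i--) {  /* plays the role of ecx = length / 2 */
        if (*first != *last)
            return false;
        first++;
        last--;
    }
    return true;
}
```

Calling it with a length that includes the terminator, for example is_palindrome_bytes("mommom", 7), compares the first letter with the zero byte and reports a mismatch, which is the failure described in the first comment.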

Answer 2

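I keep seeing these palindrome questions with clunky beginner code, which got me interested in writing versions that are actually efficient. My byte-at-a-time loop should run at one iteration per clock on Intel Haswell, but slower on earlier Intel CPUs (the loop will be more than 4 fused-domain uops there, because macro-fusion of back-to-back cmp/jcc pairs is limited).

Also see below for a way to make it case-insensitive.

Checking more than one byte at a time works, even with overlap at the crossover in the middle: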

 abcdcba
 abcd
    dcba    ; overlap by one
  bcdc
   cdcb     ; overlap by two

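Load a chunk of bytes and reverse its order with bswap, Silvermont / Haswell movbe, SSSE3 pshufb, or even rol ax,8 / rol eax,16 / rol ax,8. Then compare it with the same number of bytes from the place where it should match.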
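
The load, reverse and compare step can be sketched in C as follows. This is only an illustration under stated assumptions: the names reverse_bytes64 and is_palindrome_chunks are hypothetical, the byte reversal is written with portable shifts and masks rather than a bswap instruction, and inputs shorter than 8 bytes fall back to a plain byte loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable byte reversal of a 64-bit value (stand-in for bswap). */
static uint64_t reverse_bytes64(uint64_t x)
{
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    return (x << 32) | (x >> 32);
}

/* Hypothetical name: compare 8 reversed front bytes with 8 back bytes. */
bool is_palindrome_chunks(const char *str, size_t len)
{
    assert(str != NULL);
    if (len < 8) {                       /* small inputs: one byte at a time */
        for (size_t i = 0; i < len / 2; i++)
            if (str[i] != str[len - 1 - i])
                return false;
        return true;
    }

    size_t front = 0;                    /* start of the next front chunk   */
    size_t back  = len;                  /* one past the next back chunk    */
    do {
        uint64_t head, tail;
        memcpy(&head, str + front, 8);   /* unaligned-safe 8-byte load      */
        front += 8;
        back  -= 8;
        memcpy(&tail, str + back, 8);
        if (reverse_bytes64(head) != tail)
            return false;
    } while (front < back);              /* chunks may overlap when they cross */
    return true;
}
```

Because both chunks are loaded the same way, the comparison does not depend on the machine's byte order. The last iteration may re-check bytes near the middle, which is the overlap shown in the diagram.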

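Add results printing however you like. I just return the exit status.

All of this works the same way in 32-bit code, but I used 64-bit because the register calling convention avoids cluttering the code with stack manipulation.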

Note the use of local labels (.label) to avoid conflicts between similar blocks of code that use the same label names.

    DEFAULT rel
    section .text

    ALIGN 16
    global check_palindrome            ; AMD64 SysV calling convention
    check_palindrome:                  ; (size_t len /*rdi*/, const char *str /*rsi*/)
    ;;returns bool

        cmp     rdi, 8
        jb      check_palindrome_byte_at_a_time  ; tailcall the version that handles small inputs

        add     rdi, rsi      ; rdi = end pointer
        ;ALIGN 16             ; probably not worth it, depending on the CPU
    .palin_loop:
        lodsq                 ; rax = [rsi], rsi+=8.  lodsd/q is only 2 uops on Haswell
        bswap   rax
        sub     rdi, 8
        cmp     rax, [rdi]
        jne     .not_palin
        cmp     rsi, rdi
        jb      .palin_loop   ; stop looping when the pointers cross
        ;; Loop has 7 uops on Haswell, so it can sustain one iteration (8 bytes in each direction) per 2 cycles.
        ;; 9 uops on SnB, where lodsq is 3 uops, and only one of cmp/jcc pairs can macro-fuse.  For SnB, use mov / add instead of lodsq
        ;; with unrolling, we might get this down closer to 2x8B per clock, instead of per 2 clocks.
        ;; or with SSE/AVX, 2x16B per clock.  Probably not 2x32B per clock with AVX, due to needing two shuffles. (no cross-lane byte shuffle until AVX512)

        ; input was a palindrome
        mov     eax, 1        ; return true
        ret
    .not_palin:
        xor     eax,eax       ; return false
        ret

    ALIGN 16
    ;; helper function with the same signature as the main version
    ; only needed for small strings, not for unaligned, or not-multiple-of-8
    ; assume that rdi < 2^32 so we can use edi interchangeably, for smaller code-size.
    ; If our prototype was (unsigned int, const char*), we'd have to mov ecx, edi or something.
    ; (moving to a different reg is preferable to mov edi,edi because mov-elimination never works on mov same,same)
    check_palindrome_byte_at_a_time:
        test    edi,edi               ; or cmp edi, 1  since every 1-char string is also a palindrome
        jz      .is_palin             ; the empty string is a palindrome

        ;ALIGN 16
    .palin_loop:
        mov     al, [rsi + rdi - 1]   ; 2-register addresses can't micro-fuse on SnB-family, but this is a pure load that doesn't need to micro-fuse with anything.
        cmp     al, [rsi]
        jne     check_palindrome.not_palin
        inc     rsi                   ; pointer moves forward
        sub     edi, 2                ; index counts down towards zero twice as fast
        ja      .palin_loop           ; treat counter as unsigned.  Not jae, because a byte is always equal to itself.
        ;;; Haswell and later can fuse both branches even when they hit the decoders in the same cycle
        ;;; so this loop should hopefully be 4 fused-domain uops and run at one iteration per clock

    .is_palin:
        mov     eax, 1                ; return true
        ret
    ; .not_palin:                     ; shared code with other check_palindrome version

    global main
    extern strlen
    ALIGN 16
    main:
        ;; return !check_palindrome(strlen(argv[1]), argv[1])
        ;mov    edi, inputstr_len
        ;mov    esi, inputstr
        mov     rdi, [rsi+8]  ; argv[1]
        push    rdi           ; save it, and align the stack
        call    strlen
        mov     rdi, rax
        mov     rsi, [rsp]    ; could pop here and take advantage of the fact that we know check_palindrome doesn't care about the stack
        call    check_palindrome
        pop     rdi           ; shorter than add rsp,8  and faster on Intel CPUs with a stack engine, given the surrounding instructions
        xor     eax, 1        ; we know check_palin returns a _Bool, so flip it
        ret

    ;;; Unused, but improved in case you do want to use it
    section .rodata
    inputstr db "mommom"
    inputstr_len equ $-inputstr
    msg_palin db "is palindrome",10,0
    msg_nopalin db "is not palindrome"  ; newline and terminating zero are on the next line, with their own label
    msg_newline db 10,0

(Inspiration for how to write some of main() came from gcc output on godbolt. Compiler output is often a good starting point.)

Tested and works:

    $ yasm -Worphan-labels -gdwarf2 -felf64 palindrome.asm && gcc palindrome.o -o palindrome
    $ ./palindrome '' && echo "true" || echo "false"
    true
    $ ./palindrome 1 && echo "true" || echo false
    true
    $ ./palindrome abccba && echo "true" || echo "false"
    true   # even size: pair of middle chars
    $ ./palindrome abcdcba && echo "true" || echo "false"
    true   # odd size: single middle char
    $ ./palindrome abcdeba && echo "true" || echo "false"
    false
    $ ./palindrome 'ab bcdcb 1234 bcdcb baab bcdcb 4321 bcdcb ba' && echo "true" || echo "false"
    true
    $ ./palindrome 'ab bcdcb 1234 bcdcb bab bcdcb 4321 bcdcb ba' && echo "true" || echo "false"
    true
    $ ./palindrome 'ab bcdcb 1234 bcdcb baab bcdcb 4321 bcdcb baa' && echo "true" || echo "false"
    false
    $ ./palindrome && echo "true" || echo "false"
    Segmentation fault (core dumped)    # main doesn't check argc

Case-insensitive:

As well as byte-reversing one chunk, force the alphabetic characters of both chunks to lowercase before comparing. With ASCII, this is easy. The answer I linked even has an SSE vector implementation of case-flipping, which can be modified to force all alphabetic characters to the same case (either replace the final pxor with por, or use a blend with the compare result as the control, instead of pandn / por).
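
A minimal scalar C sketch of the lowercase-forcing idea, assuming plain ASCII input. The names ascii_lower and is_palindrome_nocase are hypothetical, and this works one byte at a time rather than on reversed chunks or SSE vectors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Force ASCII letters to lowercase; leave every other byte unchanged. */
static unsigned char ascii_lower(unsigned char c)
{
    unsigned char folded = (unsigned char)(c | 0x20);   /* set the case bit */
    return (folded >= 'a' && folded <= 'z') ? folded : c;
}

/* Hypothetical name: case-insensitive byte-at-a-time palindrome check. */
bool is_palindrome_nocase(const char *str, size_t len)
{
    assert(str != NULL);
    for (size_t i = 0; i < len / 2; i++) {
        unsigned char a = ascii_lower((unsigned char)str[i]);
        unsigned char b = ascii_lower((unsigned char)str[len - 1 - i]);
        if (a != b)
            return false;
    }
    return true;
}
```

The range check matters: setting bit 0x20 unconditionally would make non-letters such as '@' and '`' compare equal.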

Unrolling:

As well as using displacements ([rsi+8], [rdi-8], etc.) to save add/sub instructions, you can also reduce the number of branches by ORing or ANDing some compare results together. This may or may not save anything. On a CPU that does not suffer from partial-register merging slowdowns at all (e.g. AMD), this might be a win:

    .palin_loop:
        xor     eax,eax          ; inside the loop to break the false dependency
        xor     edx,edx
        movbe   ... / cmp ...    ; movbe is a load and endian swap in one insn.  On Haswell it just saves code-size, not uops.  On Silvermont it's a win
        setne   dl
        movbe   rcx, [rsi+rdi-16] / cmp [rsi+8], rcx
        setne   dh
        shl     edx, 16          ; DH merging takes a cycle on Haswell
        ... repeat setting dl/dh so edx holds 4 results from four 8B checks
        movbe   ... / cmp ...
        setne   al
        movbe   ... / cmp ...
        setne   ah
        shl     eax, 16
        ... repeat setting al/ah
        or      eax, edx         ; or unroll twice as much, and use rax,rdx
        jnz     .not_palin

With SSE, where the compare results are not already in integer registers via setcc but need a pmovmskb to get there, ANDing results together first is an even bigger win.
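
The branch-reduction idea can be sketched in C by accumulating several comparison results and testing them once. This is a hedged illustration, not the answer's code: the name is_palindrome_unrolled is hypothetical, and it ORs together XOR differences of single bytes instead of using setne on 8-byte compares.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: four byte pairs per iteration, one branch for all four. */
bool is_palindrome_unrolled(const char *str, size_t len)
{
    assert(str != NULL);
    const unsigned char *s = (const unsigned char *)str;
    size_t i = 0;                 /* front index                 */
    size_t j = len;               /* one past the back index     */

    while (j - i >= 8) {          /* at least four pairs remain  */
        unsigned diff = 0;
        diff |= (unsigned)(s[i]     ^ s[j - 1]);
        diff |= (unsigned)(s[i + 1] ^ s[j - 2]);
        diff |= (unsigned)(s[i + 2] ^ s[j - 3]);
        diff |= (unsigned)(s[i + 3] ^ s[j - 4]);
        if (diff != 0)            /* nonzero means some pair differed */
            return false;
        i += 4;
        j -= 4;
    }
    while (i + 1 < j) {           /* leftover pairs, one at a time */
        if (s[i] != s[j - 1])
            return false;
        i++;
        j--;
    }
    return true;
}
```

Whether this beats a branch per comparison depends on the CPU and on how early mismatches tend to occur, in line with the caveat that unrolling may or may not save anything.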

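AVX:

The loop structure of the byte-at-a-time version looks ideal, with pshufb used to reverse the order of a whole vector. pmovmskb followed by a test of the bitmask is a pretty standard idiom for handling vector compare results. It is faster than PTEST, because PTEST is 2 uops and cannot macro-fuse.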
