I wrote a program that checks for palindromes, but it prints "is not pallindrome" for every input. My program is as follows:
section .data
a db "mommom",0
b equ $-a
msg1 db "is pallindrome",10,0
msg2 db "is not pallindrome",10,0
msg3 db "",10,0
section .text
global main
extern printf
main:
nop
xor eax,eax
xor ebx,ebx
mov eax,a ;starting add
mov ebx,b
add eax,ebx
dec eax ;will use to indicate the last letter of a
xor ebx,ebx
xor edx,edx
xor ecx,ecx
start:
inc ecx
cmp ecx,(b/2) ;check will run for half of the word
jle check
jmp pal
check:
mov dl,byte[eax] ;last letter
cmp byte[a+ebx],dl ;frst letter compares with last letter
debug:
pusha ;debugging purpose.Used to catch the first letter of a
push byte[a+ebx]
push msg3
call printf
add esp,8
popa
checkContinue:
inc ebx ;use for check segment
dec eax
je start
jne nonPal
pal:
pusha
push msg1
call printf
add esp,4
popa
jmp done
nonPal:
pusha
push msg2
call printf
add esp,4
popa
jmp done
done:
nop
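Antoine Mathys has already provided a corrected version of the code above and pointed out the errors in it. His comments are very valuable to novices like us. In the program above, I tried to print each character addressed through the ebx register, but I could not get it to work. I would appreciate it if any mentor could solve this part of the problem. It will help me learn how to take each character out of a string.
Answer 0: (score: 0)
Here it is: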
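For the debugging part of the question, it may help to see the intended step in C first. This is only a sketch under my own assumptions: the helper name print_each_char is hypothetical and does not appear in the original program. It shows the two things the debug block seems to be missing. x86 has no byte-sized push of a memory operand, so the byte is widened to a full int first (in assembly, something like movzx edx, byte [a+ebx] followed by push edx). And the format string needs a %c conversion, whereas msg3 above contains only a newline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from the original post): print every byte of a
 * length-delimited string on its own line and return how many were printed. */
size_t print_each_char(FILE *out, const char *s, size_t len)
{
    assert(out != NULL && s != NULL);
    size_t printed = 0;
    for (size_t i = 0; i < len; i++) {
        /* Widen the byte to a full int before passing it on, much like
         * zero-extending with movzx before pushing a 32-bit register. */
        int c = (unsigned char)s[i];
        /* The format string must contain %c; a format without any
         * conversion specifier never prints the extra argument. */
        if (fprintf(out, "%c\n", c) < 0)
            break;
        printed++;
    }
    return printed;
}
```

Calling print_each_char(stdout, "mommom", 6) writes the six letters one per line.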
BITS 32
section .data
string db "mommom"
length equ $ - string
msg1 db "is pallindrome",0
msg2 db "is not pallindrome",0
section .text
global main
extern puts
main:
mov ebx, string ; start of word
mov eax, (string + length - 1) ; end of word
mov ecx, (length / 2) ; check will run for half of the word
check:
mov dl, [ebx] ; compare first and last letters
cmp [eax], dl
jne failure
inc ebx
dec eax
loop check
;; success
push msg1
call puts
add esp,4
jmp done
failure:
push msg2
call puts
add esp,4
done:
ret
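For readers more comfortable with C, the loop above can be sketched roughly as follows. This is my own illustration, not part of the answer: the function name is hypothetical, and the early return for lengths below 2 is an addition of mine, since the assembly version is written for the fixed six-letter string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical C counterpart of the check loop (the name is illustrative). */
bool is_palindrome_two_pointer(const char *s, size_t len)
{
    assert(s != NULL);
    if (len < 2)
        return true;                 /* empty and 1-char strings are palindromes */

    const char *front = s;           /* plays the role of ebx: start of word */
    const char *back = s + len - 1;  /* plays the role of eax: end of word */

    /* Only half of the word needs checking, like ecx = length / 2. */
    for (size_t remaining = len / 2; remaining != 0; remaining--) {
        if (*front != *back)
            return false;            /* corresponds to jne failure */
        front++;
        back--;
    }
    return true;
}
```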
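Some comments:
Answer 1: (score: 0)
I keep seeing these palindrome questions with clunky beginner code, which got me interested in writing a version that is actually efficient. My byte-at-a-time loop should run at one iteration per clock on Intel Haswell, but slower on earlier Intel CPUs (because the loop will be more than 4 fused-domain uops, due to limited macro-fusion of back-to-back cmp/jcc pairs).
See also below for a way to make it case-insensitive.
Checking more than one byte at a time works, even with overlap at the crossover in the middle: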
abcdcba
abcd
   dcba   ; overlap by one
 bcdc
  cdcb    ; overlap by two
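Load a chunk of bytes and reverse their order with bswap, movbe (Silvermont / Haswell), SSSE3 pshufb, or even rol ax,8 / rol eax,16 / rol ax,8. Then compare against the same number of bytes at the position where they should match.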
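In C terms, the chunked idea could be sketched like this. It is a sketch under my own assumptions, not the code from this answer: the function names are hypothetical, byte_reverse64 is a portable stand-in for bswap, and memcpy is used for the unaligned 8-byte loads. Strings shorter than 8 bytes fall back to a byte-at-a-time check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for bswap: reverse the order of the 8 bytes in x. */
static uint64_t byte_reverse64(uint64_t x)
{
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    return (x << 32) | (x >> 32);
}

/* Byte-at-a-time fallback for short inputs. */
static bool is_palindrome_bytes(const char *s, size_t len)
{
    for (size_t i = 0, j = len; i + 1 < j; i++, j--)
        if (s[i] != s[j - 1])
            return false;
    return true;
}

/* Hypothetical chunked check: compare 8 bytes from each end per iteration. */
bool is_palindrome_chunked(const char *s, size_t len)
{
    assert(s != NULL);
    if (len < 8)
        return is_palindrome_bytes(s, len);

    const char *front = s;
    const char *back = s + len;          /* one past the last byte */
    do {
        uint64_t head, tail;
        memcpy(&head, front, 8);         /* unaligned load from the front */
        front += 8;
        back -= 8;
        memcpy(&tail, back, 8);          /* unaligned load from the back */
        if (byte_reverse64(head) != tail)
            return false;
    } while (front < back);              /* stop once the pointers cross */
    return true;
}
```

When the two chunks overlap in the middle, some byte pairs are simply compared twice, which costs nothing in correctness.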
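Add whatever result printing you like. I just return an exit status.
All of this works the same way in 32-bit code, but I used 64-bit because the register calling convention avoids cluttering the code with stack manipulation.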
Note the use of local labels (.label) to avoid conflicts between similar blocks of code that use the same label names.
DEFAULT rel
section .text
ALIGN 16
global check_palindrome
; AMD64 SysV calling convention
check_palindrome: ; (size_t len /*rdi*/, const char *str /*rsi*/)
;;returns bool
cmp rdi, 8
jb check_palindrome_byte_at_a_time ; tailcall the version that handles small inputs
add rdi, rsi ; rdi = end pointer
;ALIGN 16 ; probably not worth it, depending on the CPU
.palin_loop:
lodsq ; rax = [rsi], rsi+=8. lodsd/q is only 2 uops on Haswell
bswap rax
sub rdi, 8
cmp rax, [rdi]
jne .not_palin
cmp rsi, rdi
jb .palin_loop ; stop looping when the pointers cross
;; Loop has 7 uops on Haswell, so it can sustain one iteration (8 bytes in each direction) per 2 cycles.
;; 9 uops on SnB, where lodsq is 3 uops, and only one of cmp/jcc pairs can macro-fuse. For SnB, use mov / add instead of lodsq
;; with unrolling, we might get this down closer to 2x8B per clock, instead of per 2 clocks.
;; or with SSE/AVX, 2x16B per clock. Probably not 2x32B per clock with AVX, due to needing two shuffles. (no cross-lane byte shuffle until AVX512)
; input was a palindrome
mov eax, 1 ; return true
ret
.not_palin:
xor eax,eax ; return false
ret
ALIGN 16
;; helper function with the same signature as the main version
; only needed for small strings, not for unaligned, or not-multiple-of-8
; assume that rdi < 2^32 so we can use edi interchangeably, for smaller code-size.
; If our prototype was (unsigned int, const char*), we'd have to mov ecx, edi or something.
; (moving to a different reg is preferable to mov edi,edi because mov-elimination never works on mov same,same)
check_palindrome_byte_at_a_time:
test edi,edi ; or cmp edi, 1 since every 1-char string is also a palindrome
jz .is_palin ; the empty string is a palindrome
;ALIGN 16
.palin_loop:
mov al, [rsi + rdi - 1] ; 2-register addresses can't micro-fuse on SnB-family, but this is a pure load that doesn't need to micro-fuse with anything.
cmp al, [rsi]
jne check_palindrome.not_palin
inc rsi ; pointer moves forward
sub edi, 2 ; index counts down towards zero twice as fast
ja .palin_loop ; treat counter as unsigned. Not jae, because a byte is always equal to itself.
;;; Haswell and later can fuse both branches even when they hit the decoders in the same cycle
;;; so this loop should hopefully be 4 fused-domain uops and run at one iteration per clock
.is_palin:
mov eax, 1 ; return true
ret
; .not_palin: ; shared code with other check_palindrome version
global main
extern strlen
ALIGN 16
main:
;; return !check_palindrome(strlen(argv[1]), argv[1])
;mov edi, inputstr_len
;mov esi, inputstr
mov rdi, [rsi+8] ; argv[1]
push rdi ; save it, and align the stack
call strlen
mov rdi, rax
mov rsi, [rsp] ; could pop here and take advantage of the fact that we know check_palindrome doesn't care about the stack
call check_palindrome
pop rdi ; shorter than add rsp,8 and faster on Intel CPUs with a stack engine, given the surrounding instructions
xor eax, 1 ; we know check_palin returns a _Bool, so flip it
ret
;;; Unused, but improved in case you do want to use it
section .rodata
inputstr db "mommom"
inputstr_len equ $-inputstr
msg_palin db "is palindrome",10,0
msg_nopalin db "is not palindrome" ; newline and terminating zero are on the next line, with their own label
msg_newline db 10,0
(The inspiration for how to write some of main() came from gcc output on godbolt. Compiler output is often a good starting point.)
$ yasm -Worphan-labels -gdwarf2 -felf64 palindrome.asm && gcc palindrome.o -o palindrome
$ ./palindrome '' && echo "true" || echo "false"
true
$ ./palindrome 1 && echo "true" || echo false
true
$ ./palindrome abccba && echo "true" || echo "false"
true # even size: pair of middle chars
$ ./palindrome abcdcba && echo "true" || echo "false"
true # odd size: single middle char
$ ./palindrome abcdeba && echo "true" || echo "false"
false
$ ./palindrome 'ab bcdcb 1234 bcdcb baab bcdcb 4321 bcdcb ba' && echo "true" || echo "false"
true
$ ./palindrome 'ab bcdcb 1234 bcdcb bab bcdcb 4321 bcdcb ba' && echo "true" || echo "false"
true
$ ./palindrome 'ab bcdcb 1234 bcdcb baab bcdcb 4321 bcdcb baa' && echo "true" || echo "false"
false
$ ./palindrome && echo "true" || echo "false"
Segmentation fault (core dumped) # main doesn't check argc
To make it case-insensitive: as well as byte-reversing one chunk, force the alphabetic characters of both chunks to lower case before comparing. With ASCII, this is easy. The answer I linked even has an SSE vector implementation of case-flipping, which can be modified to force all alphabetic characters to the same case instead (either replace the final pxor with por, or use a blend with the compare result as the control, instead of pandn / por).
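A scalar C sketch of that case-folding step might look like the following. It is an illustration under my own assumptions, not the linked SSE code: the names are hypothetical, only ASCII is handled, and it works one byte at a time rather than on whole chunks. The point it shows is that setting bit 0x20 is applied only to bytes that are actually letters, so that pairs like '@' and '`' are not treated as equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Force an ASCII letter to lower case by setting bit 0x20.
 * Non-letters are returned unchanged. */
static unsigned char ascii_to_lower(unsigned char c)
{
    unsigned char folded = (unsigned char)(c | 0x20);
    return (folded >= 'a' && folded <= 'z') ? folded : c;
}

/* Hypothetical case-insensitive check (ASCII only). */
bool is_palindrome_nocase(const char *s, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0, j = len; i + 1 < j; i++, j--) {
        unsigned char a = ascii_to_lower((unsigned char)s[i]);
        unsigned char b = ascii_to_lower((unsigned char)s[j - 1]);
        if (a != b)
            return false;
    }
    return true;
}
```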
Besides using displacements ([rsi+8], [rdi-8], etc.) to save add/sub instructions, you can also reduce the number of branches by ORing or ANDing some compare results together. This may or may not save anything. On a CPU that does not suffer at all from partial-register merging slowdowns (e.g. AMD), this might be a win:
.palin_loop:
xor eax,eax ; inside the loop to break the false dependency
xor edx,edx
movbe ... / cmp ... ; movbe is a load and endian swap in one insn. On Haswell it just saves code-size, not uops. On Silvermont it's a win
setne dl
movbe rcx, [rsi+rdi-16] / cmp [rsi+8], rcx
setne dh
shl edx, 16 ; DH merging takes a cycle on Haswell
... repeat setting dl/dh so edx holds 4 results from four 8B checks
movbe ... / cmp ...
setne al
movbe ... / cmp ...
setne ah
shl eax, 16
... repeat setting al/ah
or eax, edx ; or unroll twice as much, and use rax,rdx
jnz .not_palin
With SSE, where you already have the compare results in an integer register from pmovmskb rather than needing setcc, ANDing the results together is a much bigger win.
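The branch-reduction idea can be sketched in C as well. This is a simplified illustration of my own, not a translation of the assembly: the name is hypothetical, it works on single bytes rather than 8-byte movbe loads, and it accumulates XOR differences instead of setne results. What it keeps is the core technique: combine several compare results with OR and branch only once per unrolled group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: four byte pairs are checked per conditional branch. */
bool is_palindrome_fewer_branches(const char *s, size_t len)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;      /* index of the next front byte */
    size_t j = len;    /* one past the next back byte */

    /* Unrolled part: needs four unchecked bytes on each side. */
    while (j - i >= 8) {
        unsigned diff = 0;
        diff |= (unsigned)(p[i]     ^ p[j - 1]);
        diff |= (unsigned)(p[i + 1] ^ p[j - 2]);
        diff |= (unsigned)(p[i + 2] ^ p[j - 3]);
        diff |= (unsigned)(p[i + 3] ^ p[j - 4]);
        if (diff != 0)             /* a single branch for four comparisons */
            return false;
        i += 4;
        j -= 4;
    }

    /* Remaining middle bytes, one pair at a time. */
    while (i + 1 < j) {
        if (p[i] != p[j - 1])
            return false;
        i++;
        j--;
    }
    return true;
}
```

As with the assembly version, whether this is faster than branching on every pair depends on the CPU and on how early mismatches tend to appear.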
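It looks like the same loop structure as the byte-at-a-time version is the ideal one.
We use pshufb to reverse the order of a whole vector. pmovmskb followed by a test of the bitmask is a very standard idiom for handling vector compare results. It is faster than PTEST, because PTEST is 2 uops and cannot macro-fuse.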