Question

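I wrote a function in assembly that takes an unsigned long. This long is a UTF-8 character.

I want to check whether it is a 1-, 2-, 3- or 4-byte UTF-8 character. So far I have this (I have changed the code so that it is not affected by endianness, I think...):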

movl    12(%ebp),%eax   # Move long u to %eax
movl    %eax,buff       # Move long u to buff
andl    $128,buff       # &-mask 1 MSB (from LSByte)
cmpl    $0,buff         # Compare buff to 0
je      wu8_1byte       # If 0, 1 byte UTF8

movl    12(%ebp),%eax   # Move long u to %eax
movl    %eax,buff       # Move long u to buff
andl    $0xE000,buff    # &-mask 3 MSB (from byte LSByte 2)
cmpl    $0xC000,buff    # Compare the 3 MSB to binary 110
je      wu8_2byte       # If =, 2 byte UTF8

movl    12(%ebp),%eax   # Move long u to %eax
movl    %eax,buff       # Move long u to buff
andl    $0xF00000,buff  # &-mask 4 MSB (from byte MSByte 3)
cmpl    $0xE00000,buff  # Compare the 4 MSB to binary 1110
je      wu8_3byte       # If =, 3 byte UTF8

jmp     wu8_4byte       # If no, 4 byte UTF8

12(%ebp) is the long I want. buff is a 4-byte variable.

It works for 1 byte, but not for the others.

Any hints on how I can figure out which kind of UTF-8 character it is?

UTF-8 encoding:

                           0xxxxxxx    # 1 byte
                  110xxxxx 10xxxxxx    # 2 byte
         1110xxxx 10xxxxxx 10xxxxxx    # 3 byte
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    # 4 byte

Answer 1

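It doesn't work for any of them, for one simple reason.

You take a 32-bit value and shift it right. Then you compare it with a constant, forgetting that there are more bits left than the ones you want to compare.

You have to AND the value so that only the bits you want remain: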

movl 12(%ebp),%eax
movl %eax,buff
shrl $13,buff   # UTF-8 2-byte form is 110xxxxx 10xxxxxx, so shift 13 places right
andl $7,buff    # Keep only the three lowest bits
cmpl $6,buff    # Check whether they are 110 (= 6)
je wu8_2byte

I would also do this in a register rather than in a memory location, to make it faster. You can also do it with a single AND and no shifts at all.
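For comparison, here is a rough C sketch of the "AND only, no shifts" idea. It assumes the layout the question's masks imply: the long holds the encoded bytes with the lead byte in the highest used byte (for example 0xC3A9 for a 2-byte character). The name `utf8_len_packed` is made up for illustration. Each mask also covers all the bits above the lead byte, so leftover high bits cannot produce a false match.

```c
#include <stdint.h>

/* Hypothetical helper, not from the original post.
 * Assumption: u holds the encoded bytes of ONE UTF-8 character, lead byte
 * in the highest used byte (e.g. 0xC3A9, 0xE282AC, 0xF09F9880).
 * Returns the length in bytes, or 0 if no lead-byte pattern matches. */
int utf8_len_packed(uint32_t u)
{
    if ((u & 0xFFFFFF80u) == 0x00000000u) return 1; /* 0xxxxxxx in byte 0 */
    if ((u & 0xFFFFE000u) == 0x0000C000u) return 2; /* 110xxxxx in byte 1 */
    if ((u & 0xFFF00000u) == 0x00E00000u) return 3; /* 1110xxxx in byte 2 */
    if ((u & 0xF8000000u) == 0xF0000000u) return 4; /* 11110xxx in byte 3 */
    return 0;                                       /* malformed input */
}
```

In assembly each branch maps to one `andl` on a copy of the value in a register followed by one `cmpl`, which is the same shape as the question's code but with wider masks.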

Answer 2

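Depending on how much error checking you want to do, you can simply test the bits with the test instruction. I'm assuming that the unsigned long has been loaded from a sequence of UTF-8 encoded bytes, least significant byte first, which is the same result as aliasing a char* as an unsigned long* on a little-endian machine.

If these assumptions are wrong, you may need to change the code accordingly. It may be more complex, because you may not know which byte is the lead byte.

E.g.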

movl 12(%ebp),%eax
testl $128,%eax
jz wu8_1byte
testl $32,%eax     # We know that the top bit is set, it's not valid for it to be
                   # 10xxxxxx so we test this bit: 11?xxxxx
jz wu8_2byte
testl $16,%eax     # 111?xxxx
jz wu8_3byte
# Must be 4 byte
jmp wu8_4byte
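
If C is easier to follow, the same bit tests can be sketched as below. This keeps the assumption stated above, that the lead byte sits in the least significant byte of the long. The names `load_le` and `utf8_len_from_lead` are made up for illustration, and like the assembly, the sketch does no validation of the continuation bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack up to 4 bytes into a 32-bit value, first byte
 * in the least significant position, independent of host endianness. */
uint32_t load_le(const unsigned char *p, size_t n)
{
    uint32_t u = 0;
    for (size_t i = 0; i < n && i < 4; i++)
        u |= (uint32_t)p[i] << (8 * i);
    return u;
}

/* Assumption: the low byte of u is a valid UTF-8 lead byte. */
int utf8_len_from_lead(uint32_t u)
{
    if (!(u & 0x80)) return 1; /* 0xxxxxxx */
    if (!(u & 0x20)) return 2; /* 110xxxxx: top bit set, so test 11?xxxxx */
    if (!(u & 0x10)) return 3; /* 1110xxxx: test 111?xxxx */
    return 4;                  /* 11110xxx */
}
```

For example, the bytes C3 A9 load as 0xA9C3, and the low byte 0xC3 has bit 0x20 clear, so the result is 2.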

The following snippet makes the same assumptions as your original code.

movl 12(%ebp),%eax

testl $0x80,%eax
jz wu8_1byte
                     # We can assume that the last byte is of the form 10xxxxxx
testl $0x4000,%eax   # Testing this bit in byte n - 1: 1?xxxxxx
jnz wu8_2byte

testl $0x400000,%eax # Testing this bit in byte n - 2: 1?xxxxxx
jnz wu8_3byte
# Must be 4 byte
jmp wu8_4byte

Answer 3

I solved this by reading up on UTF-8 and finding a simpler solution:

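cmpl    $0x7F,12(%ebp)     # Compare unsigned long to 1 byte UTF-8 max value
jbe     wu8_1byte

cmpl    $0x7FF,12(%ebp)    # Compare unsigned long to 2 byte UTF-8 max value
jbe     wu8_2byte

cmpl    $0xFFFF,12(%ebp)   # Compare unsigned long to 3 byte UTF-8 max value
jbe     wu8_3byte

cmpl    $0x10FFFF,12(%ebp) # Compare unsigned long to 4 byte UTF-8 max value
jbe     wu8_4byte

Because of the way UTF-8 characters are encoded, the largest value that fits in 1 byte is 0x7F, in 2 bytes 0x7FF, in 3 bytes 0xFFFF, and in 4 bytes 0x10FFFF (the highest Unicode code point; the 4-byte form carries 21 payload bits). So by comparing the unsigned long, which here holds the code point, against these values, I can determine how many bytes are needed to encode the character.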
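
As a rough illustration of where these range checks lead, here is a C sketch that picks the length by comparison and then writes the bytes. It assumes the input is a Unicode code point rather than already-encoded bytes, and it uses 0x10FFFF, the highest Unicode code point, as the 4-byte bound. The names `utf8_encoded_len` and `utf8_encode` are made up, and surrogates (0xD800 to 0xDFFF) are deliberately not rejected to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed to encode code point cp,
 * or 0 if cp is above the Unicode range. */
size_t utf8_encoded_len(uint32_t cp)
{
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Hypothetical encoder: writes the encoding of cp into out (room for
 * 4 bytes) and returns the number of bytes written, 0 on failure. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    size_t n = utf8_encoded_len(cp);
    switch (n) {
    case 1:
        out[0] = (unsigned char)cp;                          /* 0xxxxxxx */
        break;
    case 2:
        out[0] = (unsigned char)(0xC0 | (cp >> 6));          /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));        /* 10xxxxxx */
        break;
    case 3:
        out[0] = (unsigned char)(0xE0 | (cp >> 12));         /* 1110xxxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    case 4:
        out[0] = (unsigned char)(0xF0 | (cp >> 18));         /* 11110xxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    default:
        break;                                               /* out untouched */
    }
    return n;
}
```

For instance, code point 0x20AC falls in the 3-byte range and encodes as E2 82 AC.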

Assembly AT&T x86 - How to compare specific bytes of a long?
