Question

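I have a Unicode string, and I want to check whether each byte is a continuation byte or a start byte, so that a simple program can count the number of Unicode characters.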

#!/usr/bin/env python
# -*- coding: utf-8 -*-



def arg(str):

  i = 0
  j = 0
  print i

  for test in str:
    print test
    value = int(test,16)
    if (value & 0xc0) != 0x80:
        j=j+1
        print "hello"

  print j
  #return j






def main():
    print "inside main"

    new = "象形字"

    charlen = len(new)
    print charlen
    tes = new.decode('utf-8')

    declen = len(tes)
    print declen


    data = tes.encode('utf-8')


    # print self_len

    enclen = len(data)
    print enclen

    print data

    arg(data)







if __name__ == "__main__":
    main()

Running the code produces this error:

象形字[Decode error - output not utf-8]
Traceback (most recent call last):
  File "/Users/laxmi518/Documents/laxmi/code/C/python-c/python_unicode.py", line 69, in <module>
    main()
  File "/Users/laxmi518/Documents/laxmi/code/C/python-c/python_unicode.py", line 52, in main
    arg(data)
  File "/Users/laxmi518/Documents/laxmi/code/C/python-c/python_unicode.py", line 16, in arg
    value = int(test,16)
ValueError: invalid literal for int() with base 16: '\xe8'
[Finished in 0.1s with exit code 1]

Answer 1

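UTF-8 bytes are not hexadecimal strings. They are just bytes, and Python displays bytes outside the printable ASCII range using string-literal escape syntax. That is only a debugging display notation; '\xe8' is a single byte, not the four characters you see on screen.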
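To make the difference concrete, here is a small sketch. It is written to run on both Python 2.7 and Python 3, which is my assumption rather than something the question requires, and the variable names are only illustrative.

```python
# The first character of the question's string, written as an escape so the
# file needs no encoding declaration.
data = u"\u8c61".encode("utf-8")

# Three bytes, even though the debug display shows twelve characters of escapes.
byte_count = len(data)
print(byte_count)

# bytearray yields integers on both Python 2 and 3, so no hex parsing is needed.
values = list(bytearray(data))
print(values)

# int(..., 16) is meant for text such as "e8", not for the raw byte itself.
parsed = int("e8", 16)
print(parsed)
```

The value parsed from the text "e8" equals the first integer in values, which is the number the bit mask in the question should be applied to.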

Use the ord() function to get the numeric value of a byte:

value = ord(test)

With this change, running the script in a terminal on Mac OS X (configured for UTF-8) outputs:

inside main
9
3
9
象形字
0
?
hello
?
?
?
hello
?
?
?
hello
?
?
3

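The question marks are produced by the terminal. Printing individual bytes from a UTF-8 byte stream means you are printing incomplete UTF-8 sequences, so the terminal does not know what to do with them and shows placeholder characters instead.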
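The same incompleteness can be seen without a terminal by asking Python's own decoder. This is only a sketch of the decoder's behaviour (terminals vary in how they render bad input), and lead_alone_ok is a name I made up for the example.

```python
data = u"\u8c61".encode("utf-8")   # three bytes: e8 b1 a1

# The complete three-byte sequence decodes to exactly one character.
whole = data.decode("utf-8")
print(len(whole))

# The lead byte alone is an incomplete sequence, so strict decoding fails.
try:
    data[:1].decode("utf-8")
    lead_alone_ok = True
except UnicodeDecodeError:
    lead_alone_ok = False
print(lead_alone_ok)
```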

Instead of printing test directly, print the output of the repr() function:

print repr(test)

This gives the \xhh hexadecimal notation for those bytes:

inside main
9
3
9
象形字
0
'\xe8'
hello
'\xb1'
'\xa1'
'\xe5'
hello
'\xbd'
'\xa2'
'\xe5'
hello
'\xad'
'\x97'
3
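Putting the pieces together, a cleaned-up version of the counting function might look like the sketch below. It assumes the input is valid UTF-8 and uses bytearray instead of ord() so that the same code runs on Python 2.7 and Python 3 (on Python 3, iterating over bytes already yields integers, and ord() on them would fail). The name count_utf8_chars is hypothetical.

```python
def count_utf8_chars(data):
    """Count code points in UTF-8 bytes by counting non-continuation bytes.

    Assumes `data` is valid UTF-8; no validation is attempted.
    """
    count = 0
    # bytearray yields integers on both Python 2 and 3, so ord() is not needed.
    for value in bytearray(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts a character.
        if (value & 0xC0) != 0x80:
            count += 1
    return count


# The question's string, written with escapes to avoid source-encoding issues.
encoded = u"\u8c61\u5f62\u5b57".encode("utf-8")
print(len(encoded))                    # number of bytes
print(count_utf8_chars(encoded))       # number of start bytes
print(len(encoded.decode("utf-8")))    # number of characters after decoding
```

For this string the three prints give 9, 3 and 3, matching the byte and character counts in the output above. In practice, decoding and calling len() is the simpler way to count characters; the bit-mask loop is mainly useful when you need to work on the raw bytes.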
