Question

Suppose my string contains some Unicode characters and I need to operate on it. What is the best way to do that?

s = u"blah ascii_word etc شاهد word1 word 2" # Delimited by spaces

words = s.split(u' ')

>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in 
    position 91: ordinal not in range(128)

Any clues?

Also, if I want to write this to a text file and read it back later, what would the process be?

Answer 1

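When you declare the string the way you did, Python assumes it is in your system's default encoding. You need to put u in front of the string to make it unicode, and add an encoding declaration at the top of the file. If you do that, you won't get any error: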

# -*- coding: utf-8 -*-
s = u"blah ascii_word etc شاهد word1 word 2"
words = s.split(u' ')
print words
# no error even though my system's default encoding is ascii

I have now checked this, and you do not even need the u prefix: adding the encoding declaration is enough to fix the problem.
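If the text arrives as raw bytes rather than as a literal in the source file (an assumption; the question does not say where the string comes from), the same idea applies: decode the bytes to unicode explicitly, then split. The following is a minimal sketch, and the names raw and text are illustrative only.

```python
# -*- coding: utf-8 -*-

# Simulate bytes as they might arrive from a file or a socket
# (hypothetical input; UTF-8 is assumed as the encoding).
raw = u"blah ascii_word etc شاهد word1 word 2".encode("utf-8")

# Decode explicitly instead of relying on the implicit ASCII decode.
text = raw.decode("utf-8")

# Both the string and the separator are unicode now, so no implicit
# conversion takes place during the split.
words = text.split(u" ")
```

Decoding once at the boundary and working with unicode everywhere else keeps the program independent of the system's default encoding.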

If you want to use unicode strings in the terminal, you have to check the system encoding and change it if necessary:

>>> import sys
>>> sys.getdefaultencoding()
'ascii' #I have ascii

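You can then use sys.setdefaultencoding() to change it. This is tricky, though, and depends on your operating system. In Python 2 the function is removed from sys at startup, so it is only reachable after reload(sys).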
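The question also asks about writing the text to a file and reading it back. Here is a minimal sketch, assuming UTF-8 is acceptable as the on-disk encoding. The helper names save_text and load_text and the file name words.txt are hypothetical. Passing the encoding to io.open means the result does not depend on the default encoding at all.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def save_text(path, text):
    # Hypothetical helper: write unicode text to path, encoded as UTF-8.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_text(path):
    # Hypothetical helper: read the file back, decoding it from UTF-8.
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read()


s = u"blah ascii_word etc شاهد word1 word 2"

# A temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "words.txt")

save_text(path, s)
restored = load_text(path)
words = restored.split(u" ")
```

io.open takes the same arguments in Python 2 and Python 3, so the sketch should behave the same way on both.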

Python strings with unicode, UnicodeDecodeError

1 answer: