Question

I have a function that accepts network requests. Most of the time the string passed in is not unicode, but sometimes it is.

I have code that converts everything to unicode, but it reports this error:

message.create(username, unicode(body, "utf-8"), self.get_room_name(),\
TypeError: decoding Unicode is not supported

I think the reason is that the 'body' parameter is already unicode, so unicode() raises an exception.

Is there a way to avoid this exception, for example by checking the type before converting?

Answer 1

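You do not decode to UTF-8; you encode to UTF-8 or decode from it.
You can safely decode from UTF-8 even if the data is just ASCII, because ASCII is a subset of UTF-8.

The easiest way to detect whether it needs decoding is: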

if not isinstance(data, unicode):
    # It's not Unicode!
    data = data.decode('UTF8')
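
To make the encode/decode direction concrete, here is a minimal sketch of the same check wrapped in a helper. The names `ensure_text` and `text_type` are made up for illustration, and the version switch is an assumption so that the snippet also runs on Python 3, where the text type is `str` rather than `unicode`.

```python
import sys

# Assumption: the code may run on Python 2 or Python 3. The text type is
# `unicode` on Python 2 and `str` on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def ensure_text(data, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass text through."""
    if not isinstance(data, text_type):
        # It's not text yet, so decode *from* UTF-8.
        data = data.decode(encoding)
    return data


raw = b"caf\xc3\xa9"                  # UTF-8 bytes, e.g. from a network request
text = ensure_text(raw)               # bytes -> text (decode from UTF-8)
assert ensure_text(text) is text      # already text: returned untouched
assert text.encode("utf-8") == raw    # text -> bytes (encode to UTF-8)
```

Pure ASCII input goes through the same path, since decoding ASCII bytes as UTF-8 always succeeds.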

Answer 2

You can use either this:

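try:
    # unicode(body) decodes with the default ASCII codec.
    body = unicode(body)
except UnicodeDecodeError:
    # Non-ASCII bytes: decode them as UTF-8 instead.
    body = body.decode('utf8')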

Or this:

try:
    body = unicode(body, 'utf8')
except TypeError:
    # body is already unicode, so no decoding is needed.
    body = unicode(body)

Answer 3

Mark Pilgrim wrote a Python library that guesses text encodings:

http://chardet.feedparser.org/

On Unicode and UTF-8, the first two sections of chapter 4 of his book "Dive Into Python 3" are excellent:

http://diveintopython3.org/strings.html
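
If pulling in a third-party library is not an option, a much cruder stand-in is to try a short list of candidate encodings in order. This is not how chardet works (it does statistical detection); the sketch below is only a standard-library fallback, and the helper name and the candidate list are assumptions chosen for illustration.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Hypothetical helper: return (text, encoding) for the first candidate
    encoding that decodes `raw` without error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


text, used = decode_with_fallback(b"caf\xc3\xa9")   # valid UTF-8
legacy, used_legacy = decode_with_fallback(b"caf\xe9")  # not valid UTF-8
```

Putting UTF-8 first matters: it is strict enough that text in a legacy single-byte encoding rarely decodes as UTF-8 by accident, whereas latin-1 accepts any byte sequence and so only makes sense as the last resort.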

Answer 4

This is what I use:

def to_unicode_or_bust(obj, encoding='utf-8'):
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj

Taken from this presentation: http://farmdev.com/talks/unicode/

And here is some sample code that uses it:

def hash_it_safe(s):
    try:
        s = to_unicode_or_bust(s)
        return hash_it_basic(s)
    except UnicodeDecodeError:
        return hash_it_basic(s)
    except UnicodeEncodeError:
        assert type(s) is unicode
        return hash_it_basic(s.encode('utf-8'))

Does anyone have ideas on how to improve this code? ;)
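
One possible simplification, offered as a sketch rather than a drop-in replacement: normalize everything to bytes before hashing, so neither exception branch is needed. `hash_it_basic` is not shown above, so `hash_bytes` below is a stand-in that assumes it is a plain digest over bytes (SHA-1 is picked arbitrarily), and `text_type` is an assumption so the snippet runs on Python 2 and 3 alike.

```python
import hashlib
import sys

# Assumption: `unicode` on Python 2, `str` on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def hash_bytes(data):
    """Stand-in for hash_it_basic (an assumption): SHA-1 hex digest of bytes."""
    return hashlib.sha1(data).hexdigest()


def hash_it_safe(s, encoding="utf-8"):
    """Hash text and byte strings consistently by always hashing bytes."""
    if isinstance(s, text_type):
        # Text is encoded *to* UTF-8; byte strings are hashed as they are.
        s = s.encode(encoding)
    return hash_bytes(s)
```

With this version a text string and its UTF-8 encoded bytes produce the same digest, and bytes that are not valid UTF-8 are still hashed unchanged, which matches what the UnicodeDecodeError branch above does.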

Python unicode: how do I tell whether a string needs to be decoded as UTF-8?

4 answers: