Question

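I have a list of strings in different encodings: some are cp1251, some are ASCII, and some are something else. I need to convert them all to Unicode, because I get a UnicodeDecodeError, in particular when I try to dump this data to JSON.

How can I do this?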

Answer 1

You can use chardet to detect a string's encoding, so one way to convert a list of them to unicode (in Python 2.x) is:

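import chardet

def unicodify(seq, min_confidence=0.5):
    """Decode each byte string in seq using the encoding chardet guesses."""
    result = []
    for text in seq:
        guess = chardet.detect(text)
        encoding = guess["encoding"]
        if encoding is None or guess["confidence"] < min_confidence:
            # chardet isn't confident enough in its guess, so fail loudly.
            # UnicodeDecodeError requires all five arguments.
            raise UnicodeDecodeError(encoding or "unknown", text, 0, len(text),
                                     "chardet could not detect the encoding")
        result.append(text.decode(encoding))
    return result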

...which you can use like this:

>>> unicodify(["¿qué?", "什么？", "what?"])
[u'\xbfqu\xe9?', u'\u4ec0\u4e48\uff1f', u'what?']

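CAVEAT: A solution like chardet should only be used as a last resort (for example, to repair a dataset that was corrupted by past mistakes). It is too fragile to rely on in production code; instead, as @bames53 pointed out in the comments on this answer, you should first fix the code that corrupted the data.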
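One way to read the caveat above: decode at the point where the bytes enter the program, using an encoding you actually know, so nothing downstream has to guess. The following is a minimal sketch in Python 3 syntax (not the Python 2 used elsewhere in this thread). The file name, its contents, and the assumption that the source is cp1251 are all hypothetical.

```python
import io
import os
import tempfile

# Hypothetical input: a small file that is known to be written in cp1251.
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with io.open(path, "wb") as handle:
    handle.write("Москва\nКиев\n".encode("cp1251"))

# Decode once, at the boundary, by naming the encoding explicitly.
# Everything read from the handle is already Unicode text.
with io.open(path, "r", encoding="cp1251") as handle:
    names = [line.rstrip("\n") for line in handle]
```

io.open accepts the encoding argument in Python 2.6+ as well, so the reading half of this sketch should carry over to older code.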

Answer 2

If you know the encoding, it should be simple:

unicode_string = encoded_string.decode(encoding)

If you don't know the encoding, it can be hard to detect, but that depends on which encodings and languages you expect.
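When the set of expected encodings is small, one possible approach is to try them in order, strictest first. The sketch below is written in Python 3 syntax and assumes the only candidates are UTF-8 (which also covers ASCII) and cp1251; the helper name decode_with_fallback and the candidate order are illustrative, not something from the answers here. UTF-8 goes first because cp1251 accepts almost any byte sequence and would otherwise mask UTF-8 input.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1251")):
    """Return the first successful decoding of a bytes object."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Assumption: a string with replacement characters is preferable
    # to an exception once every candidate has failed.
    return raw.decode(encodings[0], errors="replace")


samples = [
    "what?".encode("ascii"),
    "привет".encode("cp1251"),
    "¿qué?".encode("utf-8"),
]
decoded = [decode_with_fallback(sample) for sample in samples]
```

This only works as well as the candidate list fits the data: text in another single-byte encoding will still decode "successfully" as cp1251 and come out wrong.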

Answer 3

Try using the unicode function to convert the string to the built-in unicode type.

>>> s = "Some string"
>>> s = unicode(s)
>>> type(s)
<type 'unicode'>

For your problem, try this to create a new list of unicode strings.

new = []
for item in myList:
    new.append(unicode(item))

Or use a list comprehension:

new = [unicode(item) for item in myList]
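Note that in Python 2, unicode(item) with no encoding argument decodes with the default codec, normally ASCII, so it raises UnicodeDecodeError on the non-ASCII byte strings the question describes. Passing the encoding explicitly avoids that. Below is a minimal sketch of the same list comprehension in Python 3 syntax, where str(item, encoding, errors) plays the role of unicode(item, encoding, errors); the helper name to_text, the sample data, and the cp1251 default are assumptions for illustration.

```python
import json


def to_text(item, encoding="cp1251", errors="strict"):
    """Decode bytes with an explicit encoding; pass text through unchanged."""
    if isinstance(item, bytes):
        return str(item, encoding, errors)
    return item


# Hypothetical mixed input: one text string, one cp1251 byte string.
my_list = ["what?", "привет".encode("cp1251")]
new = [to_text(item) for item in my_list]

# Once everything is text, dumping to JSON no longer fails.
payload = json.dumps(new, ensure_ascii=False)
```

Setting errors="replace" or errors="ignore" trades the exception for lossy output, which may or may not be acceptable for your data.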

Read the official Python Unicode HOWTO.

Check encoding and convert to Unicode
