Question

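I have a Python script that pulls in data from many sources (databases, files, etc.). Supposedly, all the strings are unicode, but what I end up getting is any variation on the following theme (as returned by repr()):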

u'D\\xc3\\xa9cor'
u'D\xc3\xa9cor'
'D\\xc3\\xa9cor'
'D\xc3\xa9cor'

Is there a reliable way to take any of the four strings above and return the proper unicode string?

u'D\xe9cor' # --> Décor

The only way I can think of right now uses eval(), replace(), and a deep, burning shame that will never wash away.

Answer 1

That's just UTF-8 data. Use .decode to convert it to unicode.

>>> 'D\xc3\xa9cor'.decode('utf-8')
u'D\xe9cor'

For the 'D\\xc3\\xa9cor' case, you can perform an additional string-escape decode first.

>>> 'D\xc3\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
>>> 'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
>>> u'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'

To handle the second case (u'D\xc3\xa9cor') as well, you need to detect whether the input is unicode and convert it to a str first.

>>> def conv(s):
...   if isinstance(s, unicode):
...     s = s.encode('iso-8859-1')
...   return s.decode('string-escape').decode('utf-8')
... 
>>> map(conv, [u'D\\xc3\\xa9cor', u'D\xc3\xa9cor', 'D\\xc3\\xa9cor', 'D\xc3\xa9cor'])
[u'D\xe9cor', u'D\xe9cor', u'D\xe9cor', u'D\xe9cor']
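
The 'string-escape' codec and the unicode type exist only in Python 2. As a rough sketch of the same idea for Python 3 (the name conv_py3 is made up here, and it assumes the data contains no backslashes other than \xNN escapes), the documented unicode_escape codec can stand in for string-escape:

```python
def conv_py3(s):
    """Repair any of the four variants from the question (Python 3 sketch)."""
    if isinstance(s, str):
        # The text holds byte values disguised as code points;
        # latin-1 maps code points 0-255 back to the same byte values.
        s = s.encode('latin-1')
    # unicode_escape resolves literal \xNN escapes and reads every other
    # byte as latin-1, so the result is again "bytes disguised as text".
    text = s.decode('unicode_escape')
    # Recover the real bytes and decode them as the UTF-8 they always were.
    return text.encode('latin-1').decode('utf-8')


variants = ['D\\xc3\\xa9cor', 'D\xc3\xa9cor', b'D\\xc3\\xa9cor', b'D\xc3\xa9cor']
print([conv_py3(v) for v in variants])
```

Like conv, this assumes the input really is one of the four damaged forms; a string that is already correct, such as 'Décor', would raise UnicodeDecodeError in the final step.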

Answer 2

Write adapters that know which transformations should be applied to their sources.

>>> 'D\xc3\xa9cor'.decode('utf-8')
u'D\xe9cor'
>>> 'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
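
One way to read this advice is a lookup table that maps each source to the single conversion it needs, so nothing has to be guessed per string. The sketch below is in Python 3; the source names ('csv_export', 'legacy_db', 'log_file') and all function names are hypothetical and only illustrate the shape.

```python
def from_utf8_bytes(raw):
    # Source hands over raw UTF-8 bytes.
    return raw.decode('utf-8')


def from_mojibake_text(text):
    # Source hands over text in which UTF-8 bytes were read as latin-1.
    return text.encode('latin-1').decode('utf-8')


def from_escaped_utf8_bytes(raw):
    # Source hands over bytes containing literal \xNN escapes of UTF-8 bytes.
    return raw.decode('unicode_escape').encode('latin-1').decode('utf-8')


# Hypothetical mapping: each source gets exactly one known conversion.
ADAPTERS = {
    'csv_export': from_utf8_bytes,
    'legacy_db': from_mojibake_text,
    'log_file': from_escaped_utf8_bytes,
}


def to_text(source, value):
    """Convert a value using the adapter registered for its source."""
    return ADAPTERS[source](value)
```

An unknown source raises KeyError, which is arguably preferable to silently applying the wrong conversion.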

Answer 3

Here's the solution I came up with before I saw Kenny's proper, more concise solution:

def ensure_unicode(string):
    try:
        string = string.decode('string-escape').decode('string-escape')
    except UnicodeEncodeError:
        string = string.encode('raw_unicode_escape')

    return unicode(string, 'utf-8')

Dealing with wacky encodings in Python
