Question

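I'm using NewsPlease to pull article titles from the Common Crawl news archive, but the titles come back as a mix of normally encoded characters and escaped Unicode bytes (literal \xNN sequences), so I can't decode them properly. Take one of the titles: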

x = articles[800].title

If I evaluate x in Spyder, it returns:

'Las 10 canciones m\\xc3\\xa1s populares de la semana'

When I use print(x), I get:

Las 10 canciones m\xc3\xa1s populares de la semana

However, if I try to fix the encoding the way other posts suggest:

x.encode('latin1').decode('utf8')

it returns

'Las 10 canciones m\\xc3\\xa1s populares de la semana'

which is clearly not correct.
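
The round trip above changes nothing because the title contains no mis-decoded characters at all: each `\xc3` is four separate ASCII characters (backslash, `x`, `c`, `3`). A minimal sketch to confirm this, using the title from the question as a hard-coded literal (the variable names are only illustrative):

```python
# The title as it appears in the question: the backslashes are literal characters.
x = 'Las 10 canciones m\\xc3\\xa1s populares de la semana'

# '\\xc3' in source code is a 4-character string: backslash, 'x', 'c', '3'.
escaped = '\\xc3'
print(len(escaped))   # 4
print(escaped in x)   # True

# Every character is plain ASCII, so latin1 and utf8 both map it to itself.
only_ascii = all(ord(ch) < 128 for ch in x)
round_trip = x.encode('latin1').decode('utf8')
print(only_ascii)        # True
print(round_trip == x)   # True: the "fix" is a no-op
```

The `latin1` to `utf8` trick only repairs strings in which UTF-8 bytes were already decoded as Latin-1 characters (for example 'mÃ¡s'), which is not the case here.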

Does anyone have any suggestions? I'm using Python 3.6.

Answer 1

Found a solution:

x = 'this is a test of the Spanish word m\\xc3\\xa1s'
# Turn the literal \xNN escapes into real bytes, then decode those bytes as UTF-8
x = x.encode('latin1').decode('unicode_escape').encode('latin1').decode('utf8')
print(x)
# this is a test of the Spanish word más
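
To see why the chain works, it may help to look at each step separately. The sketch below splits the one-liner into its four stages, then shows a regex-based alternative for titles that also contain real non-ASCII characters such as '€', where the first `encode('latin1')` would raise `UnicodeEncodeError`. The helper name `fix_escaped_utf8` and the pattern are hypothetical, not part of NewsPlease or the original answer.

```python
import re

x = 'this is a test of the Spanish word m\\xc3\\xa1s'

step1 = x.encode('latin1')              # bytes; the backslashes are still literal
step2 = step1.decode('unicode_escape')  # str; \xc3 and \xa1 become U+00C3 and U+00A1
step3 = step2.encode('latin1')          # bytes; now the real UTF-8 bytes 0xC3 0xA1
step4 = step3.decode('utf8')            # str; 0xC3 0xA1 decodes to 'á'
print(step4)

# Hypothetical helper: only touch runs of literal \xNN escapes, leave the rest alone.
_ESCAPED_BYTES = re.compile(r'(?:\\x[0-9a-fA-F]{2})+')

def fix_escaped_utf8(text):
    def _decode(match):
        # '\\xc3\\xa1' -> 'c3a1' -> b'\xc3\xa1'
        raw = bytes.fromhex(match.group(0).replace('\\x', ''))
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            # Not valid UTF-8: keep the original escape text unchanged.
            return match.group(0)
    return _ESCAPED_BYTES.sub(_decode, text)

title = fix_escaped_utf8('Las 10 canciones m\\xc3\\xa1s populares de la semana')
mixed = fix_escaped_utf8('caf\\xc3\\xa9 por 2 €')
print(title)
print(mixed)
```

The regex version is a design choice for robustness: it leaves already-correct characters alone, whereas the `unicode_escape` chain assumes every character in the input fits in Latin-1.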
