Question

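I'm creating a personal RSS reader using lxml's etree, but I can't convert the text back to the original characters. I expect to see “World Cup 2014: With Júlio César’s Help”: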

url = 'rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'
xml = etree.parse(url)
for x in xml.findall('.//item'):
    text = x.find('.//description').text
    print text
    # 'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
    text = text.encode('utf-8')
    print text
    # 'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
    text = text.decode('utf-8')
    # Error: 'UnicodeEncodeError: 'ascii' codec can't encode character....'

I've read Python's Unicode HOWTO and Joel's Unicode Intro, but I must be missing something.

EDIT: Almost there, thanks to unutbu... I just need help converting the \u2019:

content = 'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
html = LH.fromstring(content)
text = html.text_content()
print text
print(repr(text))
print text.encode('utf-8')

##RESULTS##
World Cup 2014: With Júlio César\u2019s Help
u'World Cup 2014: With J\xfalio C\xe9sar\\u2019s Help'
World Cup 2014: With Júlio César\u2019s Help

Answer 1

I believe text is unicode just before the UnicodeEncodeError is raised:

text = u'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
text = text.decode('utf-8')

This reproduces the error message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 22: ordinal not in range(128)

In Python2, lxml sometimes returns str for text, and sometimes unicode. Indeed, if you run this script, you will see this unfortunate behavior:

import lxml.etree as ET
import urllib2

url = 'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'
xml = ET.parse(urllib2.urlopen(url))
for x in xml.findall('.//item'):
    text = x.find('.//description').text
    print(type(text))

prints

<type 'str'>
<type 'str'>
<type 'str'>
<type 'unicode'>
<type 'str'>
<type 'unicode'>
...

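However, it only returns a str when the text consists purely of ASCII values (i.e. byte values between 0 and 127).

Although in general one should not encode strs, encoding a str composed of byte values in the 0-127 (ASCII) range with utf-8 leaves the str unchanged.

So you can actually handle both str and unicode at the same time by encoding both with utf-8, as though text were always unicode.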
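The idea of treating both types uniformly can be sketched with a small helper. This is only an illustration, not part of the original script: the names text_type and to_utf8 are hypothetical, and the type(u'') trick is used here just so the same code runs on both Python2 and Python3. It also checks the type explicitly rather than relying on Python2's implicit conversion.

```python
# Hypothetical helper: produce UTF-8 bytes whether lxml handed back
# a byte string or a unicode string.
text_type = type(u'')  # unicode on Python2, str on Python3


def to_utf8(text):
    """Return UTF-8 encoded bytes for either kind of string."""
    if isinstance(text, text_type):
        return text.encode('utf-8')
    # Already bytes; assumed to be pure ASCII, which is valid UTF-8 as is.
    return text


ascii_only = to_utf8(b'Plain ASCII description')
accented = to_utf8(u'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help')
print(repr(ascii_only))
print(repr(accented))
```

The ASCII-only value passes through untouched, while the accented one comes back as multi-byte UTF-8 sequences.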

Since content is really HTML, I use lxml.html to reduce the HTML to just the plain text content. This too can be a str or unicode. This object, text, is then encoded before printing:

import lxml.etree as ET
import lxml.html as LH
import urllib2

url = 'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'
xml = ET.parse(urllib2.urlopen(url))
for x in xml.findall('.//item'):
    content = x.find('.//description').text
    html = LH.fromstring(content)
    text = html.text_content()
    print(text.encode('utf-8'))

Note that in Python3, lxml always returns a unicode (Python3's str), so sanity of mind is restored.

How the UnicodeEncodeError occurs:

text = u'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
text.decode('utf-8')

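First, note that this is a UnicodeEncodeError, even though you are asking Python to decode text. Note further that the error message says Python is trying to use the ascii codec.

This is the classic sign that the problem is related to Python2's automatic conversion between str and unicode.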

Suppose text is unicode. If you call

text.decode('utf-8')

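then you are asking Python to perform a no-no: decoding a unicode. However, Python2 tries to oblige by silently encoding the unicode with the ascii codec first, before decoding with utf-8. This automatic conversion between str and unicode is meant as a convenience for handling str and unicode values that contain only ASCII characters, but it muddies the mental picture: it encourages the programmer to forget the difference between str and unicode, and it works only some of the time, namely when the values are in the ASCII range. When the values fall outside the ASCII range you get an error, which is what you are experiencing.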
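The hidden step can be made visible by performing the ASCII encode explicitly. The snippet below is a sketch of what Python2 effectively attempts behind the scenes; the variable names failed_codec, failed_position and failed_char are chosen here for illustration. It runs on Python2 and Python3 alike, because the explicit encode fails the same way on both.

```python
text = u'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'

try:
    # The step Python2 performs silently before decoding a unicode.
    text.encode('ascii')
except UnicodeEncodeError as err:
    failed_codec = err.encoding          # codec that could not encode
    failed_position = err.start          # index of the offending character
    failed_char = err.object[err.start]  # the character itself
    print(err)
```

The codec, character and position reported here match the error message reproduced earlier in this answer.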

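In Python3 there is no automatic conversion between str and bytes (in Python2 parlance, unicode and str respectively). Python simply raises an error when you try to encode a bytes or decode a str. Mental clarity is restored, at the cost of forcing the programmer to pay attention to types. However, as this question shows, that cost is unavoidable even with Python2.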
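A minimal Python3-only sketch of that stricter behavior follows. The variable names are illustrative. In Python3 the error takes the form of an AttributeError, because str has no decode method and bytes has no encode method.

```python
# Python3 only: str holds text, bytes holds encoded data.
text = 'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
data = text.encode('utf-8')  # str -> bytes is the only valid direction

try:
    text.decode('utf-8')  # decoding text: str has no such method
except AttributeError as err:
    decode_error = type(err).__name__
    print(err)

try:
    data.encode('utf-8')  # encoding bytes: bytes has no such method
except AttributeError as err:
    encode_error = type(err).__name__
    print(err)

# The round trip in the correct directions gives back the original text.
round_trip = data.decode('utf-8')
```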

Answer 2

You are mixing Latin-1 (\xfa) and Unicode (\u2019) in a single string. Python's encoding methods can't handle that.
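To make the mix concrete, here is a sketch that assumes content is a byte string holding Latin-1 bytes plus a literal backslash-u2019 sequence, as in the question's edit. Decoding it as UTF-8 fails. The standard unicode_escape codec, which reads bytes as Latin-1 and also interprets \uXXXX escapes, is one possible way to recover the text. Whether it suits a real feed is an assumption, since it would also interpret any other backslash sequences in the data.

```python
# Assumed input: Latin-1 bytes (\xfa, \xe9) plus a literal "\u2019" escape.
content = b'World Cup 2014: With J\xfalio C\xe9sar\\u2019s Help'

try:
    content.decode('utf-8')  # 0xfa is not a valid UTF-8 start byte
except UnicodeDecodeError as err:
    utf8_failure = err.reason
    print(err)

# unicode_escape treats the bytes as Latin-1 and expands \uXXXX escapes.
text = content.decode('unicode_escape')
print(repr(text))
```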

How to get the original characters using Python?
