Question

When I fetch a web page, I use UnicodeDammit to convert it to UTF-8, like this:

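import urllib2

import chardet
from lxml import html

# `url` holds the address of the page to fetch
content = urllib2.urlopen(url).read()  # raw bytes (a Python 2 str)
# chardet guesses the encoding from the bytes themselves
encoding = chardet.detect(content)['encoding']
if encoding != 'utf-8':
    # re-encode as UTF-8, replacing bytes that cannot be decoded
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)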

But when I run:

text = doc.text_content()
print type(text)

the output is <type 'lxml.etree._ElementUnicodeResult'>. Why? I expected a UTF-8 string.

Answer 1

lxml.etree._ElementUnicodeResult is a class that inherits from unicode:

$ pydoc lxml.etree._ElementUnicodeResult

lxml.etree._ElementUnicodeResult = class _ElementUnicodeResult(__builtin__.unicode)
 |  Method resolution order:
 |      _ElementUnicodeResult
 |      __builtin__.unicode
 |      __builtin__.basestring
 |      __builtin__.object

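In Python it is quite common for a class to extend a base class in order to add some module-specific functionality. It should be safe to treat the object as a regular unicode string.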
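As a rough illustration of that pattern, here is a minimal sketch written for Python 3, where `str` plays the role of Python 2's `unicode`. The class name `AnnotatedText` and its `source` attribute are made up for this example and are not part of lxml; the point is only that a string subclass can carry extra data while still behaving like a normal string.

```python
# Python 3 sketch: `str` here stands in for Python 2's `unicode`.
class AnnotatedText(str):
    """Hypothetical str subclass carrying one extra, module-specific attribute."""

    def __new__(cls, value, source=None):
        obj = super().__new__(cls, value)
        obj.source = source  # extra data that a plain str cannot hold
        return obj


text = AnnotatedText("hello world", source="<p> element")

# The subclass still passes as an ordinary string everywhere.
is_plain_string = isinstance(text, str)
upper = text.upper()   # string methods hand back plain str objects
joined = text + "!"    # concatenation works as usual
```

Any code that accepts a string will accept such an object, which is why the result of text_content() can be used like any other unicode value.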

Answer 2

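You probably want to skip the re-encoding step, because lxml.html will automatically use the encoding specified in the source document, and as long as the result ends up as valid unicode there is no reason to care how it was originally encoded.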

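Unless your project is so small and informal that you can be sure you will never encounter 8-bit strings (i.e. the text is always 7-bit ASCII, English with no special characters), it is wise to convert your text to unicode as early as possible (for example right after retrieval) and to keep it that way until you need to serialize it to write to a file or send over a socket.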
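The "decode early, encode late" advice can be sketched as follows. This is a Python 3 illustration rather than the Python 2 code from the question, and the helper names `decode_input` and `encode_output` are hypothetical; it assumes the encoding of the incoming bytes is already known.

```python
def decode_input(raw_bytes, encoding="utf-8"):
    """Decode bytes to text once, at the input boundary."""
    return raw_bytes.decode(encoding, "replace")


def encode_output(text, encoding="utf-8"):
    """Encode text to bytes only when it leaves the program."""
    return text.encode(encoding)


raw = "café".encode("latin-1")        # bytes as they might arrive from a socket
text = decode_input(raw, "latin-1")   # unicode text from here on
shouted = text.upper()                # all processing happens on text
payload = encode_output(shouted)      # bytes again, just before writing or sending
```

Keeping the two conversions at the edges means the code in between never needs to know which encoding the data arrived in.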

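The reason you see <type 'lxml.etree._ElementUnicodeResult'> is that lxml.html.fromstring() performs the decoding step for you automatically. Note that this means your code above will not work for a page encoded in UTF-16, for example, because the 8-bit string will have been re-encoded as UTF-8 while the HTML will still say utf-16:

<meta http-equiv="Content-Type" content="text/html; charset=utf-16" />

lxml will then try to decode the string according to UTF-16 rules, and I expect it will raise an exception in short order.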
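The byte-level mismatch can be reproduced without lxml. The following Python 3 sketch uses only the built-in codecs, so it shows what happens when UTF-8 bytes are decoded under a UTF-16 label, not how lxml itself reacts; the helper `decode_as_declared` is a hypothetical name. Depending on the input, the wrong label produces either a decoding error or silently garbled text.

```python
def decode_as_declared(raw_bytes, declared_encoding):
    """Decode bytes with the declared encoding; return None if decoding fails."""
    try:
        return raw_bytes.decode(declared_encoding)
    except UnicodeDecodeError:
        return None


page_text = "héllo"
raw = page_text.encode("utf-8")   # the bytes are really UTF-8

honest = decode_as_declared(raw, "utf-8")        # label matches the bytes
mismatched = decode_as_declared(raw, "utf-16")   # label claims UTF-16

round_trip_ok = honest == page_text
# The mismatched result is either None or garbage, never the original text.
mismatch_detected = mismatched != page_text
```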

If you want the output serialized as a UTF-8 encoded 8-bit string, all you need is:

>>> text = doc.text_content().encode('utf-8')
>>> print type(text)
<type 'str'>
