Question

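I'm having trouble getting attributes with lxml when an element contains Unicode characters. I've read many other threads about UTF-8 decoding, but I can't get it to work, and nothing I found covers the case where the Unicode character is in an attribute.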

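Currently, if the html tag has no Unicode characters, getting the attributes works as expected. For example, when I print the document and the html tag looks like <html lang="en" class="">, I get {'lang': 'en', 'class': ''} for html_attributes, as you would expect: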

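# Python 3 import; on Python 2 this was "from urllib import urlopen"
from urllib.request import urlopen
from lxml import etree

# url is the address of the page being fetched (defined elsewhere)
document = urlopen(url).read()

# Print the document
print(document)

# Parse the HTML; the root element is the <html> tag
tree = etree.HTML(document)
html_attributes = tree.attrib

# Print the attributes
print(html_attributes)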

But whenever I feed lxml an HTML document with a Unicode attribute on the html tag, like <html \u26a1="">, I end up with an empty object: {} from print(html_attributes).

Is there a workaround so that it prints the Unicode attributes?
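One possible direction, sketched under the assumption that only the attributes of the html tag are needed: Python's standard-library html.parser is lenient about which characters it accepts in attribute names, so it can be used to read them without lxml. The class HtmlTagAttributes, the function html_tag_attributes, and the sample markup below are hypothetical names introduced for illustration; this does not change how lxml itself parses the tag.

```python
from html.parser import HTMLParser


class HtmlTagAttributes(HTMLParser):
    """Collect the attributes of the first <html> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.attributes = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only the first <html> tag.
        if tag == "html" and self.attributes is None:
            self.attributes = dict(attrs)


def html_tag_attributes(markup):
    """Return the <html> tag's attributes as a dict, or {} if none were found."""
    parser = HtmlTagAttributes()
    parser.feed(markup)
    parser.close()
    return parser.attributes or {}


# Sample markup (an assumption, modeled on the tag from the question).
sample = '<html \u26a1="" lang="en"><head></head><body></body></html>'
print(html_tag_attributes(sample))
```

html_tag_attributes expects a str, so the bytes returned by urlopen(url).read() would need decoding first, for example document.decode("utf-8") if the page is known to be UTF-8.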

Problem with Unicode and lxml attributes

0 answers: