Printing HTML entities with lxml in Python

Asked: 2014-12-07 05:59:09

Tags: python html html-parsing lxml lxml.html

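I am trying to create a div element from the string below, which contains HTML entities. Because of that, the reserved character & in each entity gets escaped to &amp; in the output, so the entities show up as plain text. How can I avoid this so that the HTML entities are rendered correctly?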

from lxml import etree
import lxml.html

s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'

div = etree.Element("div")
div.text = s

lxml.html.tostring(div)

output:
<div>Actress Adamari L&amp;#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&amp;#8482; Website And Resources</div>

1 answer:

Answer 0 (score: 3)

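You can specify encoding when calling tostring(). Parsing the string with fromstring() decodes the entities (and wraps the bare text in a <p>), and encoding='unicode' makes tostring() emit the characters themselves instead of numeric character references: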

>>> from lxml.html import fromstring, tostring
>>> s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'
>>> div = fromstring(s)
>>> print(tostring(div, encoding='unicode'))
<p>Actress Adamari López And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts™ Website And Resources</p>
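If you would rather keep building the element with etree.Element, as in the question, one alternative is to decode the entities before assigning the string to .text. The following is only a minimal sketch, assuming Python 3, where html.unescape from the standard library is available; it is not part of the original answer.

```python
import html

from lxml import etree
import lxml.html

s = ('Actress Adamari L&#243;pez And Amgen Launch Spanish-Language '
     'Chemotherapy: Myths Or Facts&#8482; Website And Resources')

div = etree.Element("div")
# Decode the character references first, so .text holds the real
# characters and there is no literal "&" left to be escaped.
div.text = html.unescape(s)

# encoding='unicode' returns a str containing the characters themselves.
result = lxml.html.tostring(div, encoding='unicode')
print(result)
```

Here the text is stored as plain characters, so the serializer has nothing to escape; the fromstring() approach above reaches the same result by letting the HTML parser do the decoding.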

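As a side note, you should definitely use lxml.html.tostring() when dealing with HTML data:

Note that you should use lxml.html.tostring and not lxml.tostring. lxml.tostring(doc) will return the XML representation of the document, which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which confuses browsers.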
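The difference between the two serializers can be seen with a small comparison. This is an illustrative sketch only: the markup and the file name app.js are made up for the example, and etree.tostring stands in for the XML serializer the note refers to.

```python
from lxml import etree
import lxml.html

# Hypothetical fragment containing an empty <script> element.
fragment = lxml.html.fromstring('<div><script src="app.js"></script></div>')

# HTML serialization keeps the explicit closing tag.
as_html = lxml.html.tostring(fragment, encoding='unicode')

# XML serialization collapses the empty element into a self-closing tag.
as_xml = etree.tostring(fragment, encoding='unicode')

print(as_html)
print(as_xml)
```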
