Question

When I parse an HTML document with lxml and then call etree.tostring(), I notice that the ampersands in links are converted to HTML escape entities.

For obvious reasons, this breaks the links. Here is a simple, self-contained example of the problem:

>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring("""<a href="https://www.example.com/?param1=value1&param2=value2">link</a>""", parser)
>>> etree.tostring(tree)
'<html><body><a href="https://www.example.com/?param1=value1&amp;param2=value2">link</a></body></html>'

I would like the output to be:

<html><body><a href="https://www.example.com/?param1=value1&param2=value2">link</a></body></html>

Answer 1

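Encoding the ampersand as &amp; is the standard way. If you really need to avoid the conversion for some reason, you can use the following three-step workaround.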
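As a quick illustration of why the escaped form is the standard one, here is a minimal sketch that feeds the serialized output from the question to the standard library's html.parser (not lxml) and reads the href back. The class name HrefCollector is hypothetical and exists only for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the decoded href value of every <a> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; entity references in
        # attribute values have already been decoded by the parser.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


serialized = (
    '<html><body><a href="https://www.example.com/'
    '?param1=value1&amp;param2=value2">link</a></body></html>'
)

collector = HrefCollector()
collector.feed(serialized)
collector.close()

print(collector.hrefs[0])
# https://www.example.com/?param1=value1&param2=value2
```

A consumer that parses the HTML therefore sees the original URL. The workaround in the steps that follow only matters when something reads the serialized text without parsing it.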

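Step 1. Find a unique string that does not exist in the HTML source. If you are confident that the string "ANDamp;" does not appear in your HTML source, you can use "ANDamp;" as the reserved_amp variable. Otherwise, consider generating random characters and checking that the resulting string does not exist in your HTML source: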

>>> import random
>>> import string
>>> length = 15  # increase the length if collisions still occur
>>> reserved_amp = "&amp;"
>>> html = """<a href="https://www.example.com/?param1=value1&param2=value2">link</a>"""
>>> # regenerate until the placeholder differs from "&amp;" and is absent from the source
>>> while reserved_amp == "&amp;" or reserved_amp in html:
...     reserved_amp = ''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(length)) + "amp;"  # the "amp;" suffix makes the placeholder easy to spot
... 
>>> print(reserved_amp)
2eya6oywxg5z7q5amp;

Step 2. Replace every & with the placeholder before parsing:

>>> html = html.replace("&", reserved_amp)
>>> html
'<a href="https://www.example.com/?param1=value12eya6oywxg5z7q5amp;param2=value2">link</a>'
>>>

Step 3. Replace it back only when you need the original form:

>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(html, parser)
>>> etree.tostring(tree, encoding="unicode").replace(reserved_amp, "&")
'<html><body><a href="https://www.example.com/?param1=value1&param2=value2">link</a></body></html>'
>>>
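The three steps can be folded into one helper. The following is only a sketch under stated assumptions: it targets Python 3, the function names make_placeholder and tostring_with_raw_ampersands are hypothetical, and encoding="unicode" is passed so that etree.tostring returns a str instead of bytes.

```python
import random
import string

from lxml import etree


def make_placeholder(source, length=15):
    """Step 1: return a random token ending in "amp;" that is absent from source."""
    alphabet = string.ascii_lowercase + string.digits
    while True:
        token = "".join(random.choice(alphabet) for _ in range(length)) + "amp;"
        if token not in source:
            return token


def tostring_with_raw_ampersands(source):
    """Parse source as HTML and serialize it with "&" left unescaped."""
    placeholder = make_placeholder(source)
    # Step 2: hide every ampersand from the parser and the serializer.
    protected = source.replace("&", placeholder)
    tree = etree.fromstring(protected, etree.HTMLParser())
    serialized = etree.tostring(tree, encoding="unicode")
    # Step 3: put the ampersands back in the final text.
    return serialized.replace(placeholder, "&")


source = '<a href="https://www.example.com/?param1=value1&param2=value2">link</a>'
result = tostring_with_raw_ampersands(source)
print(result)
```

Because the restored text contains bare ampersands, it is best treated as final output rather than fed back into a strict XML parser.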

[UPDATE]:

The semicolon at the end of reserved_amp is a safety guard.

What if we generate a reserved_amp like this?

ampXampXampXampX + amp;

and the HTML contains:

yyYampX&

It will be encoded in this form:

yyYampXampXampXampXampXamp;

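However, thanks to the semicolon guard, this cannot be decoded to a wrong result such as yyY&ampX (the original being yyYampX&): the last character is not alphanumeric, so it can never be produced by the random part of reserved_amp, which is drawn from string.ascii_lowercase + string.digits above.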

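So make sure the random part never uses a semicolon (or whichever other non-alphanumeric character you pick as the guard), then append that character at the end (it MUST be the last character). With that in place, you do not need to worry about yyYampX& being restored incorrectly.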
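The effect of the guard can be checked directly with plain string operations. This sketch reuses the example values from above; the variable names are chosen only for illustration.

```python
source = "yyYampX&"

unguarded = "ampXampXampXampX"   # random-looking part only, no trailing guard
guarded = unguarded + "amp;"     # same token with the ";" guard appended

# Without the guard, the token overlaps with the text just before the "&",
# so the reverse replacement matches too early.
encoded_unguarded = source.replace("&", unguarded)
decoded_unguarded = encoded_unguarded.replace(unguarded, "&")

# With the guard, the token can only match where it ends in ";".
encoded_guarded = source.replace("&", guarded)
decoded_guarded = encoded_guarded.replace(guarded, "&")

print(decoded_unguarded)  # yyY&ampX  (wrong)
print(decoded_guarded)    # yyYampX&  (matches the source)
```

Note that neither token occurs in the source before encoding, so the membership check from step 1 alone would not catch the unguarded case.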

Answer 2

According to lxml's tostring() docs, you can pass method='xml' to avoid the HTML-specific behavior:

etree.tostring(tree, method='xml')

In my project I use:

from lxml import html
html.tostring(node, with_tail=False, method='xml', encoding='unicode')

lxml's etree.tostring escapes URLs in link href attributes
