Question

Input provided as text

# Python 2.7
>>> import bs4
>>> bs4.BeautifulSoup("H&W Insurance")
<html><body><p>H&amp;W Insurance</p></body></html>

# Python 3.5.2
>>> import lxml.html
>>> h = lxml.html.fromstring("H&W Insurance")
>>> lxml.html.tostring(h)
b'<p>H&amp;W Insurance</p>'

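BeautifulSoup and lxml both escape my input correctly. But how can they guess that my input is text rather than HTML? Is there a standard algorithm for this? It doesn't seem trivial to me.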
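To make the "text or HTML?" question concrete, here is a minimal heuristic sketch using only the standard library's `html.parser`. It is not a standard algorithm and not what BeautifulSoup or lxml do internally. The names `TagCounter` and `looks_like_html` are hypothetical, and the rule "contains at least one start tag" is an assumption chosen for illustration.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: counts the start tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1


def looks_like_html(s):
    """Assumed heuristic: treat the input as HTML if it has any start tag."""
    parser = TagCounter()
    parser.feed(s)
    parser.close()
    return parser.tags > 0


print(looks_like_html("H&W Insurance"))
print(looks_like_html("<html>H&W Insurance</html>"))
```

A heuristic like this can only guess. Plain text that happens to contain something tag-shaped would be classified as HTML, so the caller generally has to know which kind of input it holds.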

Input provided as HTML

# Python 2.7
>>> bs4.BeautifulSoup("<html>H&W Insurance<html>")
<html><body><p>H&amp;W Insurance</p></body></html>

# Python 3.5.2
>>> h = lxml.html.fromstring("<html>H&W Insurance</html>")
>>> lxml.html.tostring(h)
b'<html><body><p>H&amp;W Insurance</p></body></html>'

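Why is & converted to &amp;amp;? Is it because the & in the input might be a literal character rather than an HTML entity reference, or because BeautifulSoup auto-corrects it, since &W has no meaning in HTML and so it must mean &amp;amp;?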
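The round trip between a literal `&` and `&amp;` can be illustrated with the standard library's `html` module. This is only a sketch of the escaping rules, not a description of BeautifulSoup or lxml internals. The variable names are made up for the example.

```python
import html

text = "H&W Insurance"

# Escaping text for use in HTML turns the literal ampersand into an entity.
escaped = html.escape(text)
print(escaped)

# Unescaping the entity gives the original text back.
round_trip = html.unescape(escaped)
print(round_trip == text)

# "&W" is not a known character reference, so unescaping leaves it alone;
# the ampersand is treated as a literal character.
unchanged = html.unescape(text)
print(unchanged == text)
```

Under this reading, a serializer that writes `&amp;` is just encoding a literal ampersand it found in the text, whether the input was plain text or HTML containing a bare `&`.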

Guessing HTML vs. plain text, and entity reference characters in HTML

0 answers: