I want to check whether two HTML documents differ in their tags, ignoring the text, and pick out the branches that differ.
For example:
html_1 = """
<p>i love it</p>
"""
html_2 = """
<p>i love it really</p>
"""
They share the same tag structure, so they are considered the same. However:
html_1 = """
<div>
<p>i love it</p>
</div>
<p>i love it</p>
"""
html_2 = """
<div>
<p>i <em>love</em> it</p>
</div>
<p>i love it</p>
"""
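I want it to return the <div> branch, because the tag structure differs. Can lxml, BeautifulSoup, or some other library do this? I am trying to find a way to actually pick out the differing branches.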
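Answer 0 (score: 1)
A more reliable approach is to build a tree of the tag names in each document and compare the trees, as described below.
Here is a sample working solution based on treelib.Tree: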
from bs4 import BeautifulSoup
from treelib import Tree


def traverse(node, tree, parent_id=None):
    # Use id(node) as the identifier: tag names repeat within a document,
    # and treelib requires node identifiers to be unique.
    node_id = id(node)
    tree.create_node(node.name, node_id, parent=parent_id)
    # Visit direct child tags only; text nodes are skipped.
    for child in node.find_all(recursive=False):
        traverse(child, tree, node_id)


def compare(html1, html2):
    tree1 = Tree()
    traverse(BeautifulSoup(html1, "html.parser"), tree1)
    tree2 = Tree()
    traverse(BeautifulSoup(html2, "html.parser"), tree2)
    # Only tag names were added to the trees, so text never enters the comparison.
    return tree1.to_json() == tree2.to_json()


print(compare("<p>i love it</p>", "<p>i love it really</p>"))
print(compare("<p>i love it</p>", "<p>i <em>love</em> it</p>"))
Prints:
True
False
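The comparison above answers only "same or different". To get closer to what the question asks for (picking out the branch that differs), here is a minimal sketch. It swaps in the standard library's html.parser instead of BeautifulSoup and treelib so that it runs without extra dependencies. The names VOID_TAGS, TagTreeBuilder, tag_tree and differing_branches are made up for illustration, and top-level branches are compared by position only, so an inserted or removed branch would shift everything after it.

```python
from html.parser import HTMLParser
from itertools import zip_longest

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagTreeBuilder(HTMLParser):
    """Builds a nested [tag, children] tree of tag names, ignoring text."""

    def __init__(self):
        super().__init__()
        self.root = ["[document]", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def differing_branches(html1, html2):
    """Return (index, tag1, tag2) for each top-level branch whose tag structure differs."""
    tops1 = tag_tree(html1)[1]
    tops2 = tag_tree(html2)[1]
    diffs = []
    for index, (a, b) in enumerate(zip_longest(tops1, tops2)):
        if a != b:  # nested lists compare structurally
            diffs.append((index, a[0] if a else None, b[0] if b else None))
    return diffs


html_1 = "<div>\n<p>i love it</p>\n</div>\n<p>i love it</p>"
html_2 = "<div>\n<p>i <em>love</em> it</p>\n</div>\n<p>i love it</p>"
print(differing_branches(html_1, html_2))  # [(0, 'div', 'div')]
```

Here the first top-level branch (the div) is reported and the trailing p is not, because only the div subtree gained an em element.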
Answer 1 (score: 0)
Sample code to check whether the tag structure of two HTML documents is the same.
Demo:
import lxml.html as PARSER  # assumed: the snippet never defines PARSER


def getTagSequence(content):
    """
    Return the tag names of all elements, in document order.
    """
    root = PARSER.fromstring(content)
    tag_sequence = []
    # iter() walks the element tree in document order.
    for elm in root.iter():
        tag_sequence.append(elm.tag)
    return tag_sequence


# html_1 and html_2 are the two HTML strings being compared.
html_1_tags = getTagSequence(html_1)
html_2_tags = getTagSequence(html_2)
if html_1_tags == html_2_tags:
    print("Tagging structure is same.")
else:
    print("Tagging structure is different.")
    print("HTML 1 Tagging:", html_1_tags)
    print("HTML 2 Tagging:", html_2_tags)
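Note:
The code above only checks the tag sequence; it does not check parent-child relationships. For example, the following two snippets yield the same sequence (p, span, em) even though their structures differ: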
html_1 = """ <p> This <span>is <em>p</em></span> tag</p>"""
html_2 = """ <p> This <span>is </span><em>p</em> tag</p>"""
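One way around this limitation is to record each tag together with its nesting depth, since a document-order list of (depth, tag) pairs determines the tree shape. The sketch below is only an illustration: it uses the standard library's html.parser rather than lxml, the names VOID_TAGS, DepthTagCollector and tag_depth_sequence are invented here, and it assumes the input is properly nested (a stray closing tag would throw the depth counter off).

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DepthTagCollector(HTMLParser):
    """Records (depth, tag) for every start tag; text is ignored."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        self.sequence.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-nested input: every non-void end tag closes one level.
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def tag_depth_sequence(content):
    collector = DepthTagCollector()
    collector.feed(content)
    collector.close()
    return collector.sequence


html_1 = """ <p> This <span>is <em>p</em></span> tag</p>"""
html_2 = """ <p> This <span>is </span><em>p</em> tag</p>"""
print(tag_depth_sequence(html_1))  # [(0, 'p'), (1, 'span'), (2, 'em')]
print(tag_depth_sequence(html_2))  # [(0, 'p'), (1, 'span'), (1, 'em')]
```

The two sequences now differ at the em element, which sits at depth 2 in the first snippet and depth 1 in the second.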