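Identifying the branches that differ in tag structure

Time: 2015-06-22 16:56:19

Tags: python parsing dom beautifulsoup lxml

I want to check whether two HTML documents differ in their tag structure, ignoring the text, and pick out the branches that differ.

For example: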

html_1 = """
<p>i love it</p>
"""
html_2 = """ 
<p>i love it really</p>
"""

These two share the same tag structure, so they should be treated as identical. However:

html_1 = """
<div>
<p>i love it</p>
</div>
<p>i love it</p>
"""
html_2 = """ 
<div>
<p>i <em>love</em> it</p>
</div>
<p>i love it</p>
"""

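Here I want it to return the <div> branch, because the tag structure inside it differs. Can lxml, BeautifulSoup, or some other library do this? I am looking for a way to actually pick out the differing branches.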

Thanks.

2 answers:

Answer 0: (score: 1)

A more reliable approach is to build a tree of the tag names in the document.

Here is a sample working solution based on treelib.Tree:

from bs4 import BeautifulSoup
from treelib import Tree


def traverse(element, tree, parent_id=None):
    # Add this tag under its parent. treelib generates a unique identifier,
    # so repeated tag names (e.g. several <p>) do not collide.
    node = tree.create_node(element.name, parent=parent_id)

    # Recurse into direct child tags only; text nodes are skipped.
    for child in element.find_all(recursive=False):
        traverse(child, tree, node.identifier)


def compare(html1, html2):
    tree1 = Tree()
    traverse(BeautifulSoup(html1, "html.parser"), tree1)
    tree2 = Tree()
    traverse(BeautifulSoup(html2, "html.parser"), tree2)

    return tree1.to_json() == tree2.to_json()

print(compare("<p>i love it</p>", "<p>i love it really</p>"))
print(compare("<p>i love it</p>", "<p>i <em>love</em> it</p>"))

Prints:

True
False
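
The comparison above only reports whether the two structures match. To pick out the differing branches, as the question asks, one possible extension is sketched below. It is not part of the original answer: it uses only the standard library's html.parser instead of treelib, and the names TagTreeBuilder, build_tree, differing_branches and VOID_TAGS are hypothetical. It pairs top-level branches by position, which is a simplification: an inserted or removed branch shifts every branch after it.

```python
from html.parser import HTMLParser
from itertools import zip_longest

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagTreeBuilder(HTMLParser):
    """Build a tree of [tag, children] lists, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = ["[document]", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def build_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def differing_branches(html1, html2):
    """Return (index, branch1, branch2) for each top-level branch that differs."""
    kids1 = build_tree(html1)[1]
    kids2 = build_tree(html2)[1]
    return [
        (index, b1, b2)
        for index, (b1, b2) in enumerate(zip_longest(kids1, kids2))
        if b1 != b2
    ]


html_1 = "<div>\n<p>i love it</p>\n</div>\n<p>i love it</p>"
html_2 = "<div>\n<p>i <em>love</em> it</p>\n</div>\n<p>i love it</p>"

for index, branch1, branch2 in differing_branches(html_1, html_2):
    print(index, branch1[0])  # prints: 0 div
```

Each returned branch is a nested [tag, children] list, so the caller can inspect exactly where inside the branch the structure diverges.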

Answer 1: (score: 0)

Sample code to check whether the tag structure of two HTML contents is the same.

Demo:

# Assumption: PARSER is not defined in the original snippet; lxml.html is assumed here.
import lxml.html as PARSER


def getTagSequence(content):
    """
    Return the tags of all elements in document order.
    """
    root = PARSER.fromstring(content)
    tag_sequence = []
    # iter() replaces the deprecated getiterator()
    for elm in root.iter():
        tag_sequence.append(elm.tag)
    return tag_sequence

html_1_tags = getTagSequence(html_1)
html_2_tags = getTagSequence(html_2)

if html_1_tags == html_2_tags:
    print("Tagging structure is same.")
else:
    print("Tagging structure is different.")
    print("HTML 1 Tagging:", html_1_tags)
    print("HTML 2 Tagging:", html_2_tags)
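
When the two sequences differ, printing both lists leaves the reader to spot the mismatch by eye. As a rough sketch (not part of the original answer), the standard library's difflib can report which tags were added or removed. The function name tag_sequence_diff is hypothetical, and the two lists below are illustrative stand-ins for the output of getTagSequence.

```python
from difflib import SequenceMatcher


def tag_sequence_diff(tags_1, tags_2):
    """Return (operation, tags only in tags_1, tags only in tags_2) per mismatch."""
    matcher = SequenceMatcher(None, tags_1, tags_2, autojunk=False)
    return [
        (op, tags_1[i1:i2], tags_2[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Illustrative tag sequences (hypothetical; real ones would come from getTagSequence).
html_1_tags = ["div", "p", "p"]
html_2_tags = ["div", "p", "em", "p"]

print(tag_sequence_diff(html_1_tags, html_2_tags))
# [('insert', [], ['em'])]
```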

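Note:

The code above only checks the tag sequence; it does not check parent-child relationships. For example, the following two documents produce the same sequence (p, span, em) even though the <em> is nested differently: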

html_1 = """ <p> This <span>is <em>p</em></span> tag</p>"""
html_2 = """ <p> This <span>is </span><em>p</em> tag</p>"""
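
One way to close that gap is to compare a nested signature instead of a flat list. The sketch below is an illustration under stated assumptions, not the answer's code: it uses the standard library's xml.etree.ElementTree, which only accepts well-formed markup, and the function name tag_signature is hypothetical. For real-world HTML, an lxml.html element could be passed in instead, since it also exposes .tag and iteration over its children.

```python
import xml.etree.ElementTree as ET


def tag_signature(element):
    """Return a nested tuple (tag, child signatures) that encodes the nesting."""
    return (element.tag, tuple(tag_signature(child) for child in element))


html_1 = """ <p> This <span>is <em>p</em></span> tag</p>"""
html_2 = """ <p> This <span>is </span><em>p</em> tag</p>"""

root_1 = ET.fromstring(html_1.strip())
root_2 = ET.fromstring(html_2.strip())

# The flat tag sequences are identical ...
flat_1 = [elm.tag for elm in root_1.iter()]
flat_2 = [elm.tag for elm in root_2.iter()]
print(flat_1 == flat_2)  # True

# ... but the nested signatures are not.
sig_1 = tag_signature(root_1)
sig_2 = tag_signature(root_2)
print(sig_1 == sig_2)  # False
```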