Question

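I am writing some HTML preprocessing scripts that clean and tag HTML from a web crawler for use in a later semantic/link analysis step. I have filtered the unwanted tags out of the HTML and reduced it to only visible text and <div> / <a> elements.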

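I am now trying to write a "collapseDOM()" function that walks the DOM tree and does the following:

(1) destroy leaf nodes that have no visible text

(2) collapse any <div> that (a) does not directly contain visible text and (b) has exactly one <div> child, replacing it with that child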

For example, given the following HTML as input:

<html>
<body>
    <div>
        <div>
             <a href="www.foo.com">not collapsed into empty parent: only divs</a>
        </div>
    </div>

    <div>
        <div>
            <div>
                inner div not collapsed because this contains text 
                <div>some more text ...</div>
                but the outer nested divs do get collapsed
            </div>
        </div>
    </div>

    <div>
        <div>This won't be collapsed into parent because </div>
        <div>there are two children ...</div>
    </div>

</body>
</html>

it should be transformed into the following "collapsed" version:

<html>
<body>
    <div>
         <a href="www.foo.com">not collapsed into empty parent: only divs</a>
    </div>

    <div>
        inner div not collapsed because this contains text 
        <div>some more text ...</div>
        but the outer nested divs do get collapsed
    </div>


    <div>
        <div>This won't be collapsed into parent because </div>
        <div>there are two children ...</div>
    </div>

</body>
</html>

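I have not been able to work out how to do this. I tried writing a recursive tree-traversal function using BeautifulSoup's unwrap() and decompose() methods, but it modified the DOM while iterating over it, and I could not figure out how to make that work.

Is there a simple way to do what I want? I am open to solutions in either BeautifulSoup or lxml. Thanks!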

Answer 1

You can start from this and adapt it to your needs:

import re

from bs4 import BeautifulSoup, NavigableString


def stripTagWithNoText(soup):

    def significant(nodes):
        # Keep tags, plus text nodes that are non-empty once whitespace
        # (and '+', as in the original pattern) has been removed.
        return [n for n in nodes
                if not isinstance(n, NavigableString)
                or len(re.sub(r'[\s+]', '', n)) > 0]

    def remove(node):
        for index, item in enumerate(node.contents):
            if not isinstance(item, NavigableString):
                currentNodes = significant(item.contents)
                parentNodes = significant(item.parent.contents)

                # Only look at a tag nested in a same-named parent
                # when the tag has a single significant child.
                if len(currentNodes) == 1 and item.name == item.parent.name:
                    if len(parentNodes) > 1:
                        continue
                    if item.name == currentNodes[0].name:
                        item.unwrap()  # replaceWithChildren() is the older name
                    node.unwrap()

    for tag in soup.find_all():
        remove(tag)
    print(soup)


# data holds the HTML string from the question
soup = BeautifulSoup(data, "lxml")
stripTagWithNoText(soup)

Output:

<html> <body> <div> <a href="www.foo.com">not collapsed into empty parent: only divs</a> </div> <div> inner div not collapsed because this contains text <div>some more text ...</div> but the outer nested divs do get collapsed </div> <div> <div>This won't be collapsed into parent because </div> <div>there are two children ...</div> </div> </body> </html>
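
For comparison, here is a minimal sketch of rules (1) and (2) from the question written as a post-order traversal. It is not the code above, only one possible way to do it. The names collapse_dom and has_direct_text are made up for illustration, and the sketch assumes comments and other non-visible nodes have already been stripped, as the question describes. Iterating over a copy of the child list, list(node.contents), is what avoids the "modifying the DOM while iterating" problem.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def has_direct_text(tag):
    """True if a direct child of `tag` is a non-whitespace text node."""
    return any(
        isinstance(child, NavigableString) and child.strip()
        for child in tag.contents
    )


def collapse_dom(node):
    """Apply rules (1) and (2) to every descendant of `node` (hypothetical helper)."""
    # Iterate over a snapshot so edits to the tree cannot disturb the loop.
    for child in list(node.contents):
        if not isinstance(child, Tag):
            continue
        # Post-order: finish the subtree before deciding about `child`.
        collapse_dom(child)
        if not child.get_text(strip=True):
            # Rule 1: nothing visible is left in this subtree, so drop it.
            child.decompose()
        elif child.name == "div" and not has_direct_text(child):
            element_children = [c for c in child.contents if isinstance(c, Tag)]
            if len(element_children) == 1 and element_children[0].name == "div":
                # Rule 2: a wrapper div with a single div child is
                # replaced by that child.
                child.replace_with(element_children[0])


# Usage on a small, made-up input; html.parser avoids the lxml dependency.
soup = BeautifulSoup(
    "<body><div><div><div>text <div>more</div></div></div></div></body>",
    "html.parser",
)
collapse_dom(soup.body)
print(soup.body)
```

Calling it on soup.body rather than on the soup itself keeps <html> and <body> out of rule (1). The whitespace text nodes of a collapsed wrapper are discarded along with it.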

How do I merge/collapse child DOM nodes into their parent with BeautifulSoup / lxml?
