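I am writing some HTML preprocessing scripts that clean up and tag HTML from a web crawler for use in a later semantic/link analysis step. I have already filtered out the unwanted tags, so the HTML is reduced to visible text plus <div> / <a> elements.
I am now trying to write a "collapseDOM()" function that walks the DOM tree and does the following:
(1) destroy leaf nodes that have no visible text
(2) collapse any <div> that (a) does not directly contain visible text and (b) has exactly one <div> child, by replacing it with that child
For example, given the following HTML as input: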
<html>
  <body>
    <div>
      <div>
        <a href="www.foo.com">not collapsed into empty parent: only divs</a>
      </div>
    </div>
    <div>
      <div>
        <div>
          inner div not collapsed because this contains text
          <div>some more text ...</div>
          but the outer nested divs do get collapsed
        </div>
      </div>
    </div>
    <div>
      <div>This won't be collapsed into parent because </div>
      <div>there are two children ...</div>
    </div>
  </body>
</html>
it should be converted into the following "collapsed" version:
<html>
  <body>
    <div>
      <a href="www.foo.com">not collapsed into empty parent: only divs</a>
    </div>
    <div>
      inner div not collapsed because this contains text
      <div>some more text ...</div>
      but the outer nested divs do get collapsed
    </div>
    <div>
      <div>This won't be collapsed into parent because </div>
      <div>there are two children ...</div>
    </div>
  </body>
</html>
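I have not been able to figure out how to do this. I tried writing a recursive tree-traversal function using BeautifulSoup's unwrap() and decompose() methods, but it modified the DOM while iterating over it, and I could not get it to work.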
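The symptom can be reproduced in miniature. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup (a substitution made so that it runs without third-party packages), and the four-child document is made up. It removes children inside a loop over the same parent, and some children are never visited because the child sequence shifts while the loop position keeps advancing.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document: a parent with four empty children.
parent = ET.fromstring("<body><div/><div/><div/><div/></body>")

visited = 0
for child in parent:        # iterating the live child sequence ...
    parent.remove(child)    # ... while shrinking it
    visited += 1

# Some children were never visited and are still attached to the parent.
remaining = len(parent)
print(visited, remaining)
```

Looping over a copy of the child sequence, for example list(parent), avoids the skipping, but the recursive case also has to cope with nodes being replaced by their children.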
Is there a simple way to do what I want? I am open to solutions in either BeautifulSoup or lxml. Thanks!
Answer 0 (score: 2)
You can start from this and adapt it to your needs:
from bs4 import BeautifulSoup, NavigableString

def stripTagWithNoText(soup):

    def visible(node):
        # Children that count: every tag, plus strings that are not pure whitespace.
        return [child for child in node.contents
                if not isinstance(child, NavigableString) or child.strip()]

    # find_all() returns a list-like snapshot, so the tree can be edited while
    # looping over it. Reversed document order handles children before parents.
    for tag in reversed(soup.find_all(['div', 'a'])):
        nodes = visible(tag)
        if not nodes:
            # (1) leaf with no visible text
            tag.decompose()
        elif (tag.name == 'div' and len(nodes) == 1
                and not isinstance(nodes[0], NavigableString)
                and nodes[0].name == 'div'):
            # (2) <div> whose only content is a single <div>: keep just the child
            tag.unwrap()
    print(soup)

# data is the input HTML string from the question
soup = BeautifulSoup(data, "lxml")
stripTagWithNoText(soup)
Output (whitespace tidied):
<html>
  <body>
    <div>
      <a href="www.foo.com">not collapsed into empty parent: only divs</a>
    </div>
    <div>
      inner div not collapsed because this contains text
      <div>some more text ...</div>
      but the outer nested divs do get collapsed
    </div>
    <div>
      <div>This won't be collapsed into parent because </div>
      <div>there are two children ...</div>
    </div>
  </body>
</html>
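For the lxml route that the question leaves open, the same two rules can be sketched over an element tree. This is an illustration under stated assumptions, not the method used above: it is written against the standard library's xml.etree.ElementTree (lxml.etree aims to be largely compatible with that element API, but the sketch has not been checked against lxml), the names has_direct_text and collapse are invented here, and the sample document is a shortened, well-formed stand-in for the question's HTML. Each recursive call returns the element that should replace its argument, so the parent is only edited from the outside, over a snapshot of its children.

```python
import xml.etree.ElementTree as ET

def has_direct_text(el):
    """True if el itself (not a descendant) holds non-whitespace text."""
    if el.text and el.text.strip():
        return True
    # Text that follows a child inside el is stored as that child's tail.
    return any(child.tail and child.tail.strip() for child in el)

def collapse(el):
    """Return the element that should stand in for el, or None to drop it."""
    for child in list(el):                  # snapshot; el is edited below
        repl = collapse(child)
        if repl is child:
            continue
        idx = list(el).index(child)
        el.remove(child)
        if repl is not None:
            repl.tail = child.tail          # keep text that followed the wrapper
            el.insert(idx, repl)
        elif child.tail and child.tail.strip():
            # Dropped node: hand its trailing text to whatever precedes it.
            if idx > 0:
                el[idx - 1].tail = (el[idx - 1].tail or "") + child.tail
            else:
                el.text = (el.text or "") + child.tail
    if len(el) == 0 and not (el.text and el.text.strip()):
        return None                         # rule (1): leaf with no visible text
    if (el.tag == "div" and len(el) == 1 and el[0].tag == "div"
            and not has_direct_text(el)):
        return el[0]                        # rule (2): wrapper around one <div>
    return el

# Shortened stand-in for the question's input, plus one empty nested <div>.
doc = ET.fromstring(
    "<body>"
    "<div><div><a href='www.foo.com'>link</a></div></div>"
    "<div><div><div>text<div>more</div>tail</div></div></div>"
    "<div><div>one</div><div>two</div></div>"
    "<div><div></div></div>"
    "</body>"
)
result = ET.tostring(collapse(doc), encoding="unicode")
print(result)
```

Crawled HTML is rarely well-formed XML, so an HTML-aware parser would be needed in place of ET.fromstring before a function like this could be applied to real pages.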