Question

<html><table>...<table>...</table>...</table><table>...</table>...</html>

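For example, if I call soup.find_all('table') on the soup above, I get three tables. I want the search to stop digging deeper into the soup once it has found the first table instance, and instead look for the next table instance after the current one. In other words, it should return the first table (together with the table nested inside it) and the second table. What is the most efficient way to do this?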

Right now, I am handling it like this:

from copy import deepcopy
tables = soup.find_all('table')
reduced_tables = deepcopy(tables)
for table in tables:
    if list(filter(lambda x: table !=x and table in x, tables)) != []:
        reduced_tables.remove(table)

Answer 1

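Option 1: Use the recursive parameter of find_all to select only the direct children of a tag.

Option 2: Pass a lambda to find_all and filter with find_parent to select tags that are not descendants of a given tag.

Example: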

from bs4 import BeautifulSoup

html = '''
<html>
<table>table 1</table>
<div><table>table 2</table></div>
<table>table 3<table>table 4</table></table>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')

# Option 1: only <table> tags that are direct children of <html>
tables = soup.html.find_all('table', recursive=False)
print(tables)

# Option 2: every <table> that has no <table> ancestor
tables = soup.find_all(lambda tag: tag.name == 'table' and not tag.find_parent('table'))
print(tables)

Output:

[<table>table 1</table>, <table>table 3<table>table 4</table></table>]

[<table>table 1</table>, <table>table 2</table>, <table>table 3<table>table 4</table></table>]

The first option does not select table 2 because it is not a direct child of 'html', whereas the second option returns all 3 top-level tables.
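Since the question asks for a search that stops digging once it has found a match, here is a rough sketch of a third approach: walk the tree by hand and simply do not descend into a matching tag. The helper name find_topmost is made up for this illustration and is not part of Beautiful Soup; it relies only on Tag.children and Tag.name. It recurses once per nesting level, so it assumes the document is not extremely deep.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def find_topmost(root, name):
    """Yield tags called `name` under `root`, without looking inside a match.

    Hypothetical helper for illustration; not a Beautiful Soup API.
    """
    for child in root.children:
        if not isinstance(child, Tag):
            continue  # skip text nodes, comments, and so on
        if child.name == name:
            yield child  # report the match and do not descend into it
        else:
            yield from find_topmost(child, name)


html = '''
<html>
<table>table 1</table>
<div><table>table 2</table></div>
<table>table 3<table>table 4</table></table>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')
topmost = list(find_topmost(soup, 'table'))
print(topmost)
```

Unlike the find_parent filter, this never visits anything inside a matched table, which may help when the nested tables are large. Whether it is actually faster on your data is something you would have to measure.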

If you want to remove the nested table (table 4), use the decompose method, for example:

for table in tables:
    for tag in table.find_all('table'):
        tag.decompose()
print(tables)

[<table>table 1</table>, <table>table 2</table>, <table>table 3</table>]
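Keep in mind that decompose modifies the soup itself, so table 4 is gone from the original document afterwards. If you need the soup left untouched, one possible variation (a sketch, assuming a bs4 version where copy.copy on a Tag returns a detached copy of the tag and its contents) is to strip the nested tables from copies instead. The names top_level and stripped are only for this example.

```python
import copy

from bs4 import BeautifulSoup

html = ('<html><table>table 1</table>'
        '<table>table 3<table>table 4</table></table></html>')
soup = BeautifulSoup(html, 'html.parser')

# Top-level tables: every <table> without a <table> ancestor
top_level = soup.find_all(
    lambda tag: tag.name == 'table' and not tag.find_parent('table'))

stripped = []
for table in top_level:
    clone = copy.copy(table)  # detached copy; the soup is not touched
    for nested in clone.find_all('table'):
        nested.decompose()  # removes the nested table from the copy only
    stripped.append(clone)

print(stripped)
print(len(soup.find_all('table')))  # the original soup still has every table
```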

How do I find the top-most instances of a search in Beautiful Soup?
