BeautifulSoup4: find all non-nested matches

Posted: 2019-01-18 17:49:19

Tags: python python-3.x web-scraping recursive-datastructures

I cannot find a simple way to search for all the outermost elements in an HTML document that match my query. I expected there to be a simple bs4 function for this, but apparently there is not.

Consider the following HTML example, in which I want all the outermost <div> elements with class "wanted" (I expect a list of 2):

import bs4

text = """
<div>
    <div class="inner">
        <div class="wanted">
            I want this.
            <div class="wanted">
                I don't want that!
            </div>
        </div>
    </div>
    <div class="inner">
        <div class="wanted">
            I want this too.
        </div>
    </div>
</div>"""

soup = bs4.BeautifulSoup(text, 'lxml')

# 1. Trying all at once
fetched = soup.findAll('div', class_='wanted')
print(len(fetched))  # 3

fetched = soup.findAll('div', class_='wanted', recursive=False)
print(len(fetched))  # 0

fetched = soup.findChildren('div', class_='wanted')
print(len(fetched))  # 3

fetched = soup.findChildren('div', class_='wanted', recursive=False)
print(len(fetched))  # 0
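
# Side note on the zeros above (my own reading of the bs4 docs, so treat it
# as an assumption): recursive=False restricts the search to the *direct*
# children of the object it is called on, and the soup object's only direct
# child here is the top-level wrapper element, which is not a div.wanted.
# findChildren also appears to be just a legacy alias of find_all/findAll,
# which is why the two pairs of results match.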


# 2. Trying one after the other
fetched = []
fetched0 = soup.find('div', class_='wanted')

while fetched0:
    fetched.append(fetched0)
    # Jump to the last descendant of the current match so that findNext
    # skips everything nested inside it and only non-nested hits remain.
    descendants = list(fetched0.descendants)
    fetched0 = descendants[-1].findNext('div', class_='wanted')

print(len(fetched))  # 2  Hurra!

# 3. Destructive method: if you don't care about the parents of this element
fetched = []
fetched0 = soup.find('div', class_='wanted')
while fetched0:
    fetched.append(fetched0.extract())
    fetched0 = soup.find('div', class_='wanted')
print(len(fetched))

So nothing in part # 1. gives the expected result. By the way, what is the difference between findAll and findChildren? And findNextSibling has nothing to do with nesting here.

Now, part # 2. does work, but why does it take so much code? Is there a more elegant solution? As for part # 3., I guess I would have to be careful about its side effects.

What would you suggest for this search? Have I really found the shortest way to do it? Is there some CSS selector magic I could use?

2 Answers:

Answer 0 (score: 1):

You can pass a function as an argument to find_all, in addition to the other arguments. Inside it, check with find_parents() that the tag has no ancestor div with the same class. Use find_parents() because it checks all ancestors, not just the immediate parent, so you only get the outermost 'wanted' divs.

from bs4 import BeautifulSoup

def top_most_wanted(tag):
    # Reject any tag that already has a div.wanted among its ancestors,
    # so only the outermost "wanted" divs pass the filter.
    parents_same_class = tag.find_parents("div", class_="wanted")
    if len(parents_same_class) > 0:
        return False
    return True

# `text` is the HTML snippet from the question
soup = BeautifulSoup(text, 'html.parser')
print(soup.find_all(top_most_wanted, 'div', class_="wanted"))
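
A quick sanity check against the question's sample text (my own addition, not part of the original answer) should report only the two outermost divs:

outer = soup.find_all(top_most_wanted, 'div', class_="wanted")
print(len(outer))  # expected: 2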

Answer 1 (score: 0):

I finally went with the following, which has the advantage of being non-destructive. I have not had time to benchmark it, but I hope it avoids walking through every nested element as in @Bitto-Bennichan's answer, although that is really not certain. In any case, it does what I wanted:

all_fetched = []
fetched = soup.find('div', class_='wanted')

while fetched is not None:
    all_fetched.append(fetched)
    try:
        # Jump to the last descendant so that findNext skips everything
        # nested inside the current match.
        last = list(fetched.descendants)[-1]
    except IndexError:
        # The match has no descendants: nothing left to jump past, stop here.
        break
    fetched = last.findNext('div', class_='wanted')
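
As for the CSS selector magic I asked about: if your BeautifulSoup is 4.7+, select() is backed by soupsieve, whose level-4 :not() accepts complex selectors, so the one-liner below might do the same job. This is only a sketch under that assumption, and I have not checked it against older versions:

import bs4

soup = bs4.BeautifulSoup(text, 'lxml')
# Keep div.wanted elements that are not descendants of another div.wanted.
outermost = soup.select('div.wanted:not(div.wanted div.wanted)')
print(len(outermost))  # expected: 2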