Question

<div class="info">
       <h3> Height:
            <span>1.1</span>
       </h3>
</div>

<div class="info">
       <h3> Number:
            <span>111111111</span>
       </h3>
</div>

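This is part of a website's content. Ultimately, I want to extract 111111111. I know I can call soup.find_all("div", { "class" : "info" }) to get a list of both divs; however, I would rather not have to loop over them to check which one contains the text "Number".

Is there a more elegant way to extract "111111111", one that still uses soup.find_all("div", { "class" : "info" }) but also requires the div to contain "Number"?

I also tried numberSoup = soup.find('h3', text='Number'), but it returns None.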

Answer 1

You can write your own filter function and pass it as the argument to find_all.

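from bs4 import BeautifulSoup

def number_span(tag):
    # Match a <span> whose parent's first child is text containing "Number:"
    return tag.name == 'span' and 'Number:' in tag.parent.contents[0]

# html is assumed to hold the page source shown in the question
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all(number_span)  # list of matching <span> tags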

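By the way, the reason you cannot get the tag with the text parameter is this: the text parameter finds tags whose .string value equals the given value. If a tag contains more than one child, it is not clear what .string should refer to, so .string is defined to be None. Here the h3 contains both the text "Number:" and a span, so its .string is None and the search finds nothing.

You can refer to the Beautiful Soup documentation.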
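To make the .string behavior concrete, here is a minimal sketch. It assumes the markup from the question is stored in a variable named html (a name chosen here for illustration) and that the installed bs4 is version 4.4 or later, where find accepts the string argument. It shows that the h3 has no .string, and that searching for the text node itself and then walking to its parent is one possible workaround.

```python
import re
from bs4 import BeautifulSoup

# Assumption: this reproduces the snippet from the question.
html = """
<div class="info">
  <h3> Height:
    <span>1.1</span>
  </h3>
</div>
<div class="info">
  <h3> Number:
    <span>111111111</span>
  </h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The second <h3> has several children (text, <span>, whitespace),
# so its .string is None and find('h3', text='Number') cannot match it.
h3 = soup.find_all("h3")[1]
h3_string = h3.string

# Searching without a tag name returns the matching text node itself.
label = soup.find(string=re.compile("Number"))

# From the text node, step up to the <h3> and down to its <span>.
number = label.parent.span.get_text(strip=True)
print(h3_string, number)
```

The regular expression is used because the text node also carries surrounding whitespace, so an exact match on "Number" would not succeed.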

Answer 2

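Use XPath contains (this needs an XPath-capable parser such as lxml, since BeautifulSoup itself does not support XPath):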

root.xpath('//div/h3[contains(text(), "Number")]/span/text()')
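The answer does not say how root is built. A minimal sketch, assuming it comes from lxml.html.fromstring and that the question's markup is held in a variable named page (both are assumptions, not stated in the answer), might look like this:

```python
from lxml import html as lxml_html

# Assumption: this reproduces the snippet from the question.
page = """
<div class="info">
  <h3> Height:
    <span>1.1</span>
  </h3>
</div>
<div class="info">
  <h3> Number:
    <span>111111111</span>
  </h3>
</div>
"""

# Parse the fragment; lxml wraps multiple top-level elements in one root.
root = lxml_html.fromstring(page)

# Select the text of the <span> inside the <h3> whose text contains "Number".
numbers = root.xpath('//div/h3[contains(text(), "Number")]/span/text()')
print(numbers)
```

The result is a list of strings, so the first element holds the value when exactly one h3 matches.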

Python BeautifulSoup: find an element that contains text

2 answers: