Question

I want to scrape some URLs whose pages each contain two divs with the same class="description".

The source of an example URL looks like this:

<!-- Initial HTML here -->

<div class="description">
<h4> Anonymous Title </h4>
<div class="product-description">
<li> Some stuff here </li>
</div>
</div>

<!-- Middle HTML here -->

<div class="description">
Some text here
</div>

<!-- Last HTML here -->

I am scraping it with BeautifulSoup using the following script:

# imports etc here
description_box = soup.find('div', attrs={'class': 'description'})
description = description_box.text.strip()
print(description)

Running it gives me only the first div with class="description", but I want only the second div with class="description".

Any ideas on how to ignore the first div and scrape only the second?

P.S. The first div always has an h4 tag, while the second div has only plain text between its tags.

Answer 1

If you use .find_all, it returns all matches in a list. Then simply select the second item in that list with index 1.

from bs4 import BeautifulSoup

html = '''<!-- Initial HTML here -->

<div class="description">
<h4> Anonymous Title </h4>
<div class="product-description">
<li> Some stuff here </li>
</div>
</div>

<!-- Middle HTML here -->

<div class="description">
Some text here
</div>

<!-- Last HTML here -->'''

soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', {'class':'description'})
div = divs[1]

Output:

print(div)
<div class="description">
Some text here
</div>
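
As a hedged variation on the indexing approach, the question's P.S. (the first div always contains an h4, the second only plain text) suggests filtering by content instead of position. The sketch below assumes that hint holds for every page. The helper name plain_descriptions is hypothetical and the HTML is a trimmed copy of the question's sample.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample HTML from the question.
html = '''
<div class="description">
<h4> Anonymous Title </h4>
<div class="product-description">
<li> Some stuff here </li>
</div>
</div>
<div class="description">
Some text here
</div>
'''


def plain_descriptions(soup):
    """Return the text of every div.description that has no <h4> inside it.

    Hypothetical helper: it relies on the question's statement that the
    unwanted div always contains an <h4> tag.
    """
    return [
        div.get_text(strip=True)
        for div in soup.find_all('div', class_='description')
        if div.find('h4') is None
    ]


soup = BeautifulSoup(html, 'html.parser')
texts = plain_descriptions(soup)
print(texts)
```

This does not depend on the order of the two divs, but it would also return any other div.description without an h4, so it is only as reliable as the P.S. assumption.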

Answer 2

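Use a CSS selector, since it offers the nth-of-type pseudo-class for selecting the nth element of a given type. The syntax is also more concise. Note that nth-of-type counts sibling elements by tag name, not by class, so this selects the target only when it is the second div among its siblings.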

description_box = soup.select("div.description:nth-of-type(2)")[0]
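
To illustrate how nth-of-type counts elements, here is a small sketch that assumes a BeautifulSoup version with CSS selector support for pseudo-classes (4.7 or later, which uses the soupsieve package). The two HTML snippets and the "banner" div are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Case 1: the two description divs are the only div siblings.
siblings_only = '''<body>
<div class="description"><h4>Anonymous Title</h4></div>
<div class="description">Some text here</div>
</body>'''

# Case 2: a hypothetical unrelated div sits between them.
with_extra_div = '''<body>
<div class="description"><h4>Anonymous Title</h4></div>
<div class="banner">Unrelated block</div>
<div class="description">Some text here</div>
</body>'''

selector = 'div.description:nth-of-type(2)'

matches_plain = BeautifulSoup(siblings_only, 'html.parser').select(selector)
matches_extra = BeautifulSoup(with_extra_div, 'html.parser').select(selector)

# nth-of-type(2) means "second div among its siblings"; the class filter
# is applied to that element afterwards.
print(len(matches_plain))
print(len(matches_extra))
```

In the second case the second div sibling is the banner, which lacks the description class, so the selector matches nothing and indexing with [0] would raise an IndexError.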

Answer 3

You can combine a type selector with a class selector in CSS and index into the returned list.

print(soup.select('div.description')[1].text)
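
Since indexing with [1] raises an IndexError when a page has fewer than two matching divs, a guarded version may be useful when looping over many URLs. This is a minimal sketch. The function name second_description is hypothetical, and returning None for a missing div is an assumed convention.

```python
from bs4 import BeautifulSoup


def second_description(html):
    """Return the stripped text of the second div.description, or None.

    Hypothetical helper: None signals that the page did not contain
    two matching divs.
    """
    soup = BeautifulSoup(html, 'html.parser')
    matches = soup.select('div.description')
    if len(matches) < 2:
        return None
    return matches[1].get_text(strip=True)


two_divs = '''
<div class="description"><h4> Anonymous Title </h4></div>
<div class="description">
Some text here
</div>
'''
one_div = '<div class="description"><h4> Anonymous Title </h4></div>'

print(second_description(two_divs))
print(second_description(one_div))
```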

Ignoring the first of two divs with the same class in BeautifulSoup
