Using .find() with BS4 to extract the second of two identical 'div's from an HTML page

Time: 2019-07-07 04:25:05

Tags: python web-scraping beautifulsoup

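I'm trying to extract the second of two identical 'div's from a soup element. When I parse it and extract with the .find() method, I get the first one. How can I tell the script to skip the first match and take the next one when a certain condition is met? Below is the HTML I'm extracting from.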

<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>

Here is the code I'm trying:

if '$' not in str(product.find('div', {'class': 'a-row a-size-base a-color-secondary'})):
    print('NOT IN')
    pass
    price = product.find('div', {'class': 'a-row a-size-base a-color-secondary'})
    print(price)
else:
    price = product.find('div', {'class': 'a-row a-size-base a-color-secondary'})
    print(price)

But the result still gives me this:

NOT IN
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>

Instead of this:

<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div> 

Any suggestions?

2 Answers:

Answer 0: (score: 1)

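You need find_all and then index into the returned list, because find only returns the first match. You can use select to do the same thing. With bs4 4.7.1, you can use :contains to target the element by a substring of its innerText (for example, CONtv trial), then use select_one if you want the first match or select if you want multiple matches. You should test for None before trying to access .text.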

from bs4 import BeautifulSoup as bs

html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
'''
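soup = bs(html, 'lxml')
# find_all returns every match, so index 1 is the second div
print(soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})[1].text)
# select does the same with a CSS selector
print(soup.select('.a-color-secondary')[1].text)
# :contains filters by text; newer Soup Sieve releases name it :-soup-contains
print(soup.select_one('.a-color-secondary:contains("CONtv trial")').text)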
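The advice about testing for None before accessing .text could look roughly like the sketch below. The helper name text_or_default, the trimmed-down HTML, the 'N/A' fallback and the .no-such-class selector are all made up for illustration, and html.parser is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Simplified, well-formed version of the markup from the question
html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')


def text_or_default(tag, default=''):
    """Hypothetical helper: return the tag's text, or `default` if the lookup found nothing."""
    return tag.get_text(strip=True) if tag is not None else default


matches = soup.find_all('div', class_='a-row a-size-base a-color-secondary')
# Guard the index as well: fewer than two matches would raise IndexError
second = matches[1] if len(matches) > 1 else None
# select_one returns None when nothing matches
missing = soup.select_one('.no-such-class')

second_text = text_or_default(second)
missing_text = text_or_default(missing, default='N/A')
print(second_text)
print(missing_text)
```

Without the guard, calling .text on the None returned by find or select_one raises AttributeError.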

Using a find_all loop:

matches = soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})
for item in matches:
    if '$' in str(item):
        print(item.text)
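If only the first div that satisfies the condition is wanted, the loop can be collapsed into a single lookup. This is a sketch under the same assumptions as the question (the price div is the first one whose text contains '$'); the function name first_price_div and the simplified HTML are illustrative, not part of the original code.

```python
from bs4 import BeautifulSoup

# Simplified, well-formed version of the markup from the question
html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')


def first_price_div(container):
    """Hypothetical helper: first matching div whose text contains '$', else None."""
    candidates = container.find_all('div', class_='a-row a-size-base a-color-secondary')
    # next() stops at the first hit and falls back to None if there is none
    return next((div for div in candidates if '$' in div.get_text()), None)


price = first_price_div(soup)
price_text = price.get_text(strip=True) if price is not None else None
print(price_text)
```

Checking div.get_text() rather than str(div) means a '$' that happens to appear in an attribute value does not count as a match.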

Answer 1: (score: 1)

Assuming the divs sit directly under <body>, you can use standard Python indexing. In your real code, replace body in the selector with the appropriate element:

data = '''<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

print(soup.select('body > div')[1].text.strip())

Prints:

$0.00 with a CONtv trial on Prime Video Channels

Note the > sign in select(): it means we want only the <div> elements that are directly under <body>.
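To make the difference concrete, here is a small sketch comparing the child combinator (>) with the plain descendant combinator (a space). The HTML and the id values are invented for this illustration, and html.parser is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup: two divs directly under <body>, one nested deeper
html = '''<html><body>
<div id="first">one</div>
<div id="second"><div id="nested">two</div></div>
</body></html>'''
soup = BeautifulSoup(html, 'html.parser')

# 'body > div' matches only divs whose parent is <body>
direct_ids = [div['id'] for div in soup.select('body > div')]
# 'body div' matches divs at any depth below <body>
all_ids = [div['id'] for div in soup.select('body div')]

print(direct_ids)  # ['first', 'second']
print(all_ids)     # ['first', 'second', 'nested']
```

With the descendant form, index [1] still points at the same element here, but any extra nested divs earlier in the page would shift the indices, which is why the child combinator is the safer choice for positional indexing.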