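I'm trying to extract the second of two identical "div" elements from a BeautifulSoup element. When I parse the page and extract with the .find() method, I always get the first one. How can I tell the script to skip the first match and take the next one when a certain condition is met? Below is the HTML I'm trying to extract from.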
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
Here is the code I'm trying:
if '$' not in str(product.find('div', {'class': 'a-row a-size-base a-color-secondary'})):
    print('NOT IN')
    pass
    price = product.find('div', {'class': 'a-row a-size-base a-color-secondary'})
    print(price)
else:
    price = product.find('div', {'class': 'a-row a-size-base a-color-secondary'})
    print(price)
But the result still gives me this:
NOT IN
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
instead of this:
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
Any suggestions?
Answer 0 (score: 1)
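You need find_all and then index into the returned list, because find only returns the first match. You can do the same thing with select. With bs4 4.7.1 you can use :contains to target the element by a substring of its innerText (e.g. CONtv trial), then use select_one if you want the first match or select if you want multiple matches. You will want to test for None before trying to access .text.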
from bs4 import BeautifulSoup as bs
html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
'''
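soup = bs(html, 'lxml')
# find_all returns every match; index 1 is the second div
print(soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})[1].text)
# the same lookup written as a CSS selector
print(soup.select('.a-color-secondary')[1].text)
# match on the text content instead of the position
print(soup.select_one('.a-color-secondary:contains("CONtv trial")').text)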
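The None test mentioned above matters because find and select_one return None when nothing matches, and calling .text on None raises AttributeError. A minimal sketch of that guard, assuming the standard-library 'html.parser' backend instead of lxml and a trimmed copy of the markup; text_or_default and the .no-such-class selector are hypothetical names used only for illustration:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, used here only for illustration.
html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
'''

soup = BeautifulSoup(html, 'html.parser')


def text_or_default(node, default=''):
    """Hypothetical helper: return the node's stripped text, or a default if node is None."""
    return node.get_text(strip=True) if node is not None else default


# select_one returns None when the selector matches nothing.
missing = soup.select_one('.no-such-class')
# With a match, select_one returns the first matching element.
found = soup.select_one('.a-color-secondary')

print(repr(text_or_default(missing)))
print(text_or_default(found))
```

Wrapping the check in a small function keeps the call sites short when many optional fields are scraped from the same page.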
Looping with find_all
matches = soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})
for item in matches:
    if '$' in str(item):
        print(item.text)
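If only the first div containing a price is wanted, the loop can be folded into a small function that returns that element or None. This is a sketch under a few assumptions: find_price is a hypothetical helper name, the markup is a trimmed copy of the question's HTML, and 'html.parser' is used so that no extra parser needs to be installed:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, used here only for illustration.
html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
'''


def find_price(product):
    """Hypothetical helper: first matching div whose text contains '$', else None."""
    candidates = product.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})
    # next() with a default stops at the first hit and avoids an IndexError
    # when no candidate contains a dollar sign.
    return next((div for div in candidates if '$' in div.get_text()), None)


soup = BeautifulSoup(html, 'html.parser')
price = find_price(soup)
if price is not None:
    print(price.get_text(strip=True))
```

Checking div.get_text() rather than str(div) limits the test to the visible text, so a '$' inside an attribute value would not count as a match.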
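Answer 1 (score: 1)
Assuming the divs are directly under <body>, you can use standard Python indexing. In your real code, replace body in the selector with the appropriate element: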
data = '''<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select('body > div')[1].text.strip())
Prints:
$0.00 with a CONtv trial on Prime Video Channels
Note the > sign in select(): it means we want only the <div> elements that are directly under <body>.
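To see what the > combinator changes, it can help to compare it with the plain descendant selector on a tiny document. The markup and the id values below are made up for illustration, and an explicit <body> is included because 'html.parser' (assumed here in place of lxml) does not add one automatically:

```python
from bs4 import BeautifulSoup

# Made-up markup: div "b" is nested inside div "a"; div "c" is a sibling of "a".
html = '<body><div id="a"><div id="b"></div></div><div id="c"></div></body>'
soup = BeautifulSoup(html, 'html.parser')

# 'body > div' selects only divs that are direct children of <body>.
direct = [div['id'] for div in soup.select('body > div')]
# 'body div' selects every div nested anywhere under <body>.
nested = [div['id'] for div in soup.select('body div')]

print(direct)  # ['a', 'c']
print(nested)  # ['a', 'b', 'c']
```

With the child combinator the nested div no longer shifts the list positions, which is what makes indexing with [1] predictable.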