HTML:
<div class="col-7">
<dl class="row box">
<h2>GENERAL</h2>
<dt class="col-6">transmission:</dt>
<dd class="col-6">sequential automatic</dd>
<dt class="col-6 grey">number of seats:</dt>
<dd class="col-6">5</dd>
<dt class="col-6">first year of production:</dt>
<dd class="col-6">2017</dd>
<dt class="col-6 grey">last year of production:</dt>
<dd class="col-6">available</dd>
</dl>
<dl class="row box">
<h2>DRIVE</h2>
<dt class="col-6">fuel:</dt>
<dd class="col-6">petrol</dd>
<dt class="col-6 grey">total maximum power:</dt>
<dd class="col-6">147 kW (200 hp)</dd>
<dt class="col-6">total maximum torque:</dt>
<dd class="col-6">330 Nm</dd>
</dl>
<dl class="row box">
<h2>TRANSMISSION</h2>
<dt class="col-6">1st gear:</dt>
<dd class="col-6">5,00:1</dd>
<dt class="col-6 grey">2nd gear:</dt>
<dd class="col-6">3,20:1</dd>
</dl>
</div>
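My code:
for item2 in soup2.find_all(attrs={'class': 'col-7'}):
    jj = item2.text
jj contains every value from the site I am scraping, but I only need a few of them. For example, from GENERAL I only need the number of seats and the year of production, and from TRANSMISSION only the value for 1st gear.
The result should be: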
5, available, 5,00:1
Answer 0 (score: 1)
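The information you need is simply the item that follows each of the titles "number of seats", "last year of production" and "1st gear", so you can pair every element with its successor using zip:
# Every <dt> and <dd> carries the col-6 class, so labels and values alternate.
all_items = soup.find_all(attrs={'class': 'col-6'})
titles = [
    "number of seats",
    "last year of production",
    "1st gear"
]
d = {title: [] for title in titles}
# Walk over (element, following element) pairs.
for item, next_item in zip(all_items, all_items[1:]):
    for title in titles:
        if title in item.text:
            # item is a matching label, so next_item holds its value.
            d[title].append(next_item.text)
            break
After that, d will contain all the information you need.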
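As a minimal sketch, the zip approach above can be run end to end against a shortened copy of the markup from the question. The `html` string and the final `result` join are assumptions added for illustration; they only show how the collected values could be turned into the single line the question asks for.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (assumed sample input).
html = """
<div class="col-7">
<dl class="row box">
<h2>GENERAL</h2>
<dt class="col-6 grey">number of seats:</dt>
<dd class="col-6">5</dd>
<dt class="col-6 grey">last year of production:</dt>
<dd class="col-6">available</dd>
</dl>
<dl class="row box">
<h2>TRANSMISSION</h2>
<dt class="col-6">1st gear:</dt>
<dd class="col-6">5,00:1</dd>
<dt class="col-6 grey">2nd gear:</dt>
<dd class="col-6">3,20:1</dd>
</dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# class is a multi-valued attribute, so "col-6 grey" elements match too.
all_items = soup.find_all(attrs={"class": "col-6"})
titles = ["number of seats", "last year of production", "1st gear"]

d = {title: [] for title in titles}
for item, next_item in zip(all_items, all_items[1:]):
    for title in titles:
        if title in item.text:
            d[title].append(next_item.text.strip())
            break

# Take the first value found for each title, in the order of `titles`.
result = ", ".join(d[title][0] for title in titles)
print(result)  # 5, available, 5,00:1
```

Note that the substring test `title in item.text` would also match a value that happens to contain one of the titles, so this relies on labels and values alternating as they do in the sample.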
Answer 1 (score: 0)
Change the find_values tuple to choose which values to get from the HTML text:
from bs4 import BeautifulSoup

# html is assumed to hold the page source as a string.
soup = BeautifulSoup(html, 'html.parser')
find_values = ('number of seats', 'last year of production', '1st gear')
for i in soup.find_all(attrs={'class': 'row box'}):
    for j in i.find_all('dt'):
        text = j.get_text().lower().strip()
        # str.startswith accepts a tuple and matches any of its entries.
        if text.startswith(find_values):
            print(text, j.find_next_sibling('dd').get_text())
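Since the question asks for values from specific sections (GENERAL and TRANSMISSION), the same dt/dd lookup could also be scoped by the h2 heading of each block. The following is only a sketch under that assumption: the `wanted` mapping and the `extract` helper are hypothetical names introduced here, not part of either answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (assumed sample input).
html = """
<div class="col-7">
<dl class="row box">
<h2>GENERAL</h2>
<dt class="col-6">transmission:</dt>
<dd class="col-6">sequential automatic</dd>
<dt class="col-6 grey">number of seats:</dt>
<dd class="col-6">5</dd>
<dt class="col-6 grey">last year of production:</dt>
<dd class="col-6">available</dd>
</dl>
<dl class="row box">
<h2>DRIVE</h2>
<dt class="col-6">fuel:</dt>
<dd class="col-6">petrol</dd>
</dl>
<dl class="row box">
<h2>TRANSMISSION</h2>
<dt class="col-6">1st gear:</dt>
<dd class="col-6">5,00:1</dd>
<dt class="col-6 grey">2nd gear:</dt>
<dd class="col-6">3,20:1</dd>
</dl>
</div>
"""

# Hypothetical mapping: section heading -> labels wanted from that section.
wanted = {
    "GENERAL": ("number of seats", "last year of production"),
    "TRANSMISSION": ("1st gear",),
}


def extract(html, wanted):
    """Return the wanted values in document order."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    for dl in soup.find_all("dl"):
        heading = dl.find("h2")
        if heading is None:
            continue
        labels = wanted.get(heading.get_text(strip=True))
        if not labels:
            continue  # section not requested, e.g. DRIVE
        for dt in dl.find_all("dt"):
            # Normalise "number of seats:" to "number of seats".
            label = dt.get_text(strip=True).lower().rstrip(":")
            dd = dt.find_next_sibling("dd")
            if label in labels and dd is not None:
                values.append(dd.get_text(strip=True))
    return values


values = extract(html, wanted)
print(", ".join(values))  # 5, available, 5,00:1
```

Matching on the exact label rather than a prefix avoids picking up a similarly named entry in a section that was not requested.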