Python: extracting certain values with bs4

Time: 2018-07-30 01:39:48

Tags: python web-scraping beautifulsoup

HTML:

<div class="col-7"> 
    <dl class="row box">
        <h2>GENERAL</h2>
        <dt class="col-6">transmission:</dt>
        <dd class="col-6">sequential automatic</dd>
        <dt class="col-6 grey">number of seats:</dt>
        <dd class="col-6">5</dd>
        <dt class="col-6">first year of production:</dt>
        <dd class="col-6">2017</dd>
        <dt class="col-6 grey">last year of production:</dt>
        <dd class="col-6">available</dd>
    </dl>
    <dl class="row box">
        <h2>DRIVE</h2>
        <dt class="col-6">fuel:</dt>
        <dd class="col-6">petrol</dd>
        <dt class="col-6 grey">total maximum power:</dt>
        <dd class="col-6">147 kW (200 hp)</dd>
        <dt class="col-6">total maximum torque:</dt>
        <dd class="col-6">330 Nm</dd>
    </dl>
    <dl class="row box">
        <h2>TRANSMISSION</h2>
        <dt class="col-6">1st gear:</dt>
        <dd class="col-6">5,00:1</dd>
        <dt class="col-6 grey">2nd gear:</dt>
        <dd class="col-6">3,20:1</dd>
    </dl>
</div>

My code:

for item2 in soup2.find_all(attrs={'class':'col-7'}):
    jj=item2.text

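jj gives me all of the values on the page I scraped, but I only need a few of them. For example, I only need the number of seats and the year of production from GENERAL, and the 1st gear value from TRANSMISSION.

The result should be: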

5, available, 5,00:1

2 answers:

Answer 0 (score: 1)

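The information you need is simply the item that follows each of the titles "number of seats", "last year of production" and "1st gear", so you can use zip to pair every item with the one after it:

    all_items = soup.find_all(attrs={'class':'col-6'})
    titles = [
        "number of seats",
        "last year of production",
        "1st gear"
    ]
    d = {title: [] for title in titles}
    for item, next_item in zip(all_items, all_items[1:]):
        for title in titles:
            if title in item.text:
                d[title].append(next_item.text)
                break

Then d will contain all the information you need.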
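To see what the pairing produces, here is a self-contained sketch of the same idea. It assumes a trimmed copy of the question's HTML (only the GENERAL and TRANSMISSION blocks) stored in a variable named `html`, which is an illustrative name and not part of the original post.

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed copy of the HTML from the question.
html = """
<div class="col-7">
    <dl class="row box">
        <h2>GENERAL</h2>
        <dt class="col-6">transmission:</dt>
        <dd class="col-6">sequential automatic</dd>
        <dt class="col-6 grey">number of seats:</dt>
        <dd class="col-6">5</dd>
        <dt class="col-6">first year of production:</dt>
        <dd class="col-6">2017</dd>
        <dt class="col-6 grey">last year of production:</dt>
        <dd class="col-6">available</dd>
    </dl>
    <dl class="row box">
        <h2>TRANSMISSION</h2>
        <dt class="col-6">1st gear:</dt>
        <dd class="col-6">5,00:1</dd>
        <dt class="col-6 grey">2nd gear:</dt>
        <dd class="col-6">3,20:1</dd>
    </dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Both <dt> and <dd> carry the class "col-6", so they come back interleaved
# in document order: label, value, label, value, ...
all_items = soup.find_all(attrs={"class": "col-6"})

titles = ["number of seats", "last year of production", "1st gear"]
d = {title: [] for title in titles}

# Pair each element with its successor; when the first one is a wanted
# label, the second one is its value.
for item, next_item in zip(all_items, all_items[1:]):
    for title in titles:
        if title in item.text:
            d[title].append(next_item.text)
            break

print(d)
# {'number of seats': ['5'], 'last year of production': ['available'], '1st gear': ['5,00:1']}
```

Each value is a list because the same label could appear more than once on a page. Note that the match is a substring test, so a title such as "gear" would match both gear rows.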

Answer 1 (score: 0)

Change the find_values tuple to pick which values to get from the HTML text.

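    from bs4 import BeautifulSoup

    # `html` is assumed to hold the page markup shown in the question.
    soup = BeautifulSoup(html, 'html.parser')
    # Labels to look for; str.startswith accepts a tuple of prefixes.
    find_values = ('number of seats', 'last year of production', '1st gear')
    for i in soup.find_all(attrs={'class': 'row box'}):
        for j in i.find_all('dt'):
            text = j.get_text().lower().strip()
            if text.startswith(find_values):
                # The value is the <dd> that follows the matching <dt>.
                print(text, j.find_next_sibling('dd').get_text())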
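If you want the single line asked for in the question rather than printed label/value pairs, the sibling lookup can be wrapped in a small function. This is only a sketch: `extract_values`, `wanted` and the shortened `html` sample are hypothetical names and data introduced for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened copy of the HTML from the question.
html = """
<div class="col-7">
    <dl class="row box">
        <h2>GENERAL</h2>
        <dt class="col-6 grey">number of seats:</dt>
        <dd class="col-6">5</dd>
        <dt class="col-6">first year of production:</dt>
        <dd class="col-6">2017</dd>
        <dt class="col-6 grey">last year of production:</dt>
        <dd class="col-6">available</dd>
    </dl>
    <dl class="row box">
        <h2>TRANSMISSION</h2>
        <dt class="col-6">1st gear:</dt>
        <dd class="col-6">5,00:1</dd>
    </dl>
</div>
"""


def extract_values(markup, wanted):
    """Return the <dd> text for each wanted <dt> label, in the order given."""
    soup = BeautifulSoup(markup, "html.parser")
    found = {}
    for dt in soup.find_all("dt"):
        # Normalise "Number of seats:" to "number of seats".
        label = dt.get_text(strip=True).rstrip(":").lower()
        dd = dt.find_next_sibling("dd")
        if label in wanted and dd is not None:
            found[label] = dd.get_text(strip=True)
    # Labels that were not found on the page are skipped.
    return [found[label] for label in wanted if label in found]


wanted = ("number of seats", "last year of production", "1st gear")
result = ", ".join(extract_values(html, wanted))
print(result)  # 5, available, 5,00:1
```

Matching on the exact label instead of a prefix avoids picking up similarly named rows by accident.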