Using soup.find()

Time: 2018-09-22 16:08:58

Tags: python html web-scraping beautifulsoup

I am trying to use Beautiful Soup to extract some items from HTML that I have loaded into Python.

Here is the HTML:

[<div class="metadata container container-max-width-modifier">
 <div class="salary col-xs-12 col-sm-6 col-md-6 col-lg-6">
 <i class="icon icon-pound"></i>
 <span itemprop="baseSalary" itemscope="" itemtype="http://schema.org/MonetaryAmount">
 <meta content="GBP" itemprop="currency"/>
 <span>£7.83 - £8.83 per hour</span>
 <span itemprop="value" itemscope="" itemtype="http://schema.org/QuantitativeValue">
 <meta content="7.8300" itemprop="value"/>
 <meta content="7.8300" itemprop="minValue"/>
 <meta content="8.8300" itemprop="maxValue"/>
 <meta content="HOUR" itemprop="unitText"/>
 </span>
 </span>
 </div>
 <div class="location col-xs-12 col-sm-6 col-md-6 col-lg-6">
 <i class="icon icon-location-new"></i>
 <span id="jobCountry" value="Scotland"></span>
 <span>
 <a href="/jobs/jobs-in-aberdeen" itemprop="jobLocation" itemscope="" itemtype="http://schema.org/Place">
 <span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
 <meta content="Aberdeenshire" itemprop="addressRegion"/>
 <span itemprop="addressLocality">Aberdeen</span>
 <meta content="GB" itemprop="addressCountry">
 </meta></span>
 </a>, <span>Aberdeenshire</span>
 </span>
 </div>
 <div class="time col-xs-12 col-sm-6 col-md-6 col-lg-6">
 <i class="icon icon-clock"></i>
 <span content="FULL_TIME, PART_TIME" itemprop="employmentType">Permanent, full-time or part-time</span>
 <meta content="full-time or part-time" itemprop="workHours"/>
 </div>
 <div class="applications col-xs-12 col-sm-6 col-md-6 col-lg-6">
 <i class="icon icon-applicants"></i>
                     Be one of the first ten applicants
                 </div>
 <ul itemscope="" itemtype="http://schema.org/BreadcrumbList" style="display:none">
 <li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
 <meta content="1" itemprop="position"/>
 <ul itemprop="item" itemscope="" itemtype="http://schema.org/WebPage">
 <li>
 <meta content="https://www.reed.co.uk/jobs/retail-jobs" itemprop="url"/>
 <meta content="Retail" itemprop="name"/>
 </li>
 </ul>
 <li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
 <meta content="2" itemprop="position"/>
 <ul itemprop="item" itemscope="" itemtype="http://schema.org/WebPage">
 <li>
 <meta content="https://www.reed.co.uk/jobs/retail-jobs" itemprop="url"/>
 <meta content="Other Retail" itemprop="name"/>
 </li>
 </ul>
 </li></li></ul>

Here is the code I wrote:

salary_range = soup.find('div', class_="metadata container container-max-width-modifier").find('span', itemprop="baseSalary").text.strip()
salary_min = soup.find('div', class_="metadata container container-max-width-modifier").find('span', itemprop="value")
salary_time = soup.find('div', class_="metadata container container-max-width-modifier").find('span', itemprop="unitText")
job_location = soup.find('div', class_="location col-xs-12 col-sm-6 col-md-6 col-lg-6").find('span', itemprop="addressLocality")
job_country = soup.find('div', class_="location col-xs-12 col-sm-6 col-md-6 col-lg-6").find('span', id="jobCountry")

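The first one works fine: it pulls out the salary range. I would like to have the following as separate variables: the unit (e.g. per hour, per year, per month), the minimum value, the maximum value, the job location, the job country, full-time/part-time, and the sector.

I think I can manage most of these myself, but I am having particular trouble with salary_min, salary_max, and the unit (hour, year, month). Also, for job_country and job_location the code returns the full HTML line, whereas I only want the text inside the quotation marks.

If anyone can offer any insight into how to do this, or how to do it better, I would be very grateful!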

2 answers:

Answer 0 (score: 1)

You can use Python's lxml library instead of BeautifulSoup; see the code below.

import requests
from lxml import html

req = requests.get('https://www.reed.co.uk/jobs/barista-costa-aberdeen-tesco/36178175')
tree = html.fromstring(req.content)
salary_range = tree.xpath('.//span[@itemprop="baseSalary"]/span/text()')[0]
salary_min = tree.xpath('.//meta[@itemprop="minValue"]/@content')[0]
salary_max = tree.xpath('.//meta[@itemprop="maxValue"]/@content')[0]
salary_time = tree.xpath('.//meta[@itemprop="unitText"]/@content')[0]
job_region = tree.xpath('.//meta[@itemprop="addressRegion"]/@content')[0]
job_locality = tree.xpath('.//span[@itemprop="addressLocality"]/text()')[0]
job_country = tree.xpath('.//meta[@itemprop="addressCountry"]/@content')[0]

print('Salary Range:', salary_range,'\n' 'Min Salary:', salary_min,'\n'
 'Max Salary:', salary_max,'\n' 'Salary Time:', salary_time,'\n'
 'Job Region:', job_region,'\n' 'Job Locality:', job_locality,'\n'
 'Job Country:', job_country)

Output

Salary Range: £7.83 - £8.83 per hour
Min Salary: 7.8300
Max Salary: 8.8300
Salary Time: HOUR
Job Region: Aberdeenshire
Job Locality: Aberdeen
Job Country: GB
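
One caveat with the XPath approach: indexing a result list with [0] raises IndexError when a listing lacks that field. A minimal sketch of one way to guard against this follows. The helper name first_or_default and the trimmed SAMPLE_HTML are illustrative assumptions, not part of the original answer; the markup is inlined so the example runs without a network request.

```python
from lxml import html

# Trimmed, hypothetical sample based on the markup in the question.
SAMPLE_HTML = """
<div class="salary">
  <span itemprop="baseSalary">
    <span>£7.83 - £8.83 per hour</span>
    <meta content="7.8300" itemprop="minValue"/>
    <meta content="8.8300" itemprop="maxValue"/>
  </span>
</div>
"""


def first_or_default(tree, xpath_expr, default=None):
    """Return the first XPath match, or `default` when nothing matches."""
    matches = tree.xpath(xpath_expr)
    return matches[0] if matches else default


tree = html.fromstring(SAMPLE_HTML)

# Present in the sample, so the attribute value is returned.
salary_min = first_or_default(tree, './/meta[@itemprop="minValue"]/@content')

# Deliberately absent from the sample, so the fallback is returned.
salary_unit = first_or_default(
    tree, './/meta[@itemprop="unitText"]/@content', default="unknown"
)

print(salary_min, salary_unit)
```

This keeps each extraction on one line while letting a page without, say, a unitText entry fall back to a default instead of stopping the script.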

Answer 1 (score: 1)

To get the three fields Min Salary, Max Salary, and Unit, you could try the following. I used CSS selectors in the script to keep it concise:

import requests
from bs4 import BeautifulSoup

url = "https://www.reed.co.uk/jobs/barista-costa-aberdeen-tesco/36178175"

res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,"lxml")
minSalary = soup.select_one('.salary meta[itemprop="minValue"]')["content"]
maxSalary = soup.select_one('.salary meta[itemprop="maxValue"]')["content"]
unit = soup.select_one('.salary meta[itemprop="unitText"]')["content"]
print(f'Min Salary: {minSalary}\nMax Salary: {maxSalary}\nUnit: {unit}')

The output it produces:

Min Salary: 7.8300
Max Salary: 8.8300
Unit: HOUR
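
The remaining fields from the question can probably be handled the same way with soup.find(). In the posted HTML, minValue, maxValue, and unitText sit on meta tags rather than span tags, so a search such as find('span', itemprop="unitText") matches nothing; the values are stored in the content attribute. The sketch below assumes a trimmed copy of the question's markup (SAMPLE_HTML is an illustrative name) and uses the built-in html.parser so it runs offline.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, inlined for an offline run.
SAMPLE_HTML = """
<div class="metadata container container-max-width-modifier">
  <div class="salary col-xs-12 col-sm-6 col-md-6 col-lg-6">
    <span itemprop="baseSalary">
      <span>£7.83 - £8.83 per hour</span>
      <span itemprop="value">
        <meta content="7.8300" itemprop="minValue"/>
        <meta content="8.8300" itemprop="maxValue"/>
        <meta content="HOUR" itemprop="unitText"/>
      </span>
    </span>
  </div>
  <div class="location col-xs-12 col-sm-6 col-md-6 col-lg-6">
    <span id="jobCountry" value="Scotland"></span>
    <span itemprop="addressLocality">Aberdeen</span>
  </div>
  <div class="time col-xs-12 col-sm-6 col-md-6 col-lg-6">
    <span content="FULL_TIME, PART_TIME" itemprop="employmentType">Permanent, full-time or part-time</span>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Look the container up once; matching on a single class name is enough.
metadata = soup.find("div", class_="metadata")

# The numeric values live on <meta> tags, so read the "content" attribute.
salary_min = metadata.find("meta", itemprop="minValue")["content"]
salary_max = metadata.find("meta", itemprop="maxValue")["content"]
salary_unit = metadata.find("meta", itemprop="unitText")["content"]

# Text between tags comes from get_text(); attribute values come from indexing.
job_location = metadata.find("span", itemprop="addressLocality").get_text(strip=True)
job_country = metadata.find("span", id="jobCountry")["value"]
job_type = metadata.find("span", itemprop="employmentType").get_text(strip=True)

print(salary_min, salary_max, salary_unit)
print(job_location, job_country, job_type)
```

Indexing a tag like a dictionary returns an attribute value, while get_text() returns the text between the tags, which avoids getting the whole HTML element back for job_country and job_location.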