I'm trying to use Beautiful Soup to extract a few items from some HTML I've pulled into Python.
Here is the HTML:
[<div class="metadata container container-max-width-modifier">
<div class="salary col-xs-12 col-sm-6 col-md-6 col-lg-6">
<i class="icon icon-pound"></i>
<span itemprop="baseSalary" itemscope="" itemtype="http://schema.org/MonetaryAmount">
<meta content="GBP" itemprop="currency"/>
<span>£7.83 - £8.83 per hour</span>
<span itemprop="value" itemscope="" itemtype="http://schema.org/QuantitativeValue">
<meta content="7.8300" itemprop="value"/>
<meta content="7.8300" itemprop="minValue"/>
<meta content="8.8300" itemprop="maxValue"/>
<meta content="HOUR" itemprop="unitText"/>
</span>
</span>
</div>
<div class="location col-xs-12 col-sm-6 col-md-6 col-lg-6">
<i class="icon icon-location-new"></i>
<span id="jobCountry" value="Scotland"></span>
<span>
<a href="/jobs/jobs-in-aberdeen" itemprop="jobLocation" itemscope="" itemtype="http://schema.org/Place">
<span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<meta content="Aberdeenshire" itemprop="addressRegion"/>
<span itemprop="addressLocality">Aberdeen</span>
<meta content="GB" itemprop="addressCountry">
</meta></span>
</a>, <span>Aberdeenshire</span>
</span>
</div>
<div class="time col-xs-12 col-sm-6 col-md-6 col-lg-6">
<i class="icon icon-clock"></i>
<span content="FULL_TIME, PART_TIME" itemprop="employmentType">Permanent, full-time or part-time</span>
<meta content="full-time or part-time" itemprop="workHours"/>
</div>
<div class="applications col-xs-12 col-sm-6 col-md-6 col-lg-6">
<i class="icon icon-applicants"></i>
Be one of the first ten applicants
</div>
<ul itemscope="" itemtype="http://schema.org/BreadcrumbList" style="display:none">
<li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
<meta content="1" itemprop="position"/>
<ul itemprop="item" itemscope="" itemtype="http://schema.org/WebPage">
<li>
<meta content="https://www.reed.co.uk/jobs/retail-jobs" itemprop="url"/>
<meta content="Retail" itemprop="name"/>
</li>
</ul>
<li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
<meta content="2" itemprop="position"/>
<ul itemprop="item" itemscope="" itemtype="http://schema.org/WebPage">
<li>
<meta content="https://www.reed.co.uk/jobs/retail-jobs" itemprop="url"/>
<meta content="Other Retail" itemprop="name"/>
</li>
</ul>
</li></li></ul>
Here is the code I've written:
salary_range = soup.find('div', class_="metadata container container-max-width-modifier").find('span', itemprop="baseSalary").text.strip()
salary_min = soup.find('div', class_="metadata container container-max-width-modifier").find('span', itemprop="value")
salary_time = soup.find('div', class_="metadata container container-max-width-modifier").find('span', itemprop="unitText")
job_location = soup.find('div', class_="location col-xs-12 col-sm-6 col-md-6 col-lg-6").find('span', itemprop="addressLocality")
job_country = soup.find('div', class_="location col-xs-12 col-sm-6 col-md-6 col-lg-6").find('span', id="jobCountry")
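The first one works, in that it pulls out the salary range. I'd like the following as separate variables: unit (e.g. per hour, per year, per month), minimum value, maximum value, job location, job country, full-time/part-time, and sector.
I think I can manage most of these myself, but the ones I'm particularly stuck on are salary_min, salary_max and the unit (hour, year, month). For job_country and job_location it also returns the full HTML line, when all I want is the text inside the quotation marks.
If anyone can offer any insight into how to do this, or how to do it better, I'd really appreciate it!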
Answer 0 (score: 1)
You can use Python's lxml library instead of BeautifulSoup; see the code below.
import requests
from lxml import html
req = requests.get('https://www.reed.co.uk/jobs/barista-costa-aberdeen-tesco/36178175')
tree = html.fromstring(req.content)
salary_range = tree.xpath('.//span[@itemprop="baseSalary"]/span/text()')[0]
salary_min = tree.xpath('.//meta[@itemprop="minValue"]/@content')[0]
salary_max = tree.xpath('.//meta[@itemprop="maxValue"]/@content')[0]
salary_time = tree.xpath('.//meta[@itemprop="unitText"]/@content')[0]
job_region = tree.xpath('.//meta[@itemprop="addressRegion"]/@content')[0]
job_locality = tree.xpath('.//span[@itemprop="addressLocality"]/text()')[0]
job_country = tree.xpath('.//meta[@itemprop="addressCountry"]/@content')[0]
print('Salary Range:', salary_range,'\n' 'Min Salary:', salary_min,'\n'
'Max Salary:', salary_max,'\n' 'Salary Time:', salary_time,'\n'
'Job Region:', job_region,'\n' 'Job Locality:', job_locality,'\n'
'Job Country:', job_country)
Output
Salary Range: £7.83 - £8.83 per hour
Min Salary: 7.8300
Max Salary: 8.8300
Salary Time: HOUR
Job Region: Aberdeenshire
Job Locality: Aberdeen
Job Country: GB
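One thing to watch with this approach: indexing every XPath result with [0] raises IndexError as soon as a listing lacks one of the fields. A minimal sketch of a guard follows, run against a trimmed copy of the markup from the question so it works offline. The helper name first and the SAMPLE_HTML constant are made up for illustration; they are not part of lxml.

```python
from lxml import html

# Trimmed copy of the salary markup from the question (assumption: the live
# page has the same structure; it is inlined here so the sketch runs offline).
SAMPLE_HTML = """
<div class="salary">
  <span itemprop="baseSalary">
    <span>£7.83 - £8.83 per hour</span>
    <span itemprop="value">
      <meta content="7.8300" itemprop="minValue"/>
      <meta content="8.8300" itemprop="maxValue"/>
      <meta content="HOUR" itemprop="unitText"/>
    </span>
  </span>
</div>
"""

tree = html.fromstring(SAMPLE_HTML)


def first(xpath_expr, default=None):
    """Hypothetical helper: first XPath result, or `default` if nothing matches."""
    results = tree.xpath(xpath_expr)
    return results[0] if results else default


salary_min = first('.//meta[@itemprop="minValue"]/@content')
salary_max = first('.//meta[@itemprop="maxValue"]/@content')
salary_time = first('.//meta[@itemprop="unitText"]/@content')
# addressRegion is not in the trimmed sample, so this falls back to None
# instead of raising IndexError.
job_region = first('.//meta[@itemprop="addressRegion"]/@content')

print(salary_min, salary_max, salary_time, job_region)
```

Returning None for a missing field lets the caller decide whether to skip the listing or store an empty value, rather than having the whole scrape stop on one incomplete page.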
Answer 1 (score: 1)
To get the three fields Min Salary, Max Salary and Unit, you can try the following. I used CSS selectors in the script to keep it concise:
import requests
from bs4 import BeautifulSoup
url = "https://www.reed.co.uk/jobs/barista-costa-aberdeen-tesco/36178175"
res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,"lxml")
minSalary = soup.select_one('.salary meta[itemprop="minValue"]')["content"]
maxSalary = soup.select_one('.salary meta[itemprop="maxValue"]')["content"]
unit = soup.select_one('.salary meta[itemprop="unitText"]')["content"]
print(f'Min Salary: {minSalary}\nMax Salary: {maxSalary}\nUnit: {unit}')
Output it produces:
Min Salary: 7.8300
Max Salary: 8.8300
Unit: HOUR
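The remaining fields from the question can probably be pulled the same way. As far as I can tell from the posted markup, the original attempts went wrong for two reasons: unitText, minValue and maxValue sit on meta tags, so find('span', itemprop="unitText") finds nothing, and find() returns the whole tag, so you get the full HTML line unless you read an attribute (tag["value"], tag["content"]) or call get_text(). Below is a minimal sketch under those assumptions, run against a trimmed, tidied copy of the question's HTML so it works offline. SAMPLE_HTML and the helper attr_or_none are made up for illustration, and using the breadcrumb names as the "sector" is my guess at what was meant.

```python
from bs4 import BeautifulSoup

# Trimmed, well-formed copy of the markup from the question (assumption: the
# live page is structured the same way).
SAMPLE_HTML = """
<div class="metadata container container-max-width-modifier">
  <div class="location col-xs-12 col-sm-6 col-md-6 col-lg-6">
    <span id="jobCountry" value="Scotland"></span>
    <span itemprop="addressLocality">Aberdeen</span>
  </div>
  <div class="time col-xs-12 col-sm-6 col-md-6 col-lg-6">
    <span content="FULL_TIME, PART_TIME" itemprop="employmentType">Permanent, full-time or part-time</span>
  </div>
  <ul itemscope="" itemtype="http://schema.org/BreadcrumbList">
    <li itemprop="itemListElement">
      <ul itemprop="item" itemtype="http://schema.org/WebPage">
        <li><meta content="Retail" itemprop="name"/></li>
      </ul>
    </li>
    <li itemprop="itemListElement">
      <ul itemprop="item" itemtype="http://schema.org/WebPage">
        <li><meta content="Other Retail" itemprop="name"/></li>
      </ul>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def attr_or_none(selector, attribute):
    """Hypothetical helper: attribute of the first match, or None if no match."""
    tag = soup.select_one(selector)
    return tag.get(attribute) if tag is not None else None


# "Scotland" lives in the value attribute, not in the tag's text.
job_country = attr_or_none("#jobCountry", "value")

# The locality is ordinary text inside the span, so get_text() is enough.
job_location = soup.select_one('.location [itemprop="addressLocality"]').get_text(strip=True)

# employmentType carries machine-readable codes in its content attribute.
employment_codes = attr_or_none('[itemprop="employmentType"]', "content")
employment_types = [code.strip() for code in employment_codes.split(",")]

# Breadcrumb names, used here as a stand-in for the sector.
sectors = [meta["content"] for meta in soup.select('[itemprop="item"] meta[itemprop="name"]')]

print(job_country, job_location, employment_types, sectors)
```

Note that the value attribute gives "Scotland" while the addressCountry meta tag in the full page gives "GB", so pick whichever form you need.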