I only need to extract the dollar amounts from the li tag, so the output should look like $63,606.40 - $70,137.60
html =
<li>
Regular - Full time
<span>-</span>
$63,606.40 - $70,137.60 Annually
</li>
from bs4 import BeautifulSoup
import requests
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
job_elem = soup.find('li', attrs = {'class':'list-item'}) # gives container with all we need
salary = job_elem.findAll('li')
print(salary[1])
Output:
<li>
Regular - Full time <span>-</span>
$63,606.40 - $70,137.60 Annually </li>
Answer 0 (score: 1)
If the text always has the same layout, you can get it as a string
text = salary[1].get_text(strip=True)
and slice it
print(text[20:-9])
Working code
from bs4 import BeautifulSoup
import requests
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
all_jobs = soup.find_all('li', attrs = {'class':'list-item'})
for job in all_jobs:
    salary = job.find_all('li')
    text = salary[1].get_text(strip=True)
    print(text[20:-9])
Result
$63,606.40 - $70,137.60
$125,000.00 - $135,000.00
$140,000.00 - $150,000.00
$79,144.00 - $96,200.00
$64,355.20 - $79,040.00
$50,356.80 - $61,193.60
$225,000.00 - $250,000.00
$87,000.00 - $100,000.00
$115,000.00 - $124,000.00
$84,864.00 - $104,228.80
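The offsets 20 and -9 in the slice are not explained above, so here is a minimal offline sketch of where they come from. It assumes that get_text(strip=True) returns the string shown below for the sample li from the question; the names prefix, suffix and salary_range are only for illustration.

```python
# Text assumed to be what get_text(strip=True) returns for the sample <li>:
# each text node is stripped and the pieces are joined without a separator.
text = "Regular - Full time-$63,606.40 - $70,137.60 Annually"

prefix = "Regular - Full time-"   # everything before the first '$'
suffix = " Annually"              # everything after the second amount

start = len(prefix)    # 20
end = -len(suffix)     # -9

salary_range = text[start:end]
print(salary_range)
```

If a listing has a different job type or pay period, the prefix and suffix lengths change and the fixed slice cuts in the wrong place.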
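EDIT: If the text can vary, you can use the first $ to find where the salary starts, and the third space after that first $ to find where the salary ends.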
text = '$' + text.split('$', 1)[1]
text = ' '.join(text.split(' ')[:3])
print(text)
from bs4 import BeautifulSoup
import requests
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
all_jobs = soup.find_all('li', attrs = {'class':'list-item'}) # gives container with all we need
for job in all_jobs:
    salary = job.find_all('li')
    text = salary[1].get_text(strip=True)
    text = '$' + text.split('$', 1)[1]
    text = ' '.join(text.split(' ')[:3])
    print(text)
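The split approach can be tried without hitting the website. The sketch below wraps it in a helper, salary_by_split (a hypothetical name, not part of the original answer), and adds a guard for text that contains no $ at all, a case the code above would fail on with an IndexError. The input strings are made-up samples in the same layout as the question's li.

```python
def salary_by_split(text):
    """Return the 'low - high' part of text, or None if there is no '$'."""
    if '$' not in text:
        return None
    # Keep everything from the first '$' onward.
    tail = '$' + text.split('$', 1)[1]
    # The range is the first three space-separated tokens: low, '-', high.
    return ' '.join(tail.split(' ')[:3])


print(salary_by_split('Regular - Full time-$63,606.40 - $70,137.60 Annually'))
print(salary_by_split('Temporary-$20.00 - $25.00 Hourly'))
print(salary_by_split('Depends on qualifications'))
```

This still assumes the amounts are written as "low - high" with single spaces around the dash; a listing with only one amount would return that amount plus the following word.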
BTW: you could also use a regex to search in the text, but I skipped that part.
EDIT: I made a version that uses a regex
import re
text = salary[1].get_text(strip=True)
text = re.findall(r'\$[0-9,.]+ - \$[0-9,.]+', text)  # raw string, so \$ is not treated as a string escape
print(text[0])
from bs4 import BeautifulSoup
import requests
import re
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
all_jobs = soup.find_all('li', attrs = {'class':'list-item'}) # gives container with all we need
for job in all_jobs:
    salary = job.find_all('li')
    text = salary[1].get_text(strip=True)
    # raw string, so \$ is not treated as a string escape
    text = re.findall(r'\$[0-9,.]+ - \$[0-9,.]+', text)
    print(text[0])
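As a variation on the regex version, the sketch below uses re.search with a compiled pattern and returns None when nothing matches, instead of indexing into an empty list. The helper name salary_by_regex and the sample strings are assumptions for illustration; the pattern is the same one used above.

```python
import re

# Same pattern as above: '$' + digits/commas/dots, ' - ', then a second amount.
SALARY_RANGE = re.compile(r'\$[0-9,.]+ - \$[0-9,.]+')


def salary_by_regex(text):
    """Return the first 'low - high' salary range in text, or None."""
    match = SALARY_RANGE.search(text)
    return match.group(0) if match else None


print(salary_by_regex('Regular - Full time-$63,606.40 - $70,137.60 Annually'))
print(salary_by_regex('Depends on qualifications'))
```

Returning None lets the caller decide how to handle listings without a salary range, rather than stopping the whole loop on the first one.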