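How to extract a specific part of an <li> tag and omit the <span> tag inside that <li>

Time: 2019-11-22 15:19:05

Tags: python python-3.x beautifulsoup

I only need to extract the dollar amounts from the li tag, so the output should look like $63,606.40 - $70,137.60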

html = 
<li>
Regular - Full time  
<span>-</span>
$63,606.40 - $70,137.60 Annually 
</li>

from bs4 import BeautifulSoup
import requests

headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
job_elem = soup.find('li', attrs={'class': 'list-item'})  # gives container with all we need
salary = job_elem.find_all('li')
print(salary[1])

Output:

<li>
Regular - Full time                                                            <span>-</span>
                            $63,606.40 - $70,137.60 Annually                        </li>

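1 Answer:

Answer 0 (score: 1)

If the text around the salary is always the same (here "Regular - Full time" before it and "Annually" after it), you can get the whole item as a string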

    text = salary[1].get_text(strip=True)

and slice it

    print(text[20:-9])
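To see where 20 and -9 come from, here is a minimal sketch that works on a plain string. It assumes get_text(strip=True) returns the three text fragments of the sample <li> stripped and joined with no separator; the names prefix and suffix are only for illustration.

```python
# Assumed result of get_text(strip=True) for the sample <li>:
# "Regular - Full time" + "-" (from the <span>) + "$63,606.40 - $70,137.60 Annually"
text = "Regular - Full time-$63,606.40 - $70,137.60 Annually"

prefix = "Regular - Full time-"  # everything before the first '$'
suffix = " Annually"             # pay period after the salary range

print(len(prefix), len(suffix))  # 20 9

# Same slice as text[20:-9], written with the measured lengths
salary = text[len(prefix):-len(suffix)]
print(salary)  # $63,606.40 - $70,137.60
```

The offsets are fixed, so this only holds while every item has exactly this prefix and suffix; an item such as "Part time" or "Hourly" would be cut in the wrong place.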

Working code

from bs4 import BeautifulSoup
import requests

headers = {'X-Requested-With': 'XMLHttpRequest'}

r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)

soup = BeautifulSoup(r.content, 'lxml')

all_jobs  = soup.find_all('li', attrs = {'class':'list-item'})

for job in all_jobs:
    salary = job.find_all('li')
    text = salary[1].get_text(strip=True)
    print(text[20:-9])

Result

$63,606.40 - $70,137.60
$125,000.00 - $135,000.00
$140,000.00 - $150,000.00
$79,144.00 - $96,200.00
$64,355.20 - $79,040.00
$50,356.80 - $61,193.60
$225,000.00 - $250,000.00
$87,000.00 - $100,000.00
$115,000.00 - $124,000.00
$84,864.00 - $104,228.80

EDIT: If the text can vary, you can use the first $ to find where the salary starts, and the third space after that $ to find where it ends.

text = '$' + text.split('$', 1)[1]
text = ' '.join(text.split(' ')[:3])
print(text)

from bs4 import BeautifulSoup
import requests

headers = {'X-Requested-With': 'XMLHttpRequest'}

r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)

soup = BeautifulSoup(r.content, 'lxml')
all_jobs  = soup.find_all('li', attrs = {'class':'list-item'}) # gives container with all we need

for job in all_jobs:
    salary = job.find_all('li')
    text = salary[1].get_text(strip=True)
    text = '$' + text.split('$', 1)[1]
    text = ' '.join(text.split(' ')[:3])
    print(text)
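The split approach can be tried without any network request. The sketch below wraps the same two lines in a helper and adds a guard for items that contain no $ at all. The function name salary_by_split and the second and third sample strings are made up for illustration; only the first string corresponds to the item in the question.

```python
def salary_by_split(text):
    """Return the '$low - $high' part of text, or None if there is no '$'."""
    if '$' not in text:
        return None  # nothing to extract; split('$', 1)[1] would raise IndexError
    # Drop everything before the first '$', then put the '$' back
    text = '$' + text.split('$', 1)[1]
    # Keep the first three space-separated parts: low, '-', high
    return ' '.join(text.split(' ')[:3])


samples = [
    "Regular - Full time-$63,606.40 - $70,137.60 Annually",  # from the question
    "Temporary - Part time-$18.50 - $22.75 Hourly",          # hypothetical item
    "Regular - Full time-Depends on Qualifications",         # hypothetical item
]

for sample in samples:
    print(salary_by_split(sample))
```

This still assumes the range is written as "low - high" with single spaces around the dash; an item with a single amount would return extra words after it.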

BTW: you can also use a regex to search the text, but I skipped that part.


EDIT: I made a version that uses a regex

import re

text = salary[1].get_text(strip=True)
text = re.findall(r'\$[0-9,.]+ - \$[0-9,.]+', text)  # raw string so \$ is not treated as a string escape
print(text[0])

from bs4 import BeautifulSoup
import requests
import re

headers = {'X-Requested-With': 'XMLHttpRequest'}

r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)

soup = BeautifulSoup(r.content, 'lxml')
all_jobs  = soup.find_all('li', attrs = {'class':'list-item'}) # gives container with all we need

for job in all_jobs:
    salary = job.find_all('li')
    text = salary[1].get_text(strip=True)
    text = re.findall(r'\$[0-9,.]+ - \$[0-9,.]+', text)  # raw string so \$ is not treated as a string escape
    print(text[0])
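One caveat with print(text[0]): re.findall returns an empty list when an item has no salary range, so indexing it raises IndexError. A possible variant, sketched here on plain strings, uses re.search and returns None in that case. The names SALARY_RE and salary_by_regex and the second sample string are hypothetical.

```python
import re

# Same pattern as above, compiled once; the raw string keeps the backslashes literal
SALARY_RE = re.compile(r'\$[0-9,.]+ - \$[0-9,.]+')


def salary_by_regex(text):
    """Return the first '$low - $high' range in text, or None if absent."""
    match = SALARY_RE.search(text)
    return match.group(0) if match else None


samples = [
    "Regular - Full time-$63,606.40 - $70,137.60 Annually",  # from the question
    "Regular - Full time-Depends on Qualifications",         # hypothetical item
]

for sample in samples:
    print(salary_by_regex(sample))
```

In the loop over all_jobs this would replace the findall line, with a check for None before printing.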