I'm trying to scrape a web page with Python 3.7 and BeautifulSoup. From the html below I can extract the "posting-name", the "sort-by-location posting-category small-category-label", and the "sort-by-team posting-category small-category-label" data, but I can't extract the "sort-by-commitment posting-category small-category-label" (e.g. Full-time), even though its html structure looks the same as the others:
<div class="posting" data-qa-posting-id="13f9db2f-7a80-4b50-9a61-005ad322ea2d">
<div class="posting-apply" data-qa="btn-apply">
<a href="https://jobs.lever.co/twitch/13f9db2f-7a80-4b50-9a61-005ad322ea2d" class="posting-btn-submit template-btn-submit hex-color">Apply</a>
</div>
<a class="posting-title" href="https://jobs.lever.co/twitch/13f9db2f-7a80-4b50-9a61-005ad322ea2d">
<h5 data-qa="posting-name">Account Director - DACH</h5>
<div class="posting-categories">
<span href="#" class="sort-by-location posting-category small-category-label">Hamburg, Germany</span>
<span href="#" class="sort-by-team posting-category small-category-label">Business Operations & Go-To-Market – Advertising</span>
<span href="#" class="sort-by-commitment posting-category small-category-label">Full-time</span>
</div>
</a>
</div>
I tried creating a separate soup for "posting-categories", but that didn't work.
import requests
from bs4 import BeautifulSoup
from csv import writer

response = requests.get('https://jobs.lever.co/twitch')
soup = BeautifulSoup(response.text, 'html.parser')
posts = soup.findAll('div', {'class': 'posting'})

with open('twitch.csv', 'w') as csv_file:
    csv_writer = writer(csv_file)
    headers = ['Position', 'Link', 'Location', 'Team', 'Commitment']
    csv_writer.writerow(headers)
    for post in posts:
        position = post.find('h5', {'data-qa': 'posting-name'}).text
        link = post.find('a')['href']
        location = post.find('span', {'class': 'sort-by-location posting-category small-category-label'}).text
        team = post.find('span', {'class': 'sort-by-team posting-category small-category-label'}).text
        commitment = post.find('span', {'class': 'sort-by-commitment posting-category small-category-label'}).text
        csv_writer.writerow([position, link, location, team, commitment])
The expected result in the csv is the job title, the link (url), the location, the team and the commitment.
So far I'm getting the following error:
commitment = post.find('span',{'class':'sort-by-commitment posting-category small-category-label'}).text
AttributeError: 'NoneType' object has no attribute 'text'
*Edit: the dataset is missing the last row, and I don't know why:
<a class="posting-title" href="https://jobs.lever.co/twitch/c8cc56e7-75f6-4cac-9983-e0769db9dd2e">
<h5 data-qa="posting-name">Applied Scientist Intern</h5>
<div class="posting-categories">
<span href="#" class="sort-by-location posting-category small-category-label">San Francisco, CA</span>
<span href="#" class="sort-by-team posting-category small-category-label">University (Internships) – Engineering</span>
<span href="#" class="sort-by-commitment posting-category small-category-label">Intern</span>
Answer 0 (score: 0)
If you inspect the html, the commitment is missing for some postings, so you have to guard it with an if condition. Try the code below.
for post in posts:
    position = post.find('h5', {'data-qa': 'posting-name'}).text
    link = post.find('a')['href']
    location = post.find('span', {'class': 'sort-by-location posting-category small-category-label'}).text
    team = post.find('span', {'class': 'sort-by-team posting-category small-category-label'}).text
    commitment = ''  # default, so postings without a commitment label are still written with aligned columns
    if post.find('span', {'class': 'sort-by-commitment posting-category small-category-label'}):
        commitment = post.find('span', {'class': 'sort-by-commitment posting-category small-category-label'}).text
    csv_writer.writerow([position, link, location, team, commitment])
I would suggest using css selectors instead of find.
import requests
from bs4 import BeautifulSoup
from csv import writer

response = requests.get('https://jobs.lever.co/twitch')
soup = BeautifulSoup(response.text, 'html.parser')
posts = soup.select('div.posting')

with open('twitch.csv', 'w') as csv_file:
    csv_writer = writer(csv_file)
    headers = ['Position', 'Link', 'Location', 'Team', 'Commitment']
    csv_writer.writerow(headers)
    for post in posts:
        position = post.select_one('h5[data-qa="posting-name"]').text
        link = post.select_one('a')['href']
        location = post.select_one('.sort-by-location').text
        team = post.select_one('.sort-by-team').text
        commitment = ''  # default for postings without a commitment label
        if post.select_one('.sort-by-commitment'):
            commitment = post.select_one('.sort-by-commitment').text
        csv_writer.writerow([position, link, location, team, commitment])
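As a small variant (just a sketch; text_or_blank is a hypothetical helper name, not part of BeautifulSoup), the "tag may be missing" lookup can be factored into one function so every column is handled the same way:

def text_or_blank(parent, selector):
    """Return the stripped text of the first match, or '' if the tag is missing."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag else ''

for post in posts:
    # Build the row with the same columns as above; missing labels become empty strings.
    row = [
        text_or_blank(post, 'h5[data-qa="posting-name"]'),
        post.select_one('a')['href'],
        text_or_blank(post, '.sort-by-location'),
        text_or_blank(post, '.sort-by-team'),
        text_or_blank(post, '.sort-by-commitment'),
    ]
    csv_writer.writerow(row)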
Answer 1 (score: 0)
You could also use try/except:
for post in posts:
    try:
        position = post.find('h5', {'data-qa': 'posting-name'}).text
        link = post.find('a')['href']
        location = post.find('span', {'class': 'sort-by-location posting-category small-category-label'}).text
        team = post.find('span', {'class': 'sort-by-team posting-category small-category-label'}).text
        commitment = post.find('span', {'class': 'sort-by-commitment posting-category small-category-label'}).text
        csv_writer.writerow([position, link, location, team, commitment])
    except:
        continue
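Note that with a bare except the whole posting is skipped whenever any lookup fails, so postings without a commitment label never make it into the csv. If you still want those rows, here is a sketch of a narrower variant (same variable names as above, only the commitment lookup is guarded):

for post in posts:
    position = post.find('h5', {'data-qa': 'posting-name'}).text
    link = post.find('a')['href']
    location = post.find('span', {'class': 'sort-by-location posting-category small-category-label'}).text
    team = post.find('span', {'class': 'sort-by-team posting-category small-category-label'}).text
    try:
        commitment = post.find('span', {'class': 'sort-by-commitment posting-category small-category-label'}).text
    except AttributeError:
        commitment = ''  # posting has no commitment label; keep the row with an empty column
    csv_writer.writerow([position, link, location, team, commitment])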