Looking up the href from 'a' tags does not find the first 'a' tag. How can I fix this?

Asked: 2021-07-19 17:21:38

Tags: python web-scraping beautifulsoup href

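I'm new to Python and I'm trying my hand at scraping. For some reason, each job posting is stored under an 'a' tag rather than a div, and that 'a' tag also carries the href. Here is the output of print(item) for one item: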

<a class="tapItem fs-unmask result job_e0fb3e5f520856c0 resultWithShelf sponTapItem tapItem-noPadding desktop" data-hide-spinner="true" data-jk="e0fb3e5f520856c0" data-mobtk="1favs1gn0t5v1800" href="/company/Acentury/jobs/New-Graduate-Software-Developer-e0fb3e5f520856c0?fccid=5c6453896b020232&amp;vjs=3" id="job_e0fb3e5f520856c0" rel="nofollow" target="_blank"><div class="slider_container"><div class="slider_list"><div class="slider_item"><div class="job_seen_beacon"><table cellpadding="0" cellspacing="0" class="jobCard_mainContent" role="presentation"><tbody><tr><td class="resultContent"><div class="heading4 color-text-primary singleLineTitle tapItem-gutter"><h2 class="jobTitle jobTitle-color-purple jobTitle-newJob"><div class="new topLeft holisticNewBlue desktop"><span class="label">new</span></div><span title="New Graduate Software Developer">New Graduate Software Developer</span></h2></div><div class="heading6 company_location tapItem-gutter"><pre><span class="companyName">Acentury</span><div class="companyLocation">Richmond Hill, ON<span class="remote-bullet">•</span><span>Temporarily Remote</span></div></pre></div><div class="heading6 tapItem-gutter metadataContainer"><div class="metadata salary-snippet-container"><span class="salary-snippet">$44,182 - $126,699 a year</span></div></div><div class="heading6 error-text tapItem-gutter"></div></td></tr></tbody></table><table class="jobCardShelfContainer" role="presentation"><tbody><tr class="jobCardShelf"><td class="shelfItem indeedApply"><span class="iaIcon"></span><span class="ialbl iaTextBlack">Easily apply</span></td></tr><tr class="underShelfFooter"><td><div class="heading6 tapItem-gutter result-footer"><div class="job-snippet"><ul style="list-style-type:circle;margin-top: 0px;margin-bottom: 0px;padding-left:20px;">
<li>Work with senior <b>developers</b> to develop front-end features on our current platform through entire R&amp;D cycle from design to implementation and official release.</li>
</ul></div><span class="date">Today</span><span class="result-link-bar-separator">·</span><button aria-expanded="false" class="sl resultLink more_links_button" type="button">More...</button></div><div class="tab-container"><div class="more-links-container result-tab" role="presentation"><div class="more_links"><button class="close-button" title="Close" type="button"></button><ul><li><span class="mat">View all <a href="/Acentury-jobs">Acentury jobs</a> - <a href="/jobs-in-Richmond-Hill,-ON">Richmond Hill jobs</a></span></li><li><span class="mat">Salary Search: <a href="/career/software-engineer/salaries/Richmond-Hill--ON?campaignid=serp-more&amp;fromjk=e0fb3e5f520856c0&amp;from=serp-more">New Graduate Software Developer salaries in Richmond Hill, ON</a></span></li></ul></div></div></div></td></tr></tbody></table><div aria-live="polite"></div></div></div><div class="slider_sub_item"></div></div></div><div class="kebabMenu"><button aria-expanded="false" aria-haspopup="true" aria-label="Job actions" class="kebabMenu-button"><svg fill="none" height="24" viewbox="0 0 24 24" width="24" xmlns="http://www.w3.org/2000/svg"><path d="M12 7C13.1 7 14 6.1 14 5C14 3.9 13.1 3 12 3C10.9 3 10 3.9 10 5C10 6.1 10.9 7 12 7ZM12 10C10.9 10 10 10.9 10 12C10 13.1 10.9 14 12 14C13.1 14 14 13.1 14 12C14 10.9 13.1 10 12 10ZM12 17C10.9 17 10 17.9 10 19C10 20.1 10.9 21 12 21C13.1 21 14 20.1 14 19C14 17.9 13.1 17 12 17Z" fill="#2d2d2d"></path></svg></button></div></a> 

My code is:

divs = soup.find_all('a', class_ = 'tapItem')
for item in divs:
   for people in item.find_all('a'):
       print(people)   
       for ok in people.find_all('a', class_ = 'tapItem'):
           linkJob1 = ok.get('href')
   print(linkJob1)

people does not contain the first (outer) 'a' tag, only the other 'a' tags nested inside it. How can I fix this? Thanks.

URL: https://ca.indeed.com/jobs?q=software+developer&l=Toronto%2C+ON&start=0

The expected result is the href of each job posting/card.

1 Answer:

Answer 0: (score: 1)

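If you loop at the level of the elements with class result, you only need one ID per card (the job ID), which you can extract from the data-jk attribute. You can then build the URL dynamically, the same way the site does: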

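import requests
from bs4 import BeautifulSoup as bs

# Fetch the search results page (same query as in the question).
r = requests.get('https://ca.indeed.com/jobs?q=software+developer&l=Toronto,+ON&start=0')
soup = bs(r.content, 'lxml')  # the 'lxml' parser requires the lxml package

# Each job card has class "result"; its data-jk attribute holds the job ID.
for job in soup.select('.result'):
    print(job.select_one('.jobTitle').get_text(' '))
    print(f'https://ca.indeed.com/viewjob?jk={job["data-jk"]}')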
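
As a side note on why the loop in the question misses the outer tag: find_all() called on a tag searches only that tag's descendants, so item.find_all('a') can never return item itself. The href asked for can be read straight from the matched card. Below is a minimal offline sketch of both approaches. SAMPLE_HTML is a trimmed-down stand-in for the card printed in the question (not the live page), and BASE_URL plus the helper names card_links, nested_links and viewjob_links are assumptions made up for this illustration. It uses the built-in 'html.parser' so that no extra parser package is needed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed site root, used to turn relative hrefs into absolute URLs.
BASE_URL = "https://ca.indeed.com"

# Trimmed-down stand-in for one job card from the question (not the live page).
SAMPLE_HTML = """
<a class="tapItem result" data-jk="e0fb3e5f520856c0" href="/company/Acentury/jobs/New-Graduate-Software-Developer-e0fb3e5f520856c0">
  <h2 class="jobTitle"><span>New Graduate Software Developer</span></h2>
  <span class="mat">View all <a href="/Acentury-jobs">Acentury jobs</a></span>
</a>
"""


def card_links(html):
    """Return the absolute href of every job card (the outer 'a' tags)."""
    soup = BeautifulSoup(html, "html.parser")
    # The matched tag already is the card, so read its href directly.
    return [urljoin(BASE_URL, card["href"])
            for card in soup.find_all("a", class_="tapItem")]


def nested_links(html):
    """Return what the question's inner loop sees: only the nested 'a' tags."""
    soup = BeautifulSoup(html, "html.parser")
    card = soup.find("a", class_="tapItem")
    # find_all() searches descendants only, so the card itself is excluded.
    return [a["href"] for a in card.find_all("a")]


def viewjob_links(html):
    """Build the viewjob URL from data-jk, as in the answer above."""
    soup = BeautifulSoup(html, "html.parser")
    return [f"{BASE_URL}/viewjob?jk={job['data-jk']}"
            for job in soup.select(".result")]


if __name__ == "__main__":
    print(card_links(SAMPLE_HTML))
    print(nested_links(SAMPLE_HTML))
    print(viewjob_links(SAMPLE_HTML))
```

Either approach gives one link per card. Reading href directly keeps the URL the page itself links to, while the data-jk route depends on the viewjob URL pattern staying the same.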