Scraping href links in Python using BeautifulSoup

Time: 2021-05-12 16:52:54

Tags: python beautifulsoup

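I am trying to write web-scraping code that collects information from LinkedIn job postings, including the job description, date, role, and the link to the posting itself. I have made good progress on extracting the job details, but I am stuck on how to get the "href" link for each posting. I have tried several approaches, including driver.find_element_by_class_name and the select_one method, but none of them gets the "canonical" link; each attempt returns no value. Can you shed some light on this?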

This is the part of my code where I try to get the href link:

    import requests
    from bs4 import BeautifulSoup

    # The URL must be a quoted string literal
    url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"

    reqs = requests.get(url)
    soup = BeautifulSoup(reqs.text, 'html.parser')
    # Print the href of every <link> tag on the page
    for link in soup.find_all('link'):
        print(link.get('href'))

Link: https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click

[Image: the code where the href link is stored]

2 Answers:

Answer 0: (score: 0)

I think you were accessing the href attribute incorrectly. To access a tag's attributes, use object["attribute_name"].

This works for me; it searches only for links with rel="canonical":

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"

    reqs = requests.get(url)
    soup = BeautifulSoup(reqs.text, 'html.parser')
    for link in soup.find_all('link', rel='canonical'):
        print(link['href'])
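For comparison, here is a minimal offline sketch of the two ways to read an attribute from a tag. The HTML string and the example.com URL are made-up stand-ins, not the real LinkedIn markup. Subscript access (`link["href"]`) raises KeyError when the attribute is missing, while `.get("href")` returns None instead.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page <head>; not the real LinkedIn markup.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<link rel="canonical" href="https://example.com/jobs/view/123">
<link rel="icon">
</head><body></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Subscript access: fine here because every matched tag has an href.
canonical_links = [link["href"] for link in soup.find_all("link", rel="canonical")]

# .get() returns None for the tag without an href instead of raising KeyError.
all_hrefs = [link.get("href") for link in soup.find_all("link")]

print(canonical_links)
print(all_hrefs)
```

Filtering with rel="canonical" is what narrows the result to the single link of interest; without the filter, every `<link>` tag on the page is returned.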

Answer 1: (score: 0)

The <link> tag has the attribute rel="canonical". You can use the [attribute=value] CSS selector, here [rel="canonical"], to get the value.

To use a CSS selector, use the .select_one() method instead of find().

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
    reqs = requests.get(url)
    soup = BeautifulSoup(reqs.text, 'html.parser')

    print(soup.select_one('[rel="canonical"]')['href'])

Output:

https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D
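One caveat with the one-liner: select_one() returns None when nothing matches, so subscripting the result raises TypeError on a page that has no canonical link. Below is a small sketch of a guarded version. The helper name canonical_url and the sample HTML strings are illustrative assumptions, not part of the original answer.

```python
from typing import Optional

from bs4 import BeautifulSoup


def canonical_url(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if absent.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one('link[rel="canonical"]')
    if tag is None:
        # No <link rel="canonical"> on this page.
        return None
    return tag.get("href")


# Made-up sample documents; not the real LinkedIn markup.
with_canonical = '<head><link rel="canonical" href="https://example.com/jobs/view/123"></head>'
without_canonical = '<head><link rel="stylesheet" href="/static/site.css"></head>'

print(canonical_url(with_canonical))
print(canonical_url(without_canonical))
```

To use it on a live page, pass requests.get(url).text to the helper, assuming the server returns the same HTML to the script as it does to a browser.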