I am trying to write web-scraping code to collect information from LinkedIn job postings, including the job description, date, role, and the link to each posting. I have made good progress extracting the job information itself, but I am currently stuck on how to get the "href" link for each posting. I have tried several approaches, including driver.find_element_by_class_name and the select_one method, but none of them returns the "canonical" link; the result is always empty. Can you shed some light on this?
Here is the part of my code where I try to get the href link:
import requests
from bs4 import BeautifulSoup

# The URL must be a quoted string
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

urls = []
for link in soup.find_all('link'):
    print(link.get('href'))
Answer 0 (score: 0)
I think you are accessing the href attribute incorrectly; to access an attribute, use object["attribute_name"].

This works for me, searching only for <link> tags with rel="canonical":
import requests
from bs4 import BeautifulSoup

url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

# Keep only the <link> tag whose rel attribute is "canonical"
for link in soup.find_all('link', rel='canonical'):
    print(link['href'])
Answer 1 (score: 0)
The <link> tag you want has the attribute rel="canonical", so you can use the [attribute=value] CSS selector, [rel="canonical"], to get it.

To use a CSS selector, call the .select_one() method instead of find():
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
print(soup.select_one('[rel="canonical"]')['href'])
Output:
https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D
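To see why both answers work without fetching the live page, here is a minimal offline sketch that runs both techniques against a small hand-written HTML snippet standing in for the page's <head> (the snippet and its URLs are illustrative, not the real LinkedIn markup):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the page head: several <link> tags,
# but only the rel="canonical" one carries the clean job URL.
html = """
<html><head>
<link rel="stylesheet" href="/styles.css">
<link rel="icon" href="/favicon.ico">
<link rel="canonical" href="https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153">
</head><body></body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Answer 0's technique: filter find_all by the rel attribute,
# then read the attribute with dictionary-style access.
via_find_all = soup.find_all('link', rel='canonical')[0]['href']

# Answer 1's technique: a CSS attribute selector with select_one.
via_css = soup.select_one('link[rel="canonical"]')['href']

assert via_find_all == via_css
print(via_css)
```

Note the difference in attribute access: link['href'] raises a KeyError if the attribute is missing, while link.get('href') returns None, which is why the original loop printed empty values for tags without an href.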