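I have written a script in Python using BeautifulSoup, and I am using it to scrape job postings from a website (I have permission to do so).
Problem
The scraper works well, but it returns the same title for different job postings, whereas the titles in the actual postings are different.
Code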
import requests
from bs4 import BeautifulSoup
base = "http://implementconsultinggroup.com"
url = "http://implementconsultinggroup.com/career/#/1143"
req = requests.get(url).text
soup = BeautifulSoup(req,'html.parser')
links = soup.select("a")
for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        res = requests.get(base + link.get("href")).text
        soup = BeautifulSoup(res,'html.parser')
        title = soup.select_one("h1.page-intro__title").get_text() if soup.select_one("h1.section__title") else ""
        overview = soup.select_one("p.page-intro__longDescription").get_text()
        details = soup.select_one("div.rte").get_text()
        print(title, link, details)
Result
For some reason, all job postings are given the same title, while everything else about them is unique (URL, body copy, etc.).
TITLE: Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-to-improve-value-creation-and-finance-functions/"
TITLE: Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-with-unique-competences-within-hr-excellence/"
TITLE: Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-within-supply-chain-management/"
TITLE: Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-within-leadership-development-or-change-management/"
TITLE: Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-to-help-our-customers-succeed-with-it/"
Expected result
The result should look like the following, with a unique title for each posting:
TITLE: Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-within-leadership-development-or-change-management/"
TITLE: Management Consultants to help our customers succeed with IT functions\r\n LINK href="/career/management-consultants-to-help-our-customers-succeed-with-it/"
Edit
I tried the following code, but I still see the same title for every job posting:
import requests
from bs4 import BeautifulSoup
base = "http://implementconsultinggroup.com"
url = "http://implementconsultinggroup.com/career/#/1143"
req = requests.get(url).text
soup = BeautifulSoup(req,'html.parser')
for link in soup.select("a"):
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        res = requests.get(base + link.get("href")).text
        soup = BeautifulSoup(res,'html.parser')
        try:
            title = soup.select_one("h1.page-intro__title").get_text().strip()
        except:
            title = ''
        print(title)
Answer 0 (score: 2)
Try this line; hopefully it solves the problem:
title = soup.select_one("h1.page-intro__title").get_text() if soup.select_one("h1.section__title") else ""
Alternatively, you can do it like this:
import requests
from bs4 import BeautifulSoup
base = "http://implementconsultinggroup.com"
url = "http://implementconsultinggroup.com/career/#/1143"
req = requests.get(url).text
soup = BeautifulSoup(req,'html.parser')
for link in soup.select("a"):
    # Anchors without an href would return None, so fall back to ""
    href = link.get("href") or ""
    # Follow only the career links that mention Copenhagen
    if "career" in href and 'COPENHAGEN' in link.text:
        res = requests.get(base + href).text
        soup = BeautifulSoup(res,'html.parser')
        try:
            title = soup.select_one("h1.page-intro__title").get_text().strip()
        except AttributeError:
            # select_one returned None: the page has no such heading
            title = ''
        print(title)
The result is as follows:
Management consultants to improve value creation and finance functions
Management consultants with unique competences within Organisation & HR
Management consultants within supply chain management
Management consultants within leadership development or change management
Management consultants to help our customers succeed with IT
Management consultants within process improvement
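The try/except idea above can also be written as a small helper, which makes it possible to check the extraction logic without hitting the site. This is only a sketch: the helper name `extract_title` and the HTML snippets are made up for illustration, and only the selector `h1.page-intro__title` comes from the code above. It tests for `None` explicitly and keeps each job page in its own variable rather than reassigning `soup`.

```python
from bs4 import BeautifulSoup


def extract_title(html, selector="h1.page-intro__title"):
    """Return the stripped text of the first match, or '' if there is none."""
    page = BeautifulSoup(html, "html.parser")
    node = page.select_one(selector)
    # select_one returns None when nothing matches, so check for that
    # explicitly instead of catching the resulting exception.
    return node.get_text(strip=True) if node is not None else ""


# Hypothetical job pages standing in for the real responses.
pages = [
    '<h1 class="page-intro__title">\r\n Job A </h1>',
    '<h1 class="page-intro__title">Job B</h1>',
    '<h1 class="other">No intro title here</h1>',
]
titles = [extract_title(p) for p in pages]
print(titles)
```

Inside the loop this would be called as `extract_title(res)`, where `res` is the text of each job page response.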
Updated result
(u'Management consultants to improve value creation and finance functions', <a class="box-link" href="/career/management-consultants-to-improve-value-creation-and-finance-functions/">\n<h2
(u'Management consultants to improve value creation and finance functions', <a class="box-link" href="/career/management-consultants-with-unique-competences-within-hr-excellence/">\n<h2
(u'Management consultants to improve value creation and finance functions', <a class="box-link" href="/career/management-consultants-within-supply-chain-management/">\n
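The updated output prints the whole `<a>` tag, which makes it hard to see which page was actually requested for each title. A small sketch, assuming nothing beyond the hrefs shown above, that builds the absolute URL with `urllib.parse.urljoin` so it can be printed next to each title. The helper name `job_url` is hypothetical.

```python
from urllib.parse import urljoin

base = "http://implementconsultinggroup.com"


def job_url(href):
    """Join a relative href from the listing page onto the site root."""
    return urljoin(base, href)


# Two of the relative links that appear in the output above.
hrefs = [
    "/career/management-consultants-within-supply-chain-management/",
    "/career/management-consultants-to-help-our-customers-succeed-with-it/",
]
urls = [job_url(h) for h in hrefs]
for u in urls:
    print(u)
```

Printing `job_url(link.get("href"))` together with `title` inside the loop would show whether each request really targets a different page.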