When scraping a company website with BeautifulSoup, I get the same job title for multiple jobs

Asked: 2017-09-02 11:43:19

Tags: python web-scraping beautifulsoup

I have written a script in Python using BeautifulSoup, and I am using it to scrape job postings from a website (I have permission to do so).

Problem

The scraper runs fine, but it returns the same title for different job postings, whereas each posting should have its own title.

Code

import requests
from bs4 import BeautifulSoup 

base = "http://implementconsultinggroup.com"
url = "http://implementconsultinggroup.com/career/#/1143"

req = requests.get(url).text
soup = BeautifulSoup(req,'html.parser')
links = soup.select("a")

for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        res = requests.get(base + link.get("href")).text
        soup = BeautifulSoup(res,'html.parser')
        title = soup.select_one("h1.page-intro__title").get_text() if soup.select_one("h1.section__title") else ""
        overview = soup.select_one("p.page-intro__longDescription").get_text()
        details = soup.select_one("div.rte").get_text()
        print(title, link, details) 

Result

For some reason, every job posting is given the same title, while everything else (URL, body copy, etc.) is unique.

TITLE:    Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-to-improve-value-creation-and-finance-functions/"

TITLE:    Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-with-unique-competences-within-hr-excellence/"

TITLE:    Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-within-supply-chain-management/"

TITLE:    Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-within-leadership-development-or-change-management/"

TITLE:    Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-to-help-our-customers-succeed-with-it/"

Expected result

The result should look like this, with a unique title for each posting:

TITLE:    Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-within-leadership-development-or-change-management/"

TITLE:    Management Consultants to help our customers succeed with IT functions\r\n LINK href="/career/management-consultants-to-help-our-customers-succeed-with-it/"

Edit

I tried the following code, but I still see the same title for every job:

import requests
from bs4 import BeautifulSoup 

base = "http://implementconsultinggroup.com"
url = "http://implementconsultinggroup.com/career/#/1143"

req = requests.get(url).text
soup = BeautifulSoup(req,'html.parser')

for link in soup.select("a"):
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        res = requests.get(base + link.get("href")).text
        soup = BeautifulSoup(res,'html.parser')
        try:
            title = soup.select_one("h1.page-intro__title").get_text().strip()
        except:
            title = ''
        print(title)

1 answer:

Answer 0 (score: 2)

Try this line; hopefully it solves the problem:

title = soup.select_one("h1.page-intro__title").get_text() if soup.select_one("h1.section__title") else ""
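Note that the one-liner above tests one selector (h1.section__title) but reads the text from another (h1.page-intro__title), so it could raise AttributeError on a page that has the former but not the latter. A minimal sketch of a variant that looks the element up once and falls back to an empty string is shown below. The helper name extract_title and the inline HTML fragments are made up for illustration; they are not taken from the website.

```python
from bs4 import BeautifulSoup


def extract_title(html, selector="h1.page-intro__title"):
    """Return the stripped text of the first element matching selector, or ''."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)  # None when nothing matches
    return node.get_text().strip() if node is not None else ""


# Made-up HTML fragments, used only to exercise the helper.
with_title = '<h1 class="page-intro__title">\r\n Example job title </h1>'
without_title = "<h2>No matching heading here</h2>"

print(repr(extract_title(with_title)))
print(repr(extract_title(without_title)))
```

Looking the element up once keeps the existence check and the text extraction on the same selector, so the two cannot drift apart.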

Alternatively, you can do it like this:

import requests
from bs4 import BeautifulSoup 

base = "http://implementconsultinggroup.com"
url = "http://implementconsultinggroup.com/career/#/1143"

req = requests.get(url).text
soup = BeautifulSoup(req,'html.parser')

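for link in soup.select("a"):
    href = link.get("href") or ""  # some anchors have no href attribute
    if "career" in href and 'COPENHAGEN' in link.text:
        res = requests.get(base + href).text
        job_soup = BeautifulSoup(res,'html.parser')  # separate name, so the listing soup is not shadowed
        try:
            title = job_soup.select_one("h1.page-intro__title").get_text().strip()
        except AttributeError:  # select_one returned None: no title element on this page
            title = ''
        print(title)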

The results are as follows:

Management consultants to improve value creation and finance functions
Management consultants with unique competences within Organisation & HR
Management consultants within supply chain management
Management consultants within leadership development or change management
Management consultants to help our customers succeed with IT
Management consultants within process improvement
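The link-filtering step of the loop can also be isolated and checked without any network access. The sketch below is only an illustration under stated assumptions: job_urls is a hypothetical helper, the (href, text) pairs stand in for link.get("href") and link.text, and urljoin from the standard library replaces the string concatenation base + href.

```python
from urllib.parse import urljoin


def job_urls(base, anchors, keyword="career", city="COPENHAGEN"):
    """Build absolute URLs for (href, text) pairs that look like job postings."""
    urls = []
    for href, text in anchors:
        if not href:  # anchors without an href attribute yield None
            continue
        if keyword in href and city in text:
            urls.append(urljoin(base, href))
    return urls


# Hypothetical (href, text) pairs standing in for link.get("href") and link.text.
anchors = [
    ("/career/example-job-a/", "Example job A COPENHAGEN"),
    (None, "Anchor without href"),
    ("/about/", "About us COPENHAGEN"),
    ("/career/example-job-b/", "Example job B STOCKHOLM"),
]

print(job_urls("http://implementconsultinggroup.com", anchors))
```

Printing the collected URLs before fetching them makes it easier to confirm that each request really targets a different posting.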

Updated result

(u'Management consultants to improve value creation and finance functions', <a class="box-link" href="/career/management-consultants-to-improve-value-creation-and-finance-functions/">\n<h2
(u'Management consultants to improve value creation and finance functions', <a class="box-link" href="/career/management-consultants-with-unique-competences-within-hr-excellence/">\n<h2
(u'Management consultants to improve value creation and finance functions', <a class="box-link" href="/career/management-consultants-within-supply-chain-management/">\n