How do I scrape links from the job listings website monster.com?

Time: 2019-11-25 20:58:36

Tags: python python-3.x web-scraping

I want to scrape the URLs of specific job postings from a monster.com page that displays search results:

[screenshot of the search results page]

If you look at the HTML, you will see that the URLs appear in blocks like this:

<script type="application/ld+json">
            {"@context":"https://schema.org","@type":"ItemList","mainEntityOfPage":{
            "@type":"CollectionPage","@id":"https://www.monster.com/jobs/search/?q=python&amp;where=aurora__2C-co&amp;stpage=1&amp;page=10"
            }
            ,"itemListElement":[

                 {"@type":"ListItem","position":2251,"url":"https://job-openings.monster.com/19-16001-senior-python-developer-boulder-co-us-sunrise-systems-inc/e09cfe38-2a32-465d-bd66-8846b9549c6a"}

The desired output of our web scraper is a list of strings:

L = [
    "https://job-openings.monster.com/senior-python-architect-boulder-co-us-experis/26b7c4e8-ec4f-4d93-84e4-959fd28e150a",
    "https://job-openings.monster.com/predictive-analytics-developer-python-100-remote-denver-co-us-edp-recruiting-services/e5041b2e-28fd-4036-9f17-0a3510a457dc",
    "https://job-openings.monster.com/python-automation-engineer-denver-co-us-apidel-technologies/77e8f683-2e91-403f-b663-def61b62226e",
    "https://job-openings.monster.com/immediate-need-for-python-developer-6-month-contract-onsite-in-boulder-co-boulder-co-us-addon-technologies-inc/e2826a70-490b-4e16-a4bb-05e767c8fb1f",
    "https://job-openings.monster.com/software-test-technician-englewood-co-us-kratos-defense-security-solutions/fa39cdfe-0fe8-4e02-b325-28f21561ac33" 
]

The web page displaying the search results ends with a page=1 parameter. We want to increment it until the "Load more jobs" button turns into a "No More Results" message.


My code is below. It doesn't work very well:

import urllib.request

# BEFORE CLICK `LOAD MORE JOBS` BUTTON
#    https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=2
# AFTER CLICK LOAD MORE JOBS
#    https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=3
# AT END OF URL, `page=2` changes to `page=3`

prefix = "https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=1"

sentinel = """
<a class="mux-btn btn-secondary no-more-jobs-btn disabled "
style="display:none" id="noMoreResults" role="button">No More Results</a>
"""
# Strip newlines so the sentinel can be matched against the raw page source.
predicate = lambda ch: ch not in "\n\r"
sentinel = "".join(filter(predicate, sentinel))

for page_num in range(1, 90):
    print("page_num ==", page_num)
    fp = urllib.request.urlopen(prefix + str(page_num))
    mybytes = fp.read()
    page_html = mybytes.decode("utf8")
    fp.close()
    if sentinel in page_html:
        break
# `page_html` is the output of the script above

print("len(page_html) == len(page_html)")

class LineIter:
    """Iterate over the lines of a string, skipping blank lines."""
    def __init__(self, stryng):
        self.it = iter(str(stryng))
        self.delims = "\n\r"
        self.depleted = False
    def __iter__(self):
        return self
    def __next__(self):
        if self.depleted:
            raise StopIteration()
        try:
            while True:
                ch = next(self.it)
                if ch not in self.delims:
                    break
            line = list()
            while ch not in self.delims:
                line.append(ch)
                ch = next(self.it)
            r = "".join(line)
        except StopIteration:
            self.depleted = True
            try:
                r = "".join(line)
            except BaseException:
                r = ""
        return r

# Collect every job-posting URL that appears in the page source.
urls = list()
for line in LineIter(page_html):
    print(line)
    start = line.find("https://job-openings.monster.com/")
    if start >= 0:
        stop = line.find('"', start)
        urls.append(line[start:stop])

1 Answer:

Answer 0 (score: 0)

As I already mentioned in the comments, you can usually avoid this sort of thing entirely by using the provided API.

However, to answer what you actually asked: searching the HTML string by hand will be very painful and error-prone. Here is how it can be done better.

First, have BeautifulSoup parse the HTML for you:

import requests
from bs4 import BeautifulSoup

url = 'https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=2'
html = requests.get(url).text

soup = BeautifulSoup(html, 'lxml')

Now you can search for and extract the <script> tag that contains the data you are looking for:

tags = soup('script', {'type': 'application/ld+json'})

# On this page, the data is in the second of two tags. You'll want to verify this for other pages.
tag = tags[-1]
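
If you would rather not depend on the tag's position, one alternative (a small sketch, not part of the original answer) is to pick whichever JSON-LD block actually mentions the job list:

# Sketch: select the JSON-LD <script> tag whose payload contains "itemListElement",
# instead of assuming it is always the last one; `tag` is None if no job list is found.
tag = next((t for t in tags if 'itemListElement' in t.text), None)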

Then you can parse the contents of the tag as JSON:

import json

data = json.loads(tag.text)

and access the data in it as a dictionary:

>>> data['itemListElement']
[{'@type': 'ListItem',
  'position': 51,
  'url': 'https://job-openings.monster.com/predictive-analytics-developer-python-100-remote-denver-co-us-edp-recruiting-services/e5041b2e-28fd-4036-9f17-0a3510a457dc'},
#
# [...]
#
 {'@type': 'ListItem',
  'position': 108,
  'url': 'https://job-openings.monster.com/uipath-rpa-architect-denver-co-us-enquero/f14cc922-dd4d-4636-a3d4-dc3b18eec843'}]

To get your desired output, simply extract all of the URLs into a list, filtering out any empty strings:

>>> [el['url'] for el in data['itemListElement'] if el['url']]
['https://job-openings.monster.com/predictive-analytics-developer-python-100-remote-denver-co-us-edp-recruiting-services/e5041b2e-28fd-4036-9f17-0a3510a457dc',
 'https://job-openings.monster.com/python-automation-engineer-denver-co-us-apidel-technologies/77e8f683-2e91-403f-b663-def61b62226e',
# [...]
 'https://job-openings.monster.com/full-stack-java-engineer-denver-co-us-srinav-inc/fda6f2fd-be2f-4199-90eb-585ff8f96874',
 'https://job-openings.monster.com/uipath-rpa-architect-denver-co-us-enquero/f14cc922-dd4d-4636-a3d4-dc3b18eec843']
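
Putting both parts together, a rough sketch of a paginated crawler could look like the following. The stopping condition (no itemListElement entries on a page) is an assumption on my part; checking for the "No More Results" sentinel from the question would work as well:

import json

import requests
from bs4 import BeautifulSoup

base = "https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page={}"
all_urls = []

for page_num in range(1, 90):
    html = requests.get(base.format(page_num)).text
    soup = BeautifulSoup(html, "lxml")
    tags = soup("script", {"type": "application/ld+json"})

    # Find the JSON-LD block that holds the job list, if there is one.
    data = None
    for tag in tags:
        try:
            candidate = json.loads(tag.text)
        except ValueError:
            continue
        if "itemListElement" in candidate:
            data = candidate
            break

    # Assumed stopping condition: no job items were returned for this page.
    if not data or not data.get("itemListElement"):
        break

    all_urls.extend(el["url"] for el in data["itemListElement"] if el.get("url"))

print(len(all_urls), "job URLs collected")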