<script type="application/ld+json">
{"@context":"https://schema.org","@type":"ItemList","mainEntityOfPage":{
"@type":"CollectionPage","@id":"https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=10"
}
,"itemListElement":[
{"@type":"ListItem","position":2251,"url":"https://job-openings.monster.com/19-16001-senior-python-developer-boulder-co-us-sunrise-systems-inc/e09cfe38-2a32-465d-bd66-8846b9549c6a"}
L = [
"https://job-openings.monster.com/senior-python-architect-boulder-co-us-experis/26b7c4e8-ec4f-4d93-84e4-959fd28e150a",
"https://job-openings.monster.com/predictive-analytics-developer-python-100-remote-denver-co-us-edp-recruiting-services/e5041b2e-28fd-4036-9f17-0a3510a457dc",
"https://job-openings.monster.com/python-automation-engineer-denver-co-us-apidel-technologies/77e8f683-2e91-403f-b663-def61b62226e",
"https://job-openings.monster.com/immediate-need-for-python-developer-6-month-contract-onsite-in-boulder-co-boulder-co-us-addon-technologies-inc/e2826a70-490b-4e16-a4bb-05e767c8fb1f",
"https://job-openings.monster.com/software-test-technician-englewood-co-us-kratos-defense-security-solutions/fa39cdfe-0fe8-4e02-b325-28f21561ac33"
]
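The search URL ends with the parameter page=1. We want to increment it until the "Load More Jobs" button is replaced by the "No More Results" message.
import itertools as itts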
import string
import urllib.request
# BEFORE CLICK `LOAD MORE JOBS` BUTTON
# https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=2
# AFTER CLICK LOAD MORE JOBS
# https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=3
# AT END OF URL, `page=2` changes to `page=3`
# The page number is appended in the loop, so the prefix must stop at `page=`
prefix = "https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page="
sentinel = """
<a class="mux-btn btn-secondary no-more-jobs-btn disabled "
style="display:none" id="noMoreResults" role="button">No More Results</a>
"""
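# Strip line breaks from the sentinel; in Python 3, filter() returns an
# iterator, so the characters must be joined back into a string.
predicate = lambda ch: ch not in "\n\r"
sentinel = "".join(filter(predicate, sentinel))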
for page_num in range(1, 90):
    print("page_num ==", page_num)
    fp = urllib.request.urlopen(prefix + str(page_num))
    mybytes = fp.read()
    page_html = mybytes.decode("utf8")
    fp.close()
    if sentinel in page_html:
        break
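Since only the `page` parameter changes between requests, the URL can also be built from its parts rather than by string concatenation. The following is a minimal sketch, not part of the original script: `build_page_url` and `BASE_URL` are hypothetical names, and it assumes the `where` value `aurora__2C-co` is already in the form the site expects.

```python
from urllib.parse import urlencode

# Assumption: the search endpoint and parameter names are taken from the
# URLs shown above.
BASE_URL = "https://www.monster.com/jobs/search/"


def build_page_url(page, query="python", where="aurora__2C-co"):
    """Return the search URL for the given page number (hypothetical helper)."""
    params = {"q": query, "where": where, "stpage": 1, "page": page}
    return BASE_URL + "?" + urlencode(params)
```

Calling `build_page_url(page_num)` inside the loop would take the place of `prefix + str(page_num)`.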
# `page_html` is the output of the script above
print("len(page_html) ==", len(page_html))
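class LineIter:
    """Iterate over the lines of a string, skipping line delimiters."""
    def __init__(self, stryng):
        self.it = iter(str(stryng))
        self.delims = "\n\r"
        self.depleted = False
    def __iter__(self):
        # Required so that instances can be used directly in a `for` loop
        return self
    def __next__(self):
        if self.depleted:
            raise StopIteration()
        line = list()
        try:
            # Skip any leading line delimiters
            while True:
                ch = next(self.it)
                if ch not in self.delims:
                    break
            # Collect characters up to the next delimiter
            while ch not in self.delims:
                line.append(ch)
                ch = next(self.it)
        except StopIteration:
            # Input exhausted: return what was collected and stop next time
            self.depleted = True
        return "".join(line)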
urls = list()
for line in LineIter(page_html):
    print(line)
    start = line.find("https://job-openings.monster.com/")
    if start >= 0:
        stop = line.find('"', start)
        urls.append(line[start:stop])
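For comparison, the same line-by-line search can be sketched with `str.splitlines` and a regular expression from the standard library. This is an illustration under stated assumptions, not the original approach: `find_job_urls` and `JOB_URL_RE` are hypothetical names, and unlike the loop above this version picks up several URLs on one line and drops duplicates.

```python
import re

# Assumption: a job URL runs from the known host up to the next quote,
# whitespace character, or angle bracket.
JOB_URL_RE = re.compile(r'https://job-openings\.monster\.com/[^"\s<>]+')


def find_job_urls(page_html):
    """Return the distinct job URLs found in page_html, in order of appearance."""
    urls = []
    for line in page_html.splitlines():
        for url in JOB_URL_RE.findall(line):
            if url not in urls:
                urls.append(url)
    return urls
```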
Answer 0 (score: 0)
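As I already mentioned in a comment, you can usually avoid this kind of situation by simply using the provided API.
As for the question you asked, though: searching the HTML string by hand is painful and error-prone. Here is a better way to do it.
First, let BeautifulSoup parse the HTML for you: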
import requests
from bs4 import BeautifulSoup
url = 'https://www.monster.com/jobs/search/?q=python&where=aurora__2C-co&stpage=1&page=2'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
Now you can search for and extract the <script> tag that contains the data you are looking for:
tags = soup('script', {'type': 'application/ld+json'})
# On this page, the data is in the second of two tags. You'll want to verify this for other pages.
tag = tags[-1]
Then you can parse the tag's contents as JSON:
import json
data = json.loads(tag.text)
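Because the position of the relevant tag may differ from page to page, one defensive option is to select it by content instead of by index. The sketch below is an assumption-laden illustration: `pick_item_list` is a hypothetical helper that takes the text of each candidate tag and returns the first JSON object containing an `itemListElement` key.

```python
import json


def pick_item_list(json_texts):
    """Return the first parsed JSON object that has an 'itemListElement' key.

    Texts that are not valid JSON are skipped; returns None if nothing matches.
    """
    for text in json_texts:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and "itemListElement" in data:
            return data
    return None
```

With the `tags` list from above, this could be called as `data = pick_item_list(tag.text for tag in tags)`.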
and access its data as a dictionary:
>>> data['itemListElement']
[{'@type': 'ListItem',
'position': 51,
'url': 'https://job-openings.monster.com/predictive-analytics-developer-python-100-remote-denver-co-us-edp-recruiting-services/e5041b2e-28fd-4036-9f17-0a3510a457dc'},
#
# [...]
#
{'@type': 'ListItem',
'position': 108,
'url': 'https://job-openings.monster.com/uipath-rpa-architect-denver-co-us-enquero/f14cc922-dd4d-4636-a3d4-dc3b18eec843'}]
To get the desired output, just extract all the URLs into a list, filtering out any empty strings:
>>> [el['url'] for el in data['itemListElement'] if el['url']]
['https://job-openings.monster.com/predictive-analytics-developer-python-100-remote-denver-co-us-edp-recruiting-services/e5041b2e-28fd-4036-9f17-0a3510a457dc',
'https://job-openings.monster.com/python-automation-engineer-denver-co-us-apidel-technologies/77e8f683-2e91-403f-b663-def61b62226e',
# [...]
'https://job-openings.monster.com/full-stack-java-engineer-denver-co-us-srinav-inc/fda6f2fd-be2f-4199-90eb-585ff8f96874',
'https://job-openings.monster.com/uipath-rpa-architect-denver-co-us-enquero/f14cc922-dd4d-4636-a3d4-dc3b18eec843']
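To cover every page, the extraction can be wrapped in a loop over page numbers. The following is a rough sketch, not a tested scraper: `extract_urls` and `collect_urls` are hypothetical names, `fetch_data` stands for any function that returns the parsed JSON dictionary for a page number (for example, one built from the steps above), and stopping at the first page that yields no new URLs is an assumption about how the site behaves.

```python
def extract_urls(data):
    """Return the non-empty 'url' values from a parsed ItemList dictionary."""
    return [el["url"] for el in data.get("itemListElement", []) if el.get("url")]


def collect_urls(fetch_data, max_pages=89):
    """Collect distinct URLs page by page.

    fetch_data(page) must return a dict (or None) for the given page number.
    Assumption: a page that adds no new URLs means the results are exhausted.
    """
    seen = []
    for page in range(1, max_pages + 1):
        data = fetch_data(page) or {}
        new = [url for url in extract_urls(data) if url not in seen]
        if not new:
            break
        seen.extend(new)
    return seen
```

Passing the fetching function in as an argument keeps the loop independent of the network, so it can be tried against canned dictionaries first.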