How to get all the "Read more" links from amazon.jobs using Python

Date: 2018-03-26 15:37:42

Tags: python amazon-web-services web-scraping beautifulsoup

I am a beginner in Python and I am trying to scrape all the "Read more" links from the Amazon jobs page. For example, I want to scrape this page: https://www.amazon.jobs/en/search?base_query=&loc_query=Greater+Seattle+Area%2C+WA%2C+United+States&latitude=&longitude=&loc_group_id=seattle-metro&invalid_location=false&country=&city=&region=&county=

Below is the code I am using.

#import the library used to query a website
import urllib2
#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup

#specify the url
url = "https://www.amazon.jobs/en/search?base_query=&loc_query=Greater+Seattle+Area%2C+WA%2C+United+States&latitude=&longitude=&loc_group_id=seattle-metro&invalid_location=false&country=&city=&region=&county="

#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(url)

#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "lxml")
print soup.find_all("a")

Output:

[<a class="icon home" href="/en">Home</a>,
 <a class="icon check-status" data-target="#icims-portal-selector" data-toggle="modal">Review application status</a>,
 <a class="icon working" href="/en/working/working-amazon">Amazon culture &amp; benefits</a>,
 <a class="icon locations" href="/en/locations">Locations</a>,
 <a class="icon teams" href="/en/business_categories">Teams</a>,
 <a class="icon job-categories" href="/en/job_categories">Job categories</a>,
 <a class="icon help" href="/en/faqs">Help</a>,
 <a class="icon language" data-animate="false" data-target="#locale-options" data-toggle="collapse" href="#locale-options" id="current-locale">English</a>,
...
 <a href="/en/privacy/us">Privacy and Data</a>,
 <a href="/en/impressum">Impressum</a>]

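I only get the links of the page's static elements, which are the same for any query, but I need the links to the 4896 jobs. Can anyone point out what I am doing wrong?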

1 answer:

Answer 0 (score: 0)

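As you have already noticed, your request returns only static elements, because the job links are generated by JavaScript. To get JavaScript-generated content you would need Selenium or a similar client that runs JavaScript. However, if you inspect the HTTP traffic, you will notice that the job data is loaded by an XHR request to the API endpoint /search.json, which returns JSON data.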

So, using urllib2 and json, we can get the total number of results and then collect all the data:

import json
import urllib
import urllib2

# search.json endpoint observed in the browser's XHR traffic
api_url = 'https://www.amazon.jobs/search.json?radius=24km&facets[]=location&facets[]=business_category&facets[]=category&facets[]=schedule_type_id&facets[]=employee_class&facets[]=normalized_location&facets[]=job_function_id&offset=0&result_limit={results}&sort=relevant&loc_group_id=seattle-metro&latitude=&longitude=&loc_group_id=seattle-metro&loc_query={location}&base_query={query}&city=&country=&region=&county=&query_options=&'
query = ''
# Percent-encode the location so its spaces and commas are valid in a URL
location = urllib.quote_plus('Greater Seattle Area, WA, United States')

# First request: fetch a few results only to read the total hit count
request = urllib2.urlopen(api_url.format(query=query, location=location, results=10))
results = json.loads(request.read())['hits']

# Second request: ask for all hits at once
request = urllib2.urlopen(api_url.format(query=query, location=location, results=results))
jobs = json.loads(request.read())['jobs']
# Turn each relative job_path into an absolute link
for i in jobs:
    i['job_path'] = 'https://www.amazon.jobs' + i['job_path']
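
The code above is Python 2 (urllib2 does not exist in Python 3). As a rough sketch of the same idea in Python 3, the URL can be built with urllib.parse.urlencode so that the location is percent-encoded automatically. The parameter names are copied from the URL above; the helper names build_search_url and fetch_json are made up for illustration, the facets[] parameters are left out for brevity, and it is an assumption that the endpoint accepts this reduced parameter set.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.amazon.jobs/search.json"


def build_search_url(query, location, offset=0, result_limit=10):
    """Return the search.json URL with every parameter percent-encoded.

    Hypothetical helper: parameter names come from the URL in the answer;
    the facets[] parameters are omitted (assumed optional).
    """
    params = [
        ("radius", "24km"),
        ("offset", offset),
        ("result_limit", result_limit),
        ("sort", "relevant"),
        ("loc_group_id", "seattle-metro"),
        ("loc_query", location),
        ("base_query", query),
    ]
    return BASE_URL + "?" + urlencode(params)


def fetch_json(url):
    """Download a URL and decode its JSON body (performs a network request)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


# Intended use, mirroring the two requests above (not executed here):
#   first = fetch_json(build_search_url("", "Greater Seattle Area, WA, United States"))
#   everything = fetch_json(build_search_url("", "Greater Seattle Area, WA, United States",
#                                            result_limit=first["hits"]))
#   jobs = everything["jobs"]
```

urlencode turns the spaces into + and the commas into %2C, which matches the loc_query value in the search page URL from the question.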

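The jobs list contains one dictionary per job, holding all of its information (title, state, city, etc.). If you want a specific item (for example the link), just loop over the list and pick that item.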

links = [i['job_path'] for i in jobs]
print links
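
Asking for every hit in one request may be slow, or the server may cap result_limit (that is an assumption; it was not checked here). A possible variation, sketched in Python 3, is to walk through the results in pages using the offset parameter and to build the links defensively. The names page_offsets and job_links and the sample data are hypothetical; only the job_path key and the site address come from the code above.

```python
from urllib.parse import urljoin

SITE = "https://www.amazon.jobs"


def page_offsets(total_hits, page_size=100):
    """Return the offset of each page needed to cover total_hits results."""
    return list(range(0, total_hits, page_size))


def job_links(jobs):
    """Build absolute links from 'job_path', skipping entries without one."""
    return [urljoin(SITE, job["job_path"]) for job in jobs if job.get("job_path")]


# Hypothetical sample shaped like the 'jobs' list described above.
sample_jobs = [
    {"title": "Example role A", "job_path": "/en/jobs/000001/example-role-a"},
    {"title": "Example role B"},  # no job_path, so it is skipped
]
links = job_links(sample_jobs)
print(links)
```

Each offset from page_offsets would go into the offset parameter of one request, with result_limit set to the page size. Because urljoin leaves an already absolute URL unchanged, job_links can also be run on a list whose job_path values were already prefixed by the loop above.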