Question

I'm a Python beginner, and I just want to scrape all the "read more" links from the Amazon jobs page. For example, I want to scrape this page: https://www.amazon.jobs/en/search?base_query=&loc_query=Greater+Seattle+Area%2C+WA%2C+United+States&latitude=&longitude=&loc_group_id=seattle-metro&invalid_location=false&country=&city=&region=&county=

Below is the code I'm using.

#import the library used to query a website
import urllib2
#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup

#specify the url
url = "https://www.amazon.jobs/en/search?base_query=&loc_query=Greater+Seattle+Area%2C+WA%2C+United+States&latitude=&longitude=&loc_group_id=seattle-metro&invalid_location=false&country=&city=&region=&county="

#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(url)

#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "lxml")
print soup.find_all("a")

Output:

[<a class="icon home" href="/en">Home</a>,
 <a class="icon check-status" data-target="#icims-portal-selector" data-toggle="modal">Review application status</a>,
 <a class="icon working" href="/en/working/working-amazon">Amazon culture &amp; benefits</a>,
 <a class="icon locations" href="/en/locations">Locations</a>,
 <a class="icon teams" href="/en/business_categories">Teams</a>,
 <a class="icon job-categories" href="/en/job_categories">Job categories</a>,
 <a class="icon help" href="/en/faqs">Help</a>,
 <a class="icon language" data-animate="false" data-target="#locale-options" data-toggle="collapse" href="#locale-options" id="current-locale">English</a>,
...
 <a href="/en/privacy/us">Privacy and Data</a>,
 <a href="/en/impressum">Impressum</a>]

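I only get the links to the page's static elements, which are the same for any query, but I need the links to all 4896 jobs. Can anyone point out what I'm doing wrong?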

Answer 1

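As you've already noticed, your request only returns the static elements, because the job links are generated by JavaScript. To get JavaScript-generated content you would need Selenium or a similar client that runs JavaScript. However, if you inspect the HTTP traffic, you'll notice that the job data is loaded by an XHR request to an API endpoint, /search.json, which returns JSON data.

So, using urllib2 and json, we can get the total number of results and then collect all the data: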

import json
import urllib
import urllib2

# The {results}, {location} and {query} placeholders are filled in with str.format
api_url = 'https://www.amazon.jobs/search.json?radius=24km&facets[]=location&facets[]=business_category&facets[]=category&facets[]=schedule_type_id&facets[]=employee_class&facets[]=normalized_location&facets[]=job_function_id&offset=0&result_limit={results}&sort=relevant&loc_group_id=seattle-metro&latitude=&longitude=&loc_group_id=seattle-metro&loc_query={location}&base_query={query}&city=&country=&region=&county=&query_options=&'
query = ''
# URL-encode the location, since it contains spaces and commas
location = urllib.quote_plus('Greater Seattle Area, WA, United States')

# First request: fetch a small page just to read the total number of hits
request = urllib2.urlopen(api_url.format(query=query, location=location, results=10))
results = json.loads(request.read())['hits']

# Second request: ask for all of the results at once
request = urllib2.urlopen(api_url.format(query=query, location=location, results=results))
jobs = json.loads(request.read())['jobs']

# Turn each relative job path into an absolute URL
for i in jobs:
    i['job_path'] = 'https://www.amazon.jobs' + i['job_path']
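
The code above is Python 2. On Python 3, where urllib2 was split into urllib.request and urllib.parse, the same two-request idea might look roughly like the sketch below. The helper names (build_search_url, fetch_json, fetch_all_job_links) are made up for illustration, and the trimmed parameter list is an assumption: if the endpoint rejects it, add back the facets[] and other parameters from the original URL.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = 'https://www.amazon.jobs/search.json'


def build_search_url(query, location, result_limit, offset=0):
    """Build the search.json URL; urlencode escapes spaces and commas."""
    # Assumption: this reduced parameter set is enough for the endpoint.
    params = [
        ('offset', offset),
        ('result_limit', result_limit),
        ('sort', 'relevant'),
        ('loc_group_id', 'seattle-metro'),
        ('loc_query', location),
        ('base_query', query),
    ]
    return BASE_URL + '?' + urlencode(params)


def fetch_json(url):
    """Download a URL and decode its JSON body."""
    with urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))


def fetch_all_job_links(query, location):
    """Read the hit count first, then request that many results."""
    total = fetch_json(build_search_url(query, location, 10))['hits']
    jobs = fetch_json(build_search_url(query, location, total))['jobs']
    return ['https://www.amazon.jobs' + job['job_path'] for job in jobs]
```

Calling fetch_all_job_links('', 'Greater Seattle Area, WA, United States') would then perform the two requests and return the absolute links. Building the query string with urlencode avoids hand-escaping the location.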

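The jobs list contains dictionaries, each holding all the information for one job posting (title, state, city, etc.). If you want to select a specific item, such as the link, just loop over the list and pick that item.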

links = [i['job_path'] for i in jobs]
print links
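
If you want more than the link, the same loop can pick several items per job. Here is a minimal Python 3 sketch on made-up sample data. The key names 'title' and 'city' are assumptions based on the fields mentioned above, the job paths are placeholders, and select_fields is a hypothetical helper.

```python
# Hypothetical sample shaped like the 'jobs' list; real entries have more keys.
jobs = [
    {'title': 'Software Development Engineer', 'city': 'Seattle',
     'job_path': 'https://www.amazon.jobs/example-job-1'},
    {'title': 'Data Engineer',
     'job_path': 'https://www.amazon.jobs/example-job-2'},
]


def select_fields(jobs, fields):
    """Return one dict per job with only the requested keys (None if missing)."""
    return [{field: job.get(field) for field in fields} for job in jobs]


rows = select_fields(jobs, ['title', 'city', 'job_path'])
for row in rows:
    print(row['title'], row['city'], row['job_path'])
```

Using dict.get means a job that lacks one of the keys yields None instead of raising KeyError.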

How to get all the "read more" links from amazon.jobs using Python
