As you can probably tell from the code, I'm still fairly new to Python.
I'm scraping Amazon Jobs search results, but after about 50 requests to the URL below I keep getting connection reset error 10054. I added the Crawlera proxy network to prevent getting banned, but it still isn't working. I know the URL is long, but it seems to work as-is without my having to break it into lots of separate parameters. The results page has about 12,000 jobs total at 10 jobs per page, so I don't even know if scraping that much data is reasonable in the first place. Amazon expresses each page in the URL as "result_limit=10", so I've been stepping through the pages in offsets of 10 rather than 1 page per request. Not sure if that's right. Also, the last page stops at offset 9,990.
The code works, but I can't figure out how to get past the connection error. As you can see, I've added things like a user agent, but I'm not sure it actually does anything. Any help would be appreciated, as I've been stuck on this for countless hours. Thanks!
import csv
import json
from datetime import datetime
from random import randint
from time import sleep, time
from warnings import warn

from fake_useragent import UserAgent
from IPython.display import clear_output
from requests import get


def get_all_jobs(pages):
    requests = 0
    start_time = time()
    total_runtime = datetime.now()
    for page in pages:
        try:
            # Randomize the User-Agent on each request
            ua = UserAgent()
            header = {
                'User-Agent': ua.random
            }
            response = get('https://www.amazon.jobs/en/search.json?base_query=&city=&country=USA&county=&'
                           'facets%5B%5D=location&facets%5B%5D=business_category&facets%5B%5D=category&'
                           'facets%5B%5D=schedule_type_id&facets%5B%5D=employee_class&facets%5B%5D=normalized_location'
                           '&facets%5B%5D=job_function_id&job_function_id%5B%5D=job_function_corporate_80rdb4&'
                           'latitude=&loc_group_id=&loc_query=USA&longitude=&'
                           'normalized_location%5B%5D=Seattle%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=San+Francisco'
                           '%2C+California%2C+USA&normalized_location%5B%5D=Sunnyvale%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Bellevue%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=East+Palo+Alto%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Santa+Monica%2C+California%2C+USA&offset={}&query_options=&'
                           'radius=24km&region=&result_limit=10&schedule_type_id%5B%5D=Full-Time&'
                           'sort=relevant'.format(page),
                           headers=header,
                           proxies={
                               "http": "http://1ea01axxxxxxxxxxxxxxxxxxx:@proxy.crawlera.com:8010/"
                           }
                           )
            # Monitor the frequency of requests
            requests += 1
            # Pause the loop for between 8 and 15 seconds
            sleep(randint(8, 15))
            current_time = time()
            elapsed_time = current_time - start_time
            print("Amazon Request: {}; Frequency: {} requests/s; Total Run Time: {}".format(
                requests, requests / elapsed_time, datetime.now() - total_runtime))
            clear_output(wait=True)
            # Warn on non-200 status codes
            if response.status_code != 200:
                warn("Request: {}; Status code: {}".format(requests, response.status_code))
            # Break the loop if the number of requests is greater than expected
            if requests > 999:
                warn("Number of requests was greater than expected.")
                break
            yield from get_job_infos(response)
        except AttributeError as e:
            print(e)
            continue
def get_job_infos(response):
    amazon_jobs = json.loads(response.text)
    for website in amazon_jobs['jobs']:
        site = website['company_name']
        title = website['title']
        location = website['normalized_location']
        job_link = 'https://www.amazon.jobs' + website['job_path']
        yield site, title, location, job_link
def main():
    # Page offsets start at 0 and increase by 10 per page (result_limit=10)
    pages = [str(i) for i in range(0, 9990, 10)]
    with open('amazon_jobs.csv', "w", newline='', encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Website", "Title", "Location", "Job URL"])
        writer.writerows(get_all_jobs(pages))

if __name__ == "__main__":
    main()
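(For context on the connection-reset question: a common stdlib-only pattern is to wrap each fetch in a retry loop with exponential backoff, so a single transient reset like 10054 doesn't kill the whole crawl. A minimal sketch, not part of the script above; `fetch_with_retry` and its `opener` hook are hypothetical names introduced here for illustration.)

```python
# Minimal sketch: retry a fetch through transient resets with exponential
# backoff. The `opener` parameter is an illustrative hook so the underlying
# fetch function can be swapped out (e.g. for testing).
import time
import urllib.request
import urllib.error

def fetch_with_retry(url, retries=5, backoff=2.0, opener=urllib.request.urlopen):
    """Attempt the request up to `retries` times, sleeping longer each time."""
    for attempt in range(retries):
        try:
            return opener(url)
        except (ConnectionResetError, urllib.error.URLError):
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            time.sleep(backoff * 2 ** attempt)  # wait 2s, 4s, 8s, ... and retry
```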
Answer 0 (score: 0)
I'm no expert on Amazon's anti-bot policies, but once they've flagged you they may flag your IP for a while, and they may limit how many requests you can make within a given time frame. Google patches for urllib so you can inspect your request headers in real time; rather than just tracking your IP/domain per time period, Amazon looks at your request headers to decide whether you're human. Compare what you're sending against a regular browser's request headers.
Just follow standard practice: keep cookies for a normal length of time, use a proper referer, and use a popular user agent. All of this can be done with the requests library (pip install requests); see its Session object.
It looks like you're sending requests to an internal Amazon URL with no referer header... that never happens in a normal browser.
Another example: keeping the cookies from one user agent and then switching to a different user agent is also not how a browser behaves.
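The advice above (persistent cookies, a real referer, a popular user agent) maps directly onto requests' Session object. A hedged sketch: the header values are illustrative examples, not values known to satisfy Amazon, and `make_browser_session` is a name introduced here.

```python
# Sketch of a browser-like requests.Session: cookies set by any response are
# re-sent automatically on later requests, and the same browser-like headers
# (Referer included) accompany every call.
import requests

def make_browser_session():
    """Build one Session that behaves like one consistent 'browser'."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/79.0.3945.88 Safari/537.36"),
        "Referer": "https://www.amazon.jobs/en/search",
        "Accept": "application/json",
    })
    return session

# Usage: session = make_browser_session()
#        response = session.get("https://www.amazon.jobs/en/search.json", params={...})
```

Because the user agent lives on the session, it stays consistent across requests, avoiding the cookie-from-one-UA, request-with-another-UA pattern described above.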