Keep getting connection reset error 10054 when scraping Amazon Jobs results

Time: 2019-04-11 00:10:19

Tags: python web-scraping

As you can probably tell from looking at the code, I'm still new to Python.

I'm scraping Amazon Jobs search results, but I keep getting connection reset error 10054 after making around 50 requests to the URL. I added the Crawlera proxy network to prevent getting banned, but it's still not working. I know the URL is long, but it seems to work without having to add too many other separate parts to it. The results total roughly 12,000 jobs at 10 jobs per page, so I don't even know whether I've scraped that much data to begin with. Amazon shows each page in the URL as "result_limit=10", so I've been going through the pages 10 at a time rather than 1 page per request. Not sure whether that's correct. Also, the last page stops at 9,990.

The code works, but I'm not sure how to get past the connection error. As you can see, I've added things like a user agent, but I'm not sure whether it actually does anything. Any help would be appreciated, as I've been stuck on this for countless hours. Thanks!

import csv
import json
from datetime import datetime
from random import randint
from time import sleep, time
from warnings import warn

from fake_useragent import UserAgent
from IPython.display import clear_output
from requests import get


def get_all_jobs(pages):
    requests = 0
    start_time = time()
    total_runtime = datetime.now()

    for page in pages:
        try:
            ua = UserAgent()
            header = {
                'User-Agent': ua.random
            }
            response = get('https://www.amazon.jobs/en/search.json?base_query=&city=&country=USA&county=&'
                           'facets%5B%5D=location&facets%5B%5D=business_category&facets%5B%5D=category&'
                           'facets%5B%5D=schedule_type_id&facets%5B%5D=employee_class&facets%5B%5D=normalized_location'
                           '&facets%5B%5D=job_function_id&job_function_id%5B%5D=job_function_corporate_80rdb4&'
                           'latitude=&loc_group_id=&loc_query=USA&longitude=&'
                           'normalized_location%5B%5D=Seattle%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=San+Francisco'
                           '%2C+California%2C+USA&normalized_location%5B%5D=Sunnyvale%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Bellevue%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=East+Palo+Alto%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Santa+Monica%2C+California%2C+USA&offset={}&query_options=&'
                           'radius=24km&region=&result_limit=10&schedule_type_id%5B%5D=Full-Time&'
                           'sort=relevant'.format(page),
                           headers=header,
                           proxies={
                               "http": "http://1ea01axxxxxxxxxxxxxxxxxxx:@proxy.crawlera.com:8010/"
                           }
                           )
            # Monitor the frequency of requests
            requests += 1

            # Pause the loop for between 8 and 15 seconds
            sleep(randint(8, 15))
            current_time = time()
            elapsed_time = current_time - start_time
            print("Amazon Request: {}; Frequency: {} requests/s; Total Run Time: {}".format(
                requests, requests / elapsed_time, datetime.now() - total_runtime))
            clear_output(wait=True)

            # Throw a warning for non-200 status codes
            if response.status_code != 200:
                warn("Request: {}; Status code: {}".format(requests, response.status_code))

            # Break the loop if the number of requests is greater than expected
            if requests > 999:
                warn("Number of requests was greater than expected.")
                break

            yield from get_job_infos(response)

        except AttributeError as e:
            print(e)
            continue


def get_job_infos(response):
    # Parse the JSON payload and yield one row per job listing.
    amazon_jobs = json.loads(response.text)

    for website in amazon_jobs['jobs']:
        site = website['company_name']
        title = website['title']
        location = website['normalized_location']
        job_link = 'https://www.amazon.jobs' + website['job_path']
        yield site, title, location, job_link


def main():
    # Offsets start at 0 and step by 10 per page; the stop of 9990 is exclusive, so the last offset is 9980.
    pages = [str(i) for i in range(0, 9990, 10)]

    with open('amazon_jobs.csv', "w", newline='', encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Website", "Title", "Location", "Job URL"])
        writer.writerows(get_all_jobs(pages))


if __name__ == "__main__":
    main()

1 Answer:

Answer 0 (score: 0)

I'm no expert on Amazon's anti-bot policies, but if they have flagged you once, they may flag your IP for a while, and they may limit how many requests you can make per IP/domain within a certain time frame. Google the patch for urllib that lets you view your request headers in real time: Amazon looks at your request headers to decide whether or not you're human. Compare what you're sending with the request headers of a regular browser.
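For instance, instead of patching urllib, the requests library the question already uses exposes the headers that were actually sent, via the PreparedRequest attached to the response. A minimal sketch:

import requests

# Make any request and inspect what actually went out on the wire.
response = requests.get('https://www.amazon.jobs/en/search')

# response.request is the PreparedRequest that was sent; its headers
# show exactly what the server saw.
for name, value in response.request.headers.items():
    print(name, ':', value)

# Compare this against the Network tab of a real browser's developer
# tools - missing Accept, Accept-Language, or Referer headers are
# easy bot giveaways.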

Just standard practice: keep cookies around for a normal length of time, use a proper referer, and use a popular user agent. All of this can be done with the requests library (pip install requests); see its Session objects.
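A minimal sketch of that advice, assuming the same amazon.jobs endpoint as the question (the Referer value, User-Agent string, and warm-up request are illustrative guesses, not verified requirements):

import requests

# A Session persists cookies across requests, the way a browser does.
session = requests.Session()

# Set browser-like headers once for the whole session: one popular,
# consistent User-Agent plus a plausible Referer.
session.headers.update({
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/73.0.3683.86 Safari/537.36'),
    'Referer': 'https://www.amazon.jobs/en/search',
    'Accept-Language': 'en-US,en;q=0.9',
})

# Visit the regular search page first so the server can set cookies,
# then hit the JSON endpoint - closer to what a browser actually does.
session.get('https://www.amazon.jobs/en/search')
response = session.get('https://www.amazon.jobs/en/search.json',
                       params={'offset': 0, 'result_limit': 10})
print(response.status_code)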

It looks like you're sending requests to an internal Amazon URL without a referer header.... that never happens in a normal browser.

Another example: keeping the cookies from one user agent and then switching to a different user agent is also not something a browser does.
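Applied to the question's code, that would mean choosing one random user agent up front and reusing it for the whole session, instead of calling ua.random on every iteration. A sketch, assuming the same fake_useragent library as the question:

import requests
from fake_useragent import UserAgent

session = requests.Session()

# Choose a single user agent for the whole session. Rotating user
# agents while carrying the same cookies is a pattern no browser
# ever produces.
session.headers['User-Agent'] = UserAgent().random

for offset in range(0, 30, 10):  # a few pages, for illustration
    response = session.get('https://www.amazon.jobs/en/search.json',
                           params={'offset': offset, 'result_limit': 10})
    print(offset, response.status_code)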