Keep getting connection reset error 10054 when scraping Amazon Jobs results

Time: 2019-04-11 00:10:19

Tags: python web-scraping

As you can probably tell from looking at the code, I'm still new to Python.

I'm scraping Amazon Jobs search results, but I keep getting connection reset error 10054 after making around 50 requests to the URL. I added the Crawlera proxy network to prevent getting banned, but it's still not working. I know the URL is long, but it seems to work without having to add too many other separate parts to it. The results total roughly 12,000 jobs at 10 jobs per page, so I don't even know whether I've scraped that much data to begin with. Amazon shows each page in the URL as "result_limit=10", so I've been going through the pages 10 at a time rather than 1 page per request. Not sure whether that's correct. Also, the last page stops at 9,990.

The code works, but I'm not sure how to get past the connection error. As you can see, I've added things like a user agent, but I'm not sure whether it actually does anything. Any help would be appreciated, as I've been stuck on this for countless hours. Thanks!

import csv
import json
from datetime import datetime
from random import randint
from time import sleep, time
from warnings import warn

from fake_useragent import UserAgent
from IPython.display import clear_output
from requests import get


def get_all_jobs(pages):
    requests = 0
    start_time = time()
    total_runtime = datetime.now()

    for page in pages:
        try:
            ua = UserAgent()
            header = {
                'User-Agent': ua.random
            }
            response = get('https://www.amazon.jobs/en/search.json?base_query=&city=&country=USA&county=&'
                           'facets%5B%5D=location&facets%5B%5D=business_category&facets%5B%5D=category&'
                           'facets%5B%5D=schedule_type_id&facets%5B%5D=employee_class&facets%5B%5D=normalized_location'
                           '&facets%5B%5D=job_function_id&job_function_id%5B%5D=job_function_corporate_80rdb4&'
                           'latitude=&loc_group_id=&loc_query=USA&longitude=&'
                           'normalized_location%5B%5D=Seattle%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=San+Francisco'
                           '%2C+California%2C+USA&normalized_location%5B%5D=Sunnyvale%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Bellevue%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=East+Palo+Alto%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Santa+Monica%2C+California%2C+USA&offset={}&query_options=&'
                           'radius=24km&region=&result_limit=10&schedule_type_id%5B%5D=Full-Time&'
                           'sort=relevant'.format(page),
                           headers=header,
                           proxies={
                               "http": "http://1ea01axxxxxxxxxxxxxxxxxxx:@proxy.crawlera.com:8010/"
                           }
                           )
            # Monitor the frequency of requests
            requests += 1

            # Pause the loop for between 8 and 15 seconds
            sleep(randint(8, 15))
            current_time = time()
            elapsed_time = current_time - start_time
            print("Amazon Request: {}; Frequency: {} requests/s; Total Run Time: {}".format(
                requests, requests / elapsed_time, datetime.now() - total_runtime))
            clear_output(wait=True)

            # Throw a warning for non-200 status codes
            if response.status_code != 200:
                warn("Request: {}; Status code: {}".format(requests, response.status_code))

            # Break the loop if the number of requests is greater than expected
            if requests > 999:
                warn("Number of requests was greater than expected.")
                break

            yield from get_job_infos(response)

        except AttributeError as e:
            print(e)
            continue


def get_job_infos(response):
    # Parse the JSON payload and yield one row per job listing.
    amazon_jobs = json.loads(response.text)

    for website in amazon_jobs['jobs']:
        site = website['company_name']
        title = website['title']
        location = website['normalized_location']
        job_link = 'https://www.amazon.jobs' + website['job_path']
        yield site, title, location, job_link


def main():
    # Offsets start at 0 and step by 10 per page; the stop of 9990 is exclusive, so the last offset is 9980.
    pages = [str(i) for i in range(0, 9990, 10)]

    with open('amazon_jobs.csv', "w", newline='', encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Website", "Title", "Location", "Job URL"])
        writer.writerows(get_all_jobs(pages))


if __name__ == "__main__":
    main()

1 Answer:

Answer 0 (score: 0)

I'm no expert on Amazon's anti-bot policies, but if they have flagged you once, they may flag your IP for a while, and they may limit how many requests you can make per IP/domain within a certain time frame. Google the patch for urllib that lets you view your request headers in real time: Amazon looks at your request headers to decide whether or not you're human. Compare what you're sending with the request headers of a regular browser.
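For instance, instead of patching urllib, the requests library the question already uses exposes the headers that were actually sent, via the PreparedRequest attached to the response. A minimal sketch:

import requests

# Make any request and inspect what actually went out on the wire.
response = requests.get('https://www.amazon.jobs/en/search')

# response.request is the PreparedRequest that was sent; its headers
# show exactly what the server saw.
for name, value in response.request.headers.items():
    print(name, ':', value)

# Compare this against the Network tab of a real browser's developer
# tools - missing Accept, Accept-Language, or Referer headers are
# easy bot giveaways.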

Just standard practice: keep cookies around for a normal length of time, use a proper referer, and use a popular user agent. All of this can be done with the requests library (pip install requests); see its Session objects.
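A minimal sketch of that advice, assuming the same amazon.jobs endpoint as the question (the Referer value, User-Agent string, and warm-up request are illustrative guesses, not verified requirements):

import requests

# A Session persists cookies across requests, the way a browser does.
session = requests.Session()

# Set browser-like headers once for the whole session: one popular,
# consistent User-Agent plus a plausible Referer.
session.headers.update({
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/73.0.3683.86 Safari/537.36'),
    'Referer': 'https://www.amazon.jobs/en/search',
    'Accept-Language': 'en-US,en;q=0.9',
})

# Visit the regular search page first so the server can set cookies,
# then hit the JSON endpoint - closer to what a browser actually does.
session.get('https://www.amazon.jobs/en/search')
response = session.get('https://www.amazon.jobs/en/search.json',
                       params={'offset': 0, 'result_limit': 10})
print(response.status_code)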

It looks like you're sending requests to an internal Amazon URL without a referer header.... that never happens in a normal browser.

Another example: keeping the cookies from one user agent and then switching to a different user agent is also not something a browser does.
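Applied to the question's code, that would mean choosing one random user agent up front and reusing it for the whole session, instead of calling ua.random on every iteration. A sketch, assuming the same fake_useragent library as the question:

import requests
from fake_useragent import UserAgent

session = requests.Session()

# Choose a single user agent for the whole session. Rotating user
# agents while carrying the same cookies is a pattern no browser
# ever produces.
session.headers['User-Agent'] = UserAgent().random

for offset in range(0, 30, 10):  # a few pages, for illustration
    response = session.get('https://www.amazon.jobs/en/search.json',
                           params={'offset': offset, 'result_limit': 10})
    print(offset, response.status_code)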