How do I insert the Backoff script into my web scraper?

Time: 2019-06-10 19:35:06

Tags: web-scraping exponential-backoff retrying

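I want to use the "backoff" package in my web scraper, but I can't get it to work. Where do I insert it, and how do I make sure "r = requests..." is still recognized?

I have tried putting the statement into my code in various ways, but it doesn't work. I'd like to be able to use the package for its intended purpose. Thanks!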

Code to insert:

@backoff.on_exception(backoff.expo,
                      requests.exceptions.RequestException,
                      max_time=60)

def get_url(what goes here?):
    return requests.get(what goes here?)

Existing code:

import os
import requests
import re
import backoff
from bs4 import BeautifulSoup  # used below to parse r.text

asin_list = ['B079QHML21']
urls = []
print('Scrape Started')
for asin in asin_list:
  product_url = f'https://www.amazon.com/dp/{asin}'
  urls.append(product_url)
  base_search_url = 'https://www.amazon.com'
  scraper_url = 'http://api.scraperapi.com'

  while len(urls) > 0:
    url = urls.pop(0)
    payload = {key, url}  #--specific parameters
    r = requests.get(scraper_url, params=payload)
    print("we got a {} response code from {}".format(r.status_code, url))
    soup = BeautifulSoup(r.text, 'lxml')

    #Scraping Below#

I expect the "backoff" code to work as designed: retry on HTTP 500 errors instead of failing.

1 answer:

Answer 0: (score: 0)

Instead of calling this directly:

requests.get(scraper_url, params=payload)

change get_url to make that call, and then call get_url:

@backoff.on_exception(backoff.expo,
                      requests.exceptions.RequestException,
                      max_time=60)
def get_url(scraper_url, payload):
    return requests.get(scraper_url, params=payload)

Then, in your code, instead of

r = requests.get(scraper_url, params=payload)

do this:

r = get_url(scraper_url, payload)