Python: "Max retries exceeded with url" when scraping a site

Asked: 2016-02-05 02:00:17

Tags: python beautifulsoup python-requests

I want to scrape a site (~3000 pages) once a day. My request code is:

import requests
from bs4 import BeautifulSoup

#Return soup of a page of Carsales ads, plus total page count and current page number
def getsoup(pagelimit,offset):
    url = "http://www.carsales.com.au/cars/results?q=%28Service%3D%5BCarsales%5D%26%28%28%28SiloType%3D%5BDemo%20and%20near%20new%20cars%5D%7CSiloType%3D%5BDealer%20used%20cars%5D%29%7CSiloType%3D%5BDemo%20and%20near%20new%20cars%5D%29%7CSiloType%3D%5BPrivate%20seller%20cars%5D%29%29&cpw=1&sortby={0}&limit={1}&offset={2}".format('Price',str(pagelimit),str(offset))
    #Sortby options: LastUpdated,Price
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.text, "html5lib") #"html.parser"
    #Parse "Page X of Y" out of the pagination div
    totalpages = int(soup.find("div",  class_="pagination").text.split(' of ',1)[1].split('\n', 1)[0])
    currentpage = int(soup.find("div",  class_="pagination").text.split('Page ',1)[1].split(' of', 1)[0])
    return (soup, totalpages, currentpage)

adscrape = []
#Run through all search result pages, appending ads to adscrape
#(headers, pagelimit, offset, currentpage and totalpages are initialised earlier in the script)
while currentpage < totalpages:
    soup, totalpages, currentpage = getsoup(pagelimit,offset)
    print 'Page: {0} of {1}. Offset is {2}.'.format(currentpage,totalpages,offset)
    adscrape.extend(getpageads(soup,offset))
    offset = offset+pagelimit
    # sleep(1)

I have previously run this successfully without a sleep() call to throttle the requests. Now, however, it fails partway through execution with the error below, regardless of whether the sleep(1) is active in the code:

...    
Page: 1523 of 2956. Offset is 91320.
Page: 1524 of 2966. Offset is 91380.
Page: 1525 of 2956. Offset is 91440.
Traceback (most recent call last):
  File "D:\Google Drive\pythoning\carsales\carsales_v2.py", line 82, in <module>
    soup, totalpages, currentpage = getsoup(pagelimit,offset)
  File "D:\Google Drive\pythoning\carsales\carsales_v2.py", line 28, in getsoup
    r = requests.get(url, headers)
  File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python27\lib\site-packages\requests\adapters.py", line 423, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.carsales.com.au', port=80): Max retries exceeded with url: /cars/results?q=%28Service%3D%5BCarsales%5D%26%28%28%28SiloType%3D%5BDemo%20and%20near%20new%20cars%5D%7CSiloType%3D%5BDealer%20used%20cars%5D%29%7CSiloType%3D%5BDemo%20and%20near%20new%20cars%5D%29%7CSiloType%3D%5BPrivate%20seller%20cars%5D%29%29&cpw=1&sortby=Price&limit=60&offset=91500&user-agent=Mozilla%2F5.0 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x000000001A097710>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
<<< Process finished. (Exit code 1)

I assume this happens because I'm sending the server too many requests in a short period. If so, how can I avoid it without making the script take hours to run? What is the normal practice when scraping a site to deal with this?
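For reference, this is the kind of retry-with-backoff approach I'm considering (just a sketch, assuming that mounting urllib3's Retry on a requests Session is the right tool here; get_with_retry is a helper name I made up, and the retry counts, backoff_factor and timeouts are guesses):

import time
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

# Session that retries failed requests with exponential backoff
# (values below are untested guesses, not tuned for carsales.com.au)
session = requests.Session()
retries = Retry(total=5,                 # up to 5 attempts per request
                backoff_factor=2,        # wait 0s, 2s, 4s, 8s... between retries
                status_forcelist=[500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))

def get_with_retry(url, headers, attempts=3):
    # Extra layer on top of the adapter: if the connection still drops
    # (e.g. a DNS failure like the getaddrinfo error in the traceback),
    # wait and retry a few times instead of crashing the whole scrape.
    for attempt in range(attempts):
        try:
            return session.get(url, headers=headers, timeout=30)
        except requests.exceptions.ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(30 * (attempt + 1))

getsoup() would then call get_with_retry(url, headers) in place of requests.get(url, headers). Is this roughly the standard approach, and would I still want a small sleep() between pages on top of it to be polite to the server?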

0 Answers:

No answers yet.