I want to scrape a website (~3000 pages) once a day. My request code is:
import requests
from bs4 import BeautifulSoup
from time import sleep

headers = {'user-agent': 'Mozilla/5.0'}  # UA string as it appears in the failing URL below
pagelimit = 60  # results per page (limit=60 in the failing URL)
offset = 0
currentpage, totalpages = 0, 1  # seed values so the loop body runs at least once

# Return soup of a page of Carsales ads, plus pagination info
def getsoup(pagelimit, offset):
    # Sortby options: LastUpdated, Price
    url = "http://www.carsales.com.au/cars/results?q=%28Service%3D%5BCarsales%5D%26%28%28%28SiloType%3D%5BDemo%20and%20near%20new%20cars%5D%7CSiloType%3D%5BDealer%20used%20cars%5D%29%7CSiloType%3D%5BDemo%20and%20near%20new%20cars%5D%29%7CSiloType%3D%5BPrivate%20seller%20cars%5D%29%29&cpw=1&sortby={0}&limit={1}&offset={2}".format('Price', pagelimit, offset)
    # headers must be passed as a keyword argument: requests.get(url, headers) binds it to
    # params, which is why 'user-agent' shows up in the query string of the failing URL below
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, "html5lib")  # or "html.parser"
    pagination = soup.find("div", class_="pagination").text
    totalpages = int(pagination.split(' of ', 1)[1].split('\n', 1)[0])
    currentpage = int(pagination.split('Page ', 1)[1].split(' of', 1)[0])
    return (soup, totalpages, currentpage)

adscrape = []
# Run through all search result pages, appending ads to adscrape
while currentpage < totalpages:
    soup, totalpages, currentpage = getsoup(pagelimit, offset)
    print 'Page: {0} of {1}. Offset is {2}.'.format(currentpage, totalpages, offset)
    adscrape.extend(getpageads(soup, offset))  # getpageads is defined elsewhere in my script
    offset = offset + pagelimit
    # sleep(1)
I have run this successfully before without any sleep() call to throttle it. Now, though, it errors out partway through the run, and it does this whether or not sleep(1) is active in the code:
...
Page: 1523 of 2956. Offset is 91320.
Page: 1524 of 2966. Offset is 91380.
Page: 1525 of 2956. Offset is 91440.
Traceback (most recent call last):
  File "D:\Google Drive\pythoning\carsales\carsales_v2.py", line 82, in <module>
    soup, totalpages, currentpage = getsoup(pagelimit,offset)
  File "D:\Google Drive\pythoning\carsales\carsales_v2.py", line 28, in getsoup
    r = requests.get(url, headers)
  File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python27\lib\site-packages\requests\adapters.py", line 423, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.carsales.com.au', port=80): Max retries exceeded with url: /cars/results?q=%28Service%3D%5BCarsales%5D%26%28%28%28SiloType%3D%5BDemo%20and%20near%20new%20cars%5D%7CSiloType%3D%5BDealer%20used%20cars%5D%29%7CSiloType%3D%5BDemo%20and%20near%20new%20cars%5D%29%7CSiloType%3D%5BPrivate%20seller%20cars%5D%29%29&cpw=1&sortby=Price&limit=60&offset=91500&user-agent=Mozilla%2F5.0 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x000000001A097710>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
<<< Process finished. (Exit code 1)
I assume this is because I'm hitting the server with too many requests in a short time. If so, how can I avoid it without making the script take hours to run? What is the normal practice when scraping a site like this?
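
In case it helps frame the question: the approach I'm considering is to wrap each request in a retry with exponential backoff, so a single dropped connection or failed DNS lookup doesn't kill a run that's already 1500 pages in. A rough sketch of what I mean (fetch_with_retries is my own placeholder name, and the retry count, backoff base and timeout are just guesses):

import requests
from time import sleep

def fetch_with_retries(url, headers, max_retries=5, backoff=2):
    # Retry a GET on connection errors, waiting backoff**attempt seconds
    # between tries (1, 2, 4, 8, 16s); all these numbers are placeholders
    for attempt in range(max_retries):
        try:
            return requests.get(url, headers=headers, timeout=30)
        except requests.exceptions.ConnectionError as e:
            wait = backoff ** attempt
            print 'Connection failed ({0}), retrying in {1}s...'.format(e, wait)
            sleep(wait)
    raise requests.exceptions.ConnectionError(
        'Gave up on {0} after {1} attempts'.format(url, max_retries))

getsoup would then call fetch_with_retries(url, headers) instead of requests.get directly. Is something like this the standard approach, or is there a better way?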