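A simple web scraper I wrote a few weeks ago keeps failing with the following error: HTTP Error 429: Too Many Requests. The code is meant to read its input from an Excel file, then search for and download PDFs online. I'm not very familiar with requests, but I slowed down the request rate to see how many requests it could handle. That seems to make no difference: whether I set the delay to 5 seconds or 20 seconds, the code gets through a similar number of inputs (about 30) before failing. This is the error message that keeps appearing: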
Traceback (most recent call last):
  File "D:\Python\New folder\Web Scraper.py", line 17, in <module>
    for url in search(searchquery, stop=1, pause=2):
  File "D:\Python\lib\site-packages\google-2.0.2-py3.7.egg\googlesearch\__init__.py", line 288, in search
    html = get_page(url, user_agent)
  File "D:\Python\lib\site-packages\google-2.0.2-py3.7.egg\googlesearch\__init__.py", line 154, in get_page
    response = urlopen(request)
  File "D:\Python\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "D:\Python\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "D:\Python\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "D:\Python\lib\urllib\request.py", line 563, in error
    result = self._call_chain(*args)
  File "D:\Python\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "D:\Python\lib\urllib\request.py", line 755, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "D:\Python\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "D:\Python\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "D:\Python\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "D:\Python\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "D:\Python\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 429: Too Many Requests
Here is the code I wrote:
import xlrd, requests
from googlesearch import search
from time import sleep

# Excel location
xlloc = "D:/VesselBase.xlsx"
ws = xlrd.open_workbook(xlloc)
# Sheet name/index
sheet = ws.sheet_by_index(0)
sheet.cell_value(0, 0)

for i in range(sheet.nrows):
    # Which column/row to choose, 2nd column for vessels. 0=A/1.
    vesselname = sheet.cell_value(i, 1)
    vesselimo = sheet.cell_value(i, 0)
    searchquery = 'Vessel specification information "%s" OR "%s" filetype:pdf' % (vesselname, vesselimo)
    print('Searching "%s"' % searchquery)
    for url in search(searchquery, stop=1, pause=20):
        print('Searched for %s' % vesselname)
        print('Found %s' % url)
        # Where to save
        with open('D:/Newfolder/%s.pdf' % vesselname, 'wb') as f:
            f.write(requests.get(url).content)
        print('Saved %s' % vesselname)
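
Since a fixed pause does not seem to change how many queries get through, one thing that could be tried is catching the 429 and backing off for progressively longer periods before retrying. The following is only a minimal sketch, not a confirmed fix: the helper name call_with_backoff and its parameters (max_retries, base_delay, sleep) are hypothetical, the 60-second starting delay is a guess, and there is no guarantee that the search service lifts the block after any particular wait.

```python
import time
import urllib.error


def call_with_backoff(func, max_retries=5, base_delay=60, sleep=time.sleep):
    """Call func(); on HTTP 429, wait and retry with a doubling delay.

    Hypothetical helper: name, defaults, and delays are assumptions.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a rate-limit response,
            # and give up once the retry budget is spent.
            if err.code != 429 or attempt == max_retries:
                raise
            # Prefer the server's Retry-After value (in seconds) if it sent one.
            retry_after = err.headers.get("Retry-After") if err.headers else None
            if retry_after and retry_after.isdigit():
                delay = int(retry_after)
            else:
                # Otherwise double the wait on every failed attempt.
                delay = base_delay * 2 ** attempt
            sleep(delay)
```

In the script above, the search call could be wrapped as call_with_backoff(lambda: list(search(searchquery, stop=1, pause=20))). The list() matters because search is used as an iterator in the loop, so the HTTP request would otherwise only happen after the helper had already returned.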