I am trying to use RequestsThrottler with my requests.get calls, which query about 50 websites.
https://pypi.python.org/pypi/RequestsThrottler/0.2.2
Code before RequestsThrottler (works):
    secondCrawlRequest = requests.get(row[6], headers=http_headers, timeout=5)
    raw_html = secondCrawlRequest.text
    SoupParser = BeautifulSoup(raw_html, 'html.parser')
    results = SoupParser.find('div', attrs={'style': 'padding-left:10px;width:98%'})
    for para in results.findAll('p'):
        para_text = para.text.strip()
        list_of_paras.append(para_text)
Code with RequestsThrottler added (fails):
    with BaseThrottler(name='base-throttler', delay=1.5) as bt:
        secondCrawlRequest = requests.get(row[6], headers=http_headers, timeout=5)
        reqs = [secondCrawlRequest for i in range(0, 5)]
        throttled_requests = bt.multi_submit(reqs)
    # where do the responses get passed to?
    responses = [tr.response for tr in throttled_requests]
    raw_html = secondCrawlRequest.text
    SoupParser = BeautifulSoup(raw_html, 'html.parser')
    results = SoupParser.find('div', attrs={'style': 'padding-left:10px;width:98%'})
    for para in results.findAll('p'):
        para_text = para.text.strip()
        list_of_paras.append(para_text)
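The code fails because I am not passing the 'responses' parameter correctly.
My error is:
    File "/Users/helloWorld/Python Projects/Harvesters - web01_harvester.py", line 264, in
        for para in results.findAll('p'):
    NameError: name 'results' is not defined
How do I pass the 'responses' parameter correctly?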
Answer 0 (score: 0)
The throttler has to sit "above" all of the requests, so I assume there is a loop around the code that works but is not throttled.
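The reason is that `requests.get(...)` performs the HTTP call immediately and returns a finished response, so by the time it reaches the throttler there is nothing left to delay. A throttler needs work that has not been sent yet. The following is a standard-library sketch of that idea only, not RequestsThrottler's implementation; `run_throttled` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real network call.

```python
import time


def run_throttled(tasks, delay, sleep=time.sleep, clock=time.monotonic):
    """Call each zero-argument task in order, starting them at least `delay` seconds apart."""
    results = []
    last_start = None
    for task in tasks:
        if last_start is not None:
            remaining = delay - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        results.append(task())
    return results


calls = []


def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url): records the call, no network access.
    calls.append(url)
    return "body of " + url


urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

# Each task is unsent work: nothing is fetched until the throttler calls it.
tasks = [lambda url=url: fake_fetch(url) for url in urls]
assert calls == []

bodies = run_throttled(tasks, delay=0.01)
print(bodies)
```

With RequestsThrottler itself, the equivalent is to submit unsent `requests.Request` objects for all rows at once: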
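    import requests
    from bs4 import BeautifulSoup
    from requests_throttler import BaseThrottler

    # `rows` and `http_headers` come from the surrounding code in the question.
    paragraphs = list()
    with BaseThrottler(delay=1.5) as throttler:
        throttled_requests = throttler.multi_submit(
            [
                # Unsent request objects; requests.Request takes a method and
                # a URL but no timeout argument, so the timeout is dropped here.
                requests.Request(method='GET', url=row[6], headers=http_headers)
                for row in rows
            ]
        )
    for request in throttled_requests:
        soup = BeautifulSoup(request.get_response().text, 'html.parser')
        div = soup.find('div', attrs={'style': 'padding-left:10px;width:98%'})
        if div is None:
            # Page does not contain the expected container; skip it.
            continue
        for paragraph in div.find_all('p'):
            paragraphs.append(paragraph.text.strip())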
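When looping over many sites, the extraction step is easier to test if it lives in a function that returns an empty list when the target div is missing, instead of raising. Below is a minimal sketch of the same extraction using only the standard library's `html.parser` rather than BeautifulSoup, so it runs without extra packages; `ParagraphCollector` and `extract_paragraphs` are hypothetical names, and the exact-match comparison on the `style` attribute is an assumption carried over from the code above.

```python
from html.parser import HTMLParser

TARGET_STYLE = 'padding-left:10px;width:98%'


class ParagraphCollector(HTMLParser):
    """Collect the stripped text of every <p> inside the first-matching styled <div>."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # greater than 0 while inside the target div
        self.in_p = False
        self.buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            if self.div_depth:
                self.div_depth += 1          # nested div inside the target
            elif dict(attrs).get('style') == TARGET_STYLE:
                self.div_depth = 1           # entering the target div
        elif tag == 'p' and self.div_depth:
            self.in_p = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth:
            self.div_depth -= 1
        elif tag == 'p' and self.in_p:
            self.paragraphs.append(''.join(self.buffer).strip())
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)


def extract_paragraphs(html):
    """Return paragraph texts from the target div, or [] if the div is absent."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = ('<div style="padding-left:10px;width:98%">'
          '<p> first </p><div><p>second</p></div></div><p>outside</p>')
print(extract_paragraphs(sample))
```

Each response body from the throttled requests could then be passed through `extract_paragraphs`, and pages without the expected div simply contribute nothing.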