Question

I am trying to implement RequestsThrottler in my requests.get operation, which queries about 50 websites.

https://pypi.python.org/pypi/RequestsThrottler/0.2.2

Code before RequestsThrottler (works):

secondCrawlRequest = requests.get(row[6],headers=http_headers, timeout=5)
raw_html = secondCrawlRequest.text
SoupParser = BeautifulSoup(raw_html, 'html.parser')
results = SoupParser.find('div', attrs={'style':'padding-left:10px;width:98%'})

for para in results.findAll('p'):
    para_text = para.text.strip()
    list_of_paras.append(para_text)

Code with RequestsThrottler added (fails):

with BaseThrottler(name='base-throttler', delay=1.5) as bt:
     secondCrawlRequest = requests.get(row[6],headers=http_headers, timeout=5)
     reqs = [secondCrawlRequest for i in range(0, 5)]
     throttled_requests = bt.multi_submit(reqs)
     # where does responses get passed too?
     responses = [tr.response for tr in throttled_requests]
     raw_html = secondCrawlRequest.text
     SoupParser = BeautifulSoup(raw_html, 'html.parser')
     results = SoupParser.find('div', attrs={'style':'padding-left:10px;width:98%'})
for para in results.findAll('p'):
    para_text = para.text.strip()
    list_of_paras.append(para_text)

The code fails because I am not passing the 'responses' parameter correctly.

My error is:

File "/Users/helloWorld/Python Projects/Harvesters - web01_harvester.py", line 264, in for para in results.findAll('p'): NameError: name 'results' is not defined

How do I pass the 'responses' parameter correctly?

Answer 1

The throttler has to sit "above" all of the requests, so I assume there is a loop around the working but unthrottled code.

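import requests
from bs4 import BeautifulSoup
from requests_throttler import BaseThrottler

paragraphs = []
with BaseThrottler(name='base-throttler', delay=1.5) as throttler:
    # Submit unsent requests.Request objects, not the result of requests.get().
    # requests.Request() takes no timeout argument, so it is omitted here.
    throttled_requests = throttler.multi_submit(
        [
            requests.Request(method='GET', url=row[6], headers=http_headers)
            for row in rows
        ]
    )

# Each throttled request exposes its result as .response (as in the question).
for request in throttled_requests:
    soup = BeautifulSoup(request.response.text, 'html.parser')
    div = soup.find('div', attrs={'style': 'padding-left:10px;width:98%'})
    for paragraph in div.find_all('p'):
        paragraphs.append(paragraph.text.strip())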
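To make the "throttler above all requests" idea concrete, here is a minimal standard-library sketch of the same pattern. It is not how RequestsThrottler is implemented. The names fetch_throttled and fake_fetch are hypothetical, and the sleep function is injected only so the example runs offline without real pauses.

```python
import time


def fetch_throttled(urls, fetch, delay=1.5, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    responses = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        responses.append(fetch(url))
    return responses


# Hypothetical stand-in for a real HTTP call, so the sketch runs offline.
def fake_fetch(url):
    return "<p>body of %s</p>" % url


pauses = []  # records each requested pause instead of actually sleeping
pages = fetch_throttled(
    ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    fake_fetch,
    delay=1.5,
    sleep=pauses.append,
)
print(len(pages), pauses)
```

The point is that the delay logic wraps the whole batch: all URLs are collected first, one throttled loop fetches them, and parsing happens afterwards on the collected responses.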
