With BeautifulSoup

Time: 2016-03-24 19:18:38

Tags: python beautifulsoup python-requests

I am trying to add RequestsThrottler to my requests.get calls, which query about 50 websites.

https://pypi.python.org/pypi/RequestsThrottler/0.2.2

Code before adding RequestsThrottler (works):

secondCrawlRequest = requests.get(row[6],headers=http_headers, timeout=5)
raw_html = secondCrawlRequest.text
SoupParser = BeautifulSoup(raw_html, 'html.parser')
results = SoupParser.find('div', attrs={'style':'padding-left:10px;width:98%'})

for para in results.findAll('p'):
    para_text = para.text.strip()
    list_of_paras.append(para_text)

Code with RequestsThrottler added (fails):

with BaseThrottler(name='base-throttler', delay=1.5) as bt:
     secondCrawlRequest = requests.get(row[6],headers=http_headers, timeout=5)
     reqs = [secondCrawlRequest for i in range(0, 5)]
     throttled_requests = bt.multi_submit(reqs)
     # where does responses get passed too?
     responses = [tr.response for tr in throttled_requests]
     raw_html = secondCrawlRequest.text
     SoupParser = BeautifulSoup(raw_html, 'html.parser')
     results = SoupParser.find('div', attrs={'style':'padding-left:10px;width:98%'})
for para in results.findAll('p'):
    para_text = para.text.strip()
    list_of_paras.append(para_text)

The code fails because I am not passing the 'responses' variable on correctly.

My error is:

File "/Users/helloWorld/Python Projects/Harvesters - web01_harvester.py", line 264, in
    for result in findAll('p'):
NameError: name 'results' is not defined

How do I pass the 'responses' variable on correctly?

1 Answer:

Answer 0: (score: 0)

The throttler has to sit "above" all of the requests, so I assume there is a loop around the code that works but is not throttled.

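import requests
from bs4 import BeautifulSoup
from requests_throttler import BaseThrottler

# rows and http_headers come from the surrounding (unshown) crawler code.
paragraphs = []
with BaseThrottler(name='base-throttler', delay=1.5) as throttler:
    # Submit unsent requests.Request objects; the throttler sends them itself.
    # requests.Request takes the HTTP method first and has no timeout parameter.
    throttled_requests = throttler.multi_submit(
        [
            requests.Request(method='GET', url=row[6], headers=http_headers)
            for row in rows
        ]
    )

# Parse each response once the throttler has delivered it.
for request in throttled_requests:
    soup = BeautifulSoup(request.get_response().text, 'html.parser')
    div = soup.find('div', attrs={'style': 'padding-left:10px;width:98%'})
    for paragraph in div.find_all('p'):
        paragraphs.append(paragraph.text.strip())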
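To illustrate why the throttler has to wrap the whole batch rather than a single call, here is a minimal standard-library sketch of the same idea. It does not use RequestsThrottler and is not how that library is implemented; `throttled_map` and `fake_fetch` are hypothetical names, and `fake_fetch` only stands in for a real network request.

```python
import time


def throttled_map(func, items, delay):
    """Call func on each item, leaving at least `delay` seconds between call starts."""
    results = []
    last_start = None
    for item in items:
        if last_start is not None:
            # Sleep only for whatever part of the delay has not already elapsed.
            remaining = delay - (time.monotonic() - last_start)
            if remaining > 0:
                time.sleep(remaining)
        last_start = time.monotonic()
        results.append(func(item))
    return results


def fake_fetch(url):
    # Hypothetical stand-in for a network call; returns a fake HTML body.
    return "<p>content of {}</p>".format(url)


urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
start = time.monotonic()
pages = throttled_map(fake_fetch, urls, delay=0.2)
elapsed = time.monotonic() - start
```

The delay is tracked across the loop, so it spaces out every request in the batch. Creating a new throttler (or a new delay timer) inside the loop for each URL would reset that state and throttle nothing.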