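I've written a script in Python using multiprocessing.pool.ThreadPool to handle multiple requests at the same time and make the scraping process more robust. The parser is doing its job well.
As I've noticed in several scripts, a scraper built with multiprocessing should have a delay in the scraping process, so I would like to put a delay in the script below.
However, this is where I'm stuck: I can't find the right place to put that delay.
This is my script so far: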
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool

url = "http://srar.com/roster/index.php?agent_search=a"

def get_links(link):
    completelinks = []
    res = requests.get(link)
    soup = BeautifulSoup(res.text,'lxml')
    for items in soup.select("table.border tr"):
        if not items.select("td a[href^='index.php?agent']"):continue
        data = [urljoin(link,item.get("href")) for item in items.select("td a[href^='index.php?agent']")]
        completelinks.extend(data)
    return completelinks

def get_info(nlink):
    req = requests.get(nlink)
    sauce = BeautifulSoup(req.text,"lxml")
    for tr in sauce.select("table[style$='1px;'] tr")[1:]:
        table = [td.get_text(strip=True) for td in tr.select("td")]
        print(table)

if __name__ == '__main__':
    ThreadPool(20).map(get_info, get_links(url))
Once again: I only need to know the right place in the script to put the delay.
Answer 0 (score: 1)
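To put a delay between the multiple requests.get() calls made inside get_info, you have to extend get_info with a delay argument that is used as the input for a time.sleep() call. Since all worker threads start at once, the delays must be cumulative across calls. That means, if you want 0.5 seconds between requests.get() calls, the list of delays you pass to the pool method looks like [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, ...].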
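As a minimal sketch of how such a cumulative list can be built: the helper name cumulative_delays and its parameters are illustrative assumptions, not part of the original script.

```python
def cumulative_delays(n, delay=0.5):
    """Return n start offsets in seconds, each `delay` later than the previous one."""
    # Hypothetical helper: task i waits i * delay seconds before its request.
    return [i * delay for i in range(n)]


if __name__ == '__main__':
    print(cumulative_delays(6))  # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
```

Each offset is then paired with one link, so the first task starts immediately and every later task starts one step after the previous one.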
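To avoid having to change get_info itself, I'm using a decorator in the example below to extend get_info with a delay parameter and a time.sleep(delay) call. Note that I'm passing the delays alongside the other argument for get_info in the pool.starmap call.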
import logging
import time
from multiprocessing.pool import ThreadPool
from functools import wraps

def delayed(func):
    @wraps(func)
    def wrapper(delay, *args, **kwargs):
        time.sleep(delay)  # <--
        return func(*args, **kwargs)
    return wrapper

@delayed
def get_info(nlink):
    info = nlink + '_info'
    logger.info(msg=info)
    return info

def get_links(n):
    return [f'link{i}' for i in range(n)]

def init_logging(level=logging.DEBUG):
    fmt = '[%(asctime)s %(levelname)-8s %(threadName)s' \
          ' %(funcName)s()] --- %(message)s'
    logging.basicConfig(format=fmt, level=level)

if __name__ == '__main__':
    DELAY = 0.5
    init_logging()
    logger = logging.getLogger(__name__)
    links = get_links(10)  # ['link0', 'link1', 'link2', ...]
    delays = (x * DELAY for x in range(0, len(links)))
    arguments = zip(delays, links)  # (0.0, 'link0'), (0.5, 'link1'), ...
    with ThreadPool(10) as pool:
        result = pool.starmap(get_info, arguments)
    print(result)
Example output:
[2018-10-03 22:04:14,221 INFO Thread-8 get_info()] --- link0_info
[2018-10-03 22:04:14,721 INFO Thread-5 get_info()] --- link1_info
[2018-10-03 22:04:15,221 INFO Thread-3 get_info()] --- link2_info
[2018-10-03 22:04:15,722 INFO Thread-4 get_info()] --- link3_info
[2018-10-03 22:04:16,223 INFO Thread-1 get_info()] --- link4_info
[2018-10-03 22:04:16,723 INFO Thread-6 get_info()] --- link5_info
[2018-10-03 22:04:17,224 INFO Thread-7 get_info()] --- link6_info
[2018-10-03 22:04:17,723 INFO Thread-10 get_info()] --- link7_info
[2018-10-03 22:04:18,225 INFO Thread-9 get_info()] --- link8_info
[2018-10-03 22:04:18,722 INFO Thread-2 get_info()] --- link9_info
['link0_info', 'link1_info', 'link2_info', 'link3_info', 'link4_info',
'link5_info', 'link6_info', 'link7_info', 'link8_info', 'link9_info']
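One caveat worth checking: the cumulative delays line up neatly here because there are as many worker threads as links, so every task starts at the same moment. When there are more links than threads, each sleep begins only once a worker picks the task up, so the spacing may drift. The following is a rough sketch of a different technique, a shared throttle guarded by a lock, that does not depend on the pool size. The names Throttle, wait, run and start_times, as well as the 0.1 second interval, are assumptions made for illustration only.

```python
import threading
import time
from multiprocessing.pool import ThreadPool


class Throttle:
    """Hypothetical helper: spaces calls from many threads `interval` seconds apart."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self.interval
        time.sleep(max(0.0, slot - now))


throttle = Throttle(0.1)  # assumed interval, chosen only for the demo
start_times = []          # records when each simulated request begins


def get_info(nlink):
    throttle.wait()                       # the delay goes right before the request
    start_times.append(time.monotonic())  # stands in for requests.get(nlink)
    return nlink + '_info'


def run(links, workers=3):
    with ThreadPool(workers) as pool:
        return pool.map(get_info, links)


if __name__ == '__main__':
    print(run([f'link{i}' for i in range(6)]))
```

With this approach the plain pool.map call from the question can stay as it is, and the only change inside get_info is the throttle.wait() call ahead of the request.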