I'm trying to learn how to run tasks concurrently with Python's asyncio module. In the code below I have a mock "web crawler" as an example. Basically, I want it so that at most two active fetch() requests are in flight at any given time, and I want process() to be called during the sleep() periods.
import asyncio

class Crawler():
    urlq = ['http://www.google.com', 'http://www.yahoo.com',
            'http://www.cnn.com', 'http://www.gamespot.com',
            'http://www.facebook.com', 'http://www.evergreen.edu']
    htmlq = []
    MAX_ACTIVE_FETCHES = 2
    active_fetches = 0

    def __init__(self):
        pass

    async def fetch(self, url):
        self.active_fetches += 1
        print("Fetching URL: " + url);
        await(asyncio.sleep(2))
        self.active_fetches -= 1
        self.htmlq.append(url)

    async def crawl(self):
        while self.active_fetches < self.MAX_ACTIVE_FETCHES:
            if self.urlq:
                url = self.urlq.pop()
                task = asyncio.create_task(self.fetch(url))
                await task
            else:
                print("URL queue empty")
                break;

    def process(self, page):
        print("processed page: " + page)

# main loop
c = Crawler()
while(c.urlq):
    asyncio.run(c.crawl())

while c.htmlq:
    page = c.htmlq.pop()
    c.process(page)
However, the code above downloads the URLs one after another (not two at a time), and it doesn't do any "processing" until all URLs have been fetched. How can I make the fetch() tasks run concurrently, and have process() called in between, during the sleep() calls?
Answer 0 (score: 2)
Your crawl method awaits right after each task is created, so the tasks finish one at a time; you should change it to:
async def crawl(self):
    tasks = []
    while self.active_fetches < self.MAX_ACTIVE_FETCHES:
        if self.urlq:
            url = self.urlq.pop()
            tasks.append(asyncio.create_task(self.fetch(url)))
        else:
            break
    await asyncio.gather(*tasks)
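If you keep the rest of the question's Crawler unchanged, the driver code at the bottom can then shrink to a single asyncio.run() call; a minimal sketch:

c = Crawler()
asyncio.run(c.crawl())          # all fetches now run concurrently inside one event loop
while c.htmlq:
    c.process(c.htmlq.pop())

Processing still happens only after every fetch has finished, though; interleaving the two is what the cleaner version below addresses.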
EDIT: here is a cleaner version, with comments, that fetches and processes concurrently while keeping the basic ability to limit the maximum number of fetchers.
import asyncio

class Crawler:

    def __init__(self, urls, max_workers=2):
        self.urls = urls
        # queue of URLs waiting to be fetched; concurrency is limited by
        # the number of workers rather than by the queue itself
        self.fetching = asyncio.Queue()
        self.max_workers = max_workers

    async def crawl(self):
        # DON'T await here; start consuming things out of the queue, and
        # meanwhile execution of this function continues. We'll start
        # max_workers worker coroutines, each of which both fetches and processes.
        all_the_coros = asyncio.gather(
            *[self._worker(i) for i in range(self.max_workers)])

        # place all URLs on the queue
        for url in self.urls:
            await self.fetching.put(url)

        # now put a bunch of `None`'s in the queue as signals to the workers
        # that there are no more items in the queue.
        for _ in range(self.max_workers):
            await self.fetching.put(None)

        # now make sure everything is done
        await all_the_coros

    async def _worker(self, i):
        while True:
            url = await self.fetching.get()
            if url is None:
                # this coroutine is done; simply return to exit
                return

            print(f'Fetch worker {i} is fetching a URL: {url}')
            page = await self.fetch(url)
            self.process(page)

    async def fetch(self, url):
        print("Fetching URL: " + url)
        await asyncio.sleep(2)
        return f"the contents of {url}"

    def process(self, page):
        print("processed page: " + page)

# main loop
c = Crawler(['http://www.google.com', 'http://www.yahoo.com',
             'http://www.cnn.com', 'http://www.gamespot.com',
             'http://www.facebook.com', 'http://www.evergreen.edu'])
asyncio.run(c.crawl())
Answer 1 (score: 1)
You can make htmlq an asyncio.Queue() and change the htmlq.append call to await htmlq.put(...). Then your main can be async, like this:
async def main():
    c = Crawler()
    asyncio.create_task(c.crawl())
    while True:
        page = await c.htmlq.get()
        if page is None:
            break
        c.process(page)
Your top-level code then boils down to a single call to asyncio.run(main()).
Once it has finished crawling, crawl() can put None into the queue to notify the main coroutine that the work is done.
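For concreteness, here is a minimal sketch of the Crawler-side changes that description implies (htmlq as an asyncio.Queue, an awaited put() in fetch(), and a None sentinel at the end of crawl()); it drops the MAX_ACTIVE_FETCHES limit for brevity:

import asyncio

class Crawler:
    def __init__(self):
        self.urlq = ['http://www.google.com', 'http://www.yahoo.com',
                     'http://www.cnn.com', 'http://www.gamespot.com',
                     'http://www.facebook.com', 'http://www.evergreen.edu']
        self.htmlq = asyncio.Queue()   # main() awaits pages from this queue

    async def fetch(self, url):
        print("Fetching URL: " + url)
        await asyncio.sleep(2)
        await self.htmlq.put(url)      # was: self.htmlq.append(url)

    async def crawl(self):
        # start every fetch as a task, then wait for all of them
        tasks = [asyncio.create_task(self.fetch(url)) for url in self.urlq]
        await asyncio.gather(*tasks)
        # signal the main coroutine that no more pages are coming
        await self.htmlq.put(None)

    def process(self, page):
        print("processed page: " + page)

Combined with the main() above and asyncio.run(main()), pages are processed as their fetches complete rather than only after the whole URL queue has been drained.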