I have some I/O-bound code, basically web scraping for a research project.
The code started out imperative, then became list comprehensions, and by now has mostly turned into generators:
from contextlib import suppress

import requests
from bs4 import BeautifulSoup

# scrape_host, keywords_for_resource, scrape, post_data_with_exception_handling
# and baseUrl are project-specific helpers defined elsewhere.
if __name__ == '__main__':
    while True:
        with suppress(Exception):
            page = requests.get(baseUrl).content
            urls = (baseUrl + link['href'] for link in
                    BeautifulSoup(page, 'html.parser').select('.tournament a'))
            resources = (scrape_host(url) for url in urls)
            keywords = ((keywords_for_resource(referer, site_id), rid)
                        for referer, site_id, rid in resources)
            output = (scrape(years, animals) for years, animals in keywords)
            responses = (post_data_with_exception_handling(list(data)) for data in output)
            for response in responses:
                print(response.status_code)
This kind of code suits how I think: being generator-based, it holds very little state, and I assumed I could easily turn it into asyncio-based code:
async def fetch(session, url):
    with async_timeout.timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def main(loop):
    async with aiohttp.ClientSession(loop=loop) as session:
        page = await fetch(session, baseUrl)
        urls = (baseUrl + link['href'] for link in
                BeautifulSoup(page, 'html.parser').select('.tournament a'))
        subpages = (await fetch(session, url) for url in urls)
But on Python 3.5 this just raises a SyntaxError, because await expressions are not allowed inside comprehensions.
Python 3.6 promises asynchronous generators in PEP 530.
Will this feature let me easily convert my generator-based code to asyncio code, or will it require a complete rewrite?
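For reference, on Python 3.6+ PEP 530 does make await legal inside comprehensions; a minimal sketch of how the failing line would read there, reusing fetch, session and urls from the snippet above:

# Inside an async def, valid on Python 3.6+ under PEP 530; the list comprehension
# still awaits the fetches one at a time, so it removes the SyntaxError without
# adding any concurrency by itself.
subpages = [await fetch(session, url) for url in urls]
# The generator-expression form instead becomes an asynchronous generator,
# which has to be consumed with `async for` rather than a plain for loop.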
Answer 0 (score 0):
asyncio.as_completed() might be a better fit here:
# pip install beautifulsoup4 aiohttp
import asyncio
from urllib.parse import urljoin

import aiohttp
import async_timeout
from bs4 import BeautifulSoup

BASE_URL = "http://www.thewebsiteyouarescraping.com/"
SELECTOR = ".tournament a"


async def fetch(session, url):
    with async_timeout.timeout(10):
        async with session.get(url) as response:
            # Return the url alongside the body; results arrive out of order below.
            return url, await response.text()


async def main(base_url, selector, loop):
    async with aiohttp.ClientSession(loop=loop) as session:
        _, page = await fetch(session, base_url)
        urls = (urljoin(base_url, link['href']) for link in
                BeautifulSoup(page, 'html.parser').select(selector))
        tasks = [fetch(session, url) for url in urls]
        for fut in asyncio.as_completed(tasks, loop=loop):
            process(*await fut)
        # Compare with awaiting the coroutines in list order:
        # for fut in tasks:
        #     process(*await fut)


def process(url, page):
    print(url, len(page))


loop = asyncio.get_event_loop()
loop.run_until_complete(main(BASE_URL, SELECTOR, loop))
loop.close()
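The commented-out comparison is the key design point: asyncio.as_completed() schedules all the fetches at once and yields each one as soon as it finishes, so pages are processed in completion order while the remaining downloads keep running, whereas awaiting the coroutines in list order downloads and processes one page at a time. Because the completion order is unpredictable, fetch() returns the url together with the body so process() knows which page it received.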