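To build a scraper for pages with dynamically loaded content, requests-html provides a way to get the rendered page after its JavaScript has run. However, when I call the arender() method through an AsyncHTMLSession in a multithreaded implementation, the HTML I get back does not change.
For example, at the URL in the source code below, the table cells are empty in the initial HTML. After script execution, which arender() emulates, I expect the values to be inserted into the markup, but I see no change in the output.

Source code: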
from pprint import pprint
#from bs4 import BeautifulSoup
import asyncio
from timeit import default_timer
from concurrent.futures import ThreadPoolExecutor
from requests_html import AsyncHTMLSession, HTML


async def fetch(session, url):
    r = await session.get(url)
    await r.html.arender()
    return r.content


def parseWebpage(page):
    print(page)


async def get_data_asynchronous():
    urls = [
        'http://www.fpb.pt/fpb2014/!site.go?s=1&show=jog&id=258215'
    ]
    with ThreadPoolExecutor(max_workers=20) as executor:
        with AsyncHTMLSession() as session:
            # Set any session parameters here before calling `fetch`
            # Initialize the event loop
            loop = asyncio.get_event_loop()
            # Use list comprehension to create a list of
            # tasks to complete. The executor will run the `fetch`
            # function for each url in the `urls` list
            tasks = [
                await loop.run_in_executor(
                    executor,
                    fetch,
                    *(session, url)  # Allows us to pass in multiple arguments to `fetch`
                )
                for url in urls
            ]
            # Initializes the tasks to run and awaits their results
            for response in await asyncio.gather(*tasks):
                parseWebpage(response)


def main():
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(get_data_asynchronous())
    loop.run_until_complete(future)


main()
Answer 0 (score: 0)
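After the render method has run, the rendered source is not under the response's content attribute but under raw_html on the HTML object. In this case, the value returned from fetch should be r.html.raw_html.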
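As a minimal sketch of that change, the coroutine below returns r.html.raw_html instead of r.content. The names fetch_rendered, fetch_all and main are hypothetical, and dropping the ThreadPoolExecutor in favour of a plain asyncio.gather is my own simplification, not something the question requires. It assumes requests-html is installed and can launch its headless browser.

```python
import asyncio


async def fetch_rendered(session, url):
    """Return the page markup after its JavaScript has run."""
    r = await session.get(url)
    await r.html.arender()
    # r.content still holds the bytes that were originally downloaded;
    # the rendered markup is stored on the HTML object.
    return r.html.raw_html


async def fetch_all(session, urls):
    # gather() returns results in the same order as `urls`
    return await asyncio.gather(*(fetch_rendered(session, u) for u in urls))


def main(urls):
    # Imported here so the helpers above do not need requests-html at import time
    from requests_html import AsyncHTMLSession

    session = AsyncHTMLSession()
    # run() takes callables that return coroutines and gives back a list of results
    return session.run(lambda: fetch_all(session, urls))[0]
```

Calling main() with the list of URLs from the question should then give one bytes object of rendered markup per URL, which can be passed to parseWebpage as before.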