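I've written a Python script using pyppeteer that collects the links to different posts from a web page and then parses each post's title by visiting the target pages with those collected links. The content is static, but I'd like to know how pyppeteer works in a case like this.
I tried to pass the browser variable from main() to fetch() and browse_all_links() so that the same browser can be reused.
My current approach: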
import asyncio
from pyppeteer import launch

url = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page,url):
    await page.goto(url)
    linkstorage = []
    await page.waitForSelector('.summary .question-hyperlink')
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    return linkstorage

async def browse_all_links(page,link):
    await page.goto(link)
    await page.waitForSelector('h1 > a')
    title = await page.querySelectorEval('h1 > a','(e => e.innerText)')
    print(title)

async def main():
    browser = await launch(headless=False,autoClose=False)
    [page] = await browser.pages()
    links = await fetch(page,url)
    tasks = [await browse_all_links(page,url) for url in links]
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(main())
The script above fetches some titles, but at some point during execution it throws the following error:
Possible to select <a> with specific text within the quotes?
Crawler Runs Too Slow
How do I loop a list of ticker to scrape balance sheet info?
How to retrive the url of searched video from youtbe using python
VBA-JSON to import data from all pages in one table
Is there an algorithm that detects semantic visual blocks in a webpage?
find_all only scrape the last value
#ERROR STARTS
Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Runtime.releaseObject): Cannot find context with specified id')>
pyppeteer.errors.NetworkError: Protocol error (Runtime.releaseObject): Cannot find context with specified id
Future exception was never retrieved
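Answer 0 (score: 1)
Two days have passed since this question was posted and nobody has answered yet, so I'll take the opportunity to address it, in the hope that it helps you.
There are 15 links, but you only got 7 titles. Possibly websockets lost the connection and the page became unreachable.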
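The list comprehension
tasks = [await browse_all_links(page,url) for url in links]
What is this list for? Each await runs browse_all_links() to completion before the next one starts, so if everything succeeds the list contains only None values rather than tasks. Your next line of code, which passes that list to asyncio.gather(), will therefore raise an error!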
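The point about the list comprehension can be checked without a browser. The following is only a minimal sketch: `visit` is a hypothetical stand-in for `browse_all_links()` that returns nothing, `calls` and `demo` are names made up for the illustration, and `asyncio.run()` assumes Python 3.7 or later.

```python
import asyncio

calls = []  # records the order in which the stand-in coroutine ran


async def visit(link):
    # Hypothetical stand-in for browse_all_links(): does its work, returns None.
    await asyncio.sleep(0)
    calls.append(link)


async def demo(links):
    # Each await finishes before the next iteration starts, so the links are
    # visited one after another and the list collects the return values.
    results = [await visit(link) for link in links]

    # gather() expects awaitables; handing it the None values raises TypeError.
    try:
        await asyncio.gather(*results)
        error = None
    except TypeError as exc:
        error = exc
    return results, error


results, error = asyncio.run(demo(["a", "b", "c"]))
print(results)
print(type(error).__name__)
```

By the time the list exists, every link has already been processed sequentially, so the later `gather()` call has nothing left to run.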
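Solution
Downgrade websockets from 7.0 to 6.0.
Remove this line of code: await asyncio.gather(*tasks)
I'm using Python 3.6, so I had to change the last line of code, because asyncio.run() was only added in Python 3.7. If you are on Python 3.7 or later, which I think you are, you can keep asyncio.run(main()).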
import asyncio
from pyppeteer import launch

url = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page,url):
    await page.goto(url)
    linkstorage = []
    await page.waitForSelector('.summary .question-hyperlink')
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    return linkstorage

async def browse_all_links(page,link):
    await page.goto(link)
    await page.waitForSelector('h1 > a')
    title = await page.querySelectorEval('h1 > a','(e => e.innerText)')
    print(title)

async def main():
    browser = await launch(headless=False,autoClose=False)
    [page] = await browser.pages()
    links = await fetch(page,url)
    tasks = [await browse_all_links(page,url) for url in links]
    #await asyncio.gather(*tasks)
    await browser.close()

if __name__ == '__main__':
    #asyncio.run(main())
    asyncio.get_event_loop().run_until_complete(main())
(testenv) C:\Py\pypuppeteer1>python stack3.py
Scrapy Shell response.css returns an empty array
Scrapy real-time spider
Why do I get KeyError while reading data with get request?
Scrapy spider can't redefine custom_settings according to args
Custom JS Script using Lua in Splash UI
Can someone explain why and how this piece of code works [on hold]
How can I extract required data from a list of strings?
Scrapy CrawlSpider rules for crawling single page
how to scrap a web-page with search bar results, when the search query does not appear in the url
Nested for loop keeps repeating
Get all tags except a list of tags BeautifulSoup
Get current URL using Python and webbot
How to login to site and send data
Unable to append value to colums. Getting error IndexError: list index out of range
NextSibling.Innertext not working. “Object doesn't support this property”
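If the aim is to visit the links concurrently while still reusing one browser, one possible arrangement is to open a separate tab per link and pass un-awaited coroutines to `asyncio.gather()`. This is a rough sketch rather than the method used in the answer above: `fetch_title`, `fetch_all_titles` and `max_tabs` are hypothetical names, the tab limit of 3 is arbitrary, the selector is copied from the question, and it has not been run against Stack Overflow.

```python
import asyncio


async def fetch_title(browser, link, limit):
    # The semaphore caps how many tabs are open at the same time.
    async with limit:
        page = await browser.newPage()  # a fresh tab in the shared browser
        try:
            await page.goto(link)
            await page.waitForSelector('h1 > a')
            return await page.querySelectorEval('h1 > a', '(e => e.innerText)')
        finally:
            await page.close()


async def fetch_all_titles(browser, links, max_tabs=3):
    # max_tabs is an arbitrary illustrative value, not a recommendation.
    limit = asyncio.Semaphore(max_tabs)
    # Build coroutine objects WITHOUT awaiting them; gather() runs them
    # concurrently and returns the results in the order of the input links.
    tasks = [fetch_title(browser, link, limit) for link in links]
    return await asyncio.gather(*tasks)


async def main(links):
    # Imported here so the helpers above do not depend on pyppeteer at import time.
    from pyppeteer import launch
    browser = await launch()
    try:
        for title in await fetch_all_titles(browser, links):
            print(title)
    finally:
        await browser.close()
```

With `links` collected as in `fetch()`, this could be started with `asyncio.run(main(links))` on Python 3.7 or later, or with `asyncio.get_event_loop().run_until_complete(main(links))` on Python 3.6. Giving each link its own page avoids having several navigations compete for a single tab.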