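I've written a script in Python using pyppeteer to collect the names of different institutions from a website, traversing multiple pages. What I want is for the script to parse the names on each page while clicking the "Next Page" button to move through the pages.
What I've tried: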
import asyncio
from pyppeteer import launch

url = "https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx"

async def fetch_table(link):
    browser = await launch(headless=False)
    [page] = await browser.pages()
    await page.goto(link)
    while True:
        await page.waitForSelector("h1.faqsno-heading", {'visible':True})
        for item in await page.querySelectorAll("h1.faqsno-heading"):
            name = await item.querySelectorEval("div[id^='arrowex']",'e => e.innerText')
            print(name)

        try:
            elem = await page.querySelector("[title='Next Page']")
            await elem.click()
        except Exception: break

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(fetch_table(url))
The script above works fine until it hits the following error somewhere between pages 5 and 10. The exact page varies.
Traceback (most recent call last):
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 23, in <module>
loop.run_until_complete(fetch_table(url))
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\asyncio\base_events.py", line 568, in run_until_complete
return future.result()
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 11, in fetch_table
await page.waitForSelector("h1.faqsno-heading", {'visible':True})
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\frame_manager.py", line 834, in __await__
raise result
pyppeteer.errors.TimeoutError: Waiting for selector "h1.faqsno-heading" failed: timeout 30000ms exceeds.
However, when I make a small change and click like this instead, the script again works for a while until it hits a different error:
try:
    await page.click("[title='Next Page']")
except Exception: break
I get the following error:
Traceback (most recent call last):
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 48, in <module>
loop.run_until_complete(fetch_table(url))
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\asyncio\base_events.py", line 568, in run_until_complete
return future.result()
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 37, in fetch_table
await page.waitForSelector("h1.faqsno-heading", {'visible':True})
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\frame_manager.py", line 832, in __await__
result = yield from self.promise
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\frame_manager.py", line 859, in rerun
*self._args,
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
_rewriteError(e)
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\execution_context.py", line 239, in _rewriteError
raise error
File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error Runtime.callFunctionOn: Target closed.
How can I make my script keep running until all the clicks are done?
Answer 0 (score: 0)
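Note that the website you are scraping has hundreds of pages! I did not want my system tied up in a long-running process, so I tried with slots = 20 pages instead, and it seems to work. You can change the number of slots to experiment.
I am using Python 3.6 and websockets 6.0, on Windows 8.1.
I added a few lines of code to limit the number of pages. Besides that, I also added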
await page.waitForSelector("[title='Next Page']", {'visible':True})
in a couple of places.
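In isolation, that wait-then-click step could be sketched as a small helper. This is only an illustration: `click_next` and `timeout_ms` are names made up here, `page` is expected to be a pyppeteer `Page`, and the `except` clause assumes that pyppeteer's `TimeoutError` derives from `asyncio.TimeoutError` (if that is not the case in your version, catch `pyppeteer.errors.TimeoutError` instead).

```python
import asyncio

NEXT_PAGE = "[title='Next Page']"  # selector used in the question


async def click_next(page, timeout_ms=30000):
    """Wait until the Next Page control is visible, then click it.

    Returns False if the control does not show up within the timeout,
    which this sketch treats as "no more pages".
    """
    try:
        await page.waitForSelector(
            NEXT_PAGE, {"visible": True, "timeout": timeout_ms})
    except asyncio.TimeoutError:
        # Assumption: pyppeteer.errors.TimeoutError subclasses
        # asyncio.TimeoutError, so this also catches pyppeteer's timeout.
        return False
    await page.click(NEXT_PAGE)
    return True
```

Inside the scraping loop this would be used as `if not await click_next(page): break`, which avoids the blanket `except Exception` and so does not hide unrelated errors.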
Here is the full code:
import asyncio
from pyppeteer import launch

url = "https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx"

async def fetch_table(link):
    browser = await launch(headless=False)
    [page] = await browser.pages()
    await page.goto(link)
    slots = 20  # change here for number of pages you want to scrape
    i = 0
    while True:
        i = i + 1
        if i > slots:
            await page.waitForSelector("[title='Next Page']", {'visible':True})
            break
        await page.waitForSelector("h1.faqsno-heading", {'visible':True})
        for item in await page.querySelectorAll("h1.faqsno-heading"):
            name = await item.querySelectorEval("div[id^='arrowex']",'e => e.innerText')
            print(name)

        try:
            await page.waitForSelector("[title='Next Page']", {'visible':True})
            elem = await page.querySelector("[title='Next Page']")
            await elem.click()
        except Exception: break

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(fetch_table(url))
Output for about 20 pages:
(testenv) C:\Py\pypuppeteer1>python stack5.py
....
....
SHREE SUBRAHMANYA VANGMAYEE PARISHAD, GOAAAPTS2410M
SHREE SUBRAHMANYA VANGMAYEE PARISHAD, GOAAAPTS2410M
WORD FOR THE WORLD FELLOWSHIPAAAAW6295Q
JANA SEVA TRUSTAACTJ0594Q
VAGDEVI VILAS EDUCATIONAL AND CHARITABLE TRUSTAABTV8264G
NCORE IMPACT FOUNDATIONAAFCN9985K
M V M EDUCATIONAL TRUSTAACTM5633K
SOCIETY FOR BETTERMENT OF EDUCATIONAAHAS9354D
SWASTIKAM CHARITABLE TRUSTAAJTS9298K
M/S SANKALP YUVA PRERIT SANVARDHAN BAHUUDDESHIYA SANSTHAAAITS8452J
TRAILOKYA BOUDHA MAHASANGHA SAHAYYAK GAN NAGPURAAABT2581K
MISSIONAL YATRA INDIA (MY INDIA) CHARITABLE TRUSTAAOTM9109M
VRUNDAVAN SHIKSHAN VA BAHUUDDESHIYA SANSTHAAABAV6403C
SHRI JAGDAMBA GOVIGYAN ANUSANDHAN KENDRAAAQTS8474C
SUSHILABAI DEUSKAR PRATISHTHANAALTS8647L
AMRAVATI DISTRICT OPTHALMIC SOCIETYAAETA8499F
ALUMNI ASSOCIATION OF INDIRA GANDHI GOVERNMENT MEDICAL COLLEGE NAGPURAAGTA1367C
VIDYA NIDHI NAGPURAABTN4351L
LATE RAJSINGH DUNGAPUR MEMORIAL FOUNDATIONAABTL5457B
ARTHIK DRUSTYA MAGASVARGIYA SAMAJ SHIKSHAN SANSTHAAACTA6288L
SPARSHAADAS4064Q
LATE PADMADEVI R. MALOO FOUNDATIONAAATL4181B
VISHWARACHNA GRAMINS VIKAS SANSTHAAAATV5359D
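A possible refinement, which is not part of the answer above: after clicking, wait until the first name on the page actually changes before reading the list again, so the loop never re-reads a page that has not been replaced yet. The sketch below assumes the site swaps the list in place rather than doing a full navigation (I have not verified this). `scrape_pages`, `max_pages` and `FIRST_NAME_CHANGED` are made-up names, `page` is expected to be a pyppeteer `Page`, and the combined selector in `NAME` is my own merge of the two selectors from the scripts above.

```python
NAME = "h1.faqsno-heading div[id^='arrowex']"  # merged from the selectors above
NEXT_PAGE = "[title='Next Page']"

# JavaScript predicate: true once the first name's text differs from `old`.
FIRST_NAME_CHANGED = """(selector, old) => {
    const el = document.querySelector(selector);
    return el !== null && el.innerText !== old;
}"""


async def scrape_pages(page, max_pages=20):
    """Return the names found on up to max_pages pages of results."""
    names = []
    for page_no in range(max_pages):
        await page.waitForSelector(NAME, {"visible": True})
        batch = await page.querySelectorAllEval(
            NAME, "els => els.map(e => e.innerText)")
        names.extend(batch)

        if page_no == max_pages - 1 or not batch:
            break  # page limit reached, or nothing to compare against
        button = await page.querySelector(NEXT_PAGE)
        if button is None:
            break  # no Next Page control: this was the last page
        await button.click()
        # Block until the list has been replaced, so the next iteration
        # does not read the page that was just scraped.
        await page.waitForFunction(FIRST_NAME_CHANGED, {}, NAME, batch[0])
    return names
```

It would be called as `names = await scrape_pages(page, max_pages=20)` right after `await page.goto(link)`. Returning a list instead of printing also makes it easier to save or de-duplicate the results afterwards.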