puppeteer: unable to log in and loop through URLs

时间:2019-02-26 13:06:17

标签: node.js asynchronous web-scraping puppeteer

Hi everyone, I want to log in to a website and, once authenticated, loop through a given set of URLs and scrape data. The example below illustrates what I'm trying to do, but no matter what I try I get an unhandled promise rejection.

const puppeteer = require("puppeteer");

list = [
	"https://www.facebook.com/",
	"https://www.google.com/",
	"https://www.zocdoc.com/"
];

const getTitle = async (p, url) => {
    try{
        await p.goto(url);
        const title = await p.title();
        console.log(title);
    }
    catch(e) {
        console.log(e)
    }

    return title
};

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    console.log(this)
    for (var url of list) {
        getTitle(page, url)
    }
    await browser.close();
})();

1 Answer:

Answer 0 (score: 0)

There are several problems in this example.

  1. You should await the call to the getTitle function. You are awaiting inside the function, but the call itself must also be awaited; otherwise the loop fires off every navigation without waiting, browser.close() runs while they are still in flight, and any rejection surfaces as an unhandled promise rejection.

  2. You should wrap the call to getTitle in a try/catch block, and inside the function check whether there is actually a title to return (for example, Google's title is empty).

    const puppeteer = require("puppeteer");

    const list = [
        "https://www.facebook.com/",
        "https://www.google.com/",
        "https://www.zocdoc.com/"
    ];

    // Navigate to the URL and return its title; rethrow on failure so the
    // caller decides how to handle it.
    const getTitle = async (p, url) => {
        try {
            await p.goto(url);
            const title = await p.title();
            if (title) {
                return title;
            }
        }
        catch (e) {
            console.log(e);
            throw e;
        }
    };

    (async () => {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        for (const url of list) {
            try {
                // Awaiting the call lets errors surface here instead of
                // becoming unhandled promise rejections.
                console.log(await getTitle(page, url));
            }
            catch (e) {
                console.log('No title');
            }
        }
        await browser.close();
    })();
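
Since the question also asks about logging in before looping over the URLs (which neither snippet shows), here is a minimal sketch of how a login step could run once before the loop. The login URL, selectors, and environment variables below are placeholders assumed for illustration, not values from the original post:

    // Hypothetical login step: run once with the same page before the URL loop,
    // so the authenticated session cookies are reused on every page.goto().
    const login = async (page) => {
        await page.goto("https://example.com/login");            // placeholder login URL
        await page.type("#username", process.env.SCRAPE_USER);   // placeholder selector and credential
        await page.type("#password", process.env.SCRAPE_PASS);   // placeholder selector and credential
        await Promise.all([
            page.waitForNavigation(),                            // resolves after the post-login redirect
            page.click("button[type=submit]")                    // placeholder submit button selector
        ]);
    };

Calling `await login(page)` right after `browser.newPage()` keeps the rest of the loop unchanged, since the cookies set during login are sent automatically on the subsequent navigations.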