How do I scrape multi-level links with Puppeteer?

Asked: 2018-02-19 10:55:51

Tags: node.js web-scraping puppeteer

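I am using Puppeteer to scrape the rows of a table on a web page. I already have code that scrapes the content and builds one object per table row. For each row, I also need to open a link in a new page (with Puppeteer), scrape a specific element there, assign it to the same object under a new key, and return the whole object. How can this be done with Puppeteer?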

async function run() {
    const browser = await puppeteer.launch({
        headless: false
    })
    const page = await browser.newPage()

    await page.goto('https://tokenmarket.net/blockchain/', {waitUntil: 'networkidle0'})
    await page.waitFor(5000)
    var onlink = ''
    var result = await page.$$eval('table > tbody tr .col-actions a:first-child', (els) => Array.from(els).map(function(el) {

        //running ajax requests to load the inner page links.
     $.get(el.children[0].href, function(response) {
            onlink = $(response).find('#page-wrapper > main > div.container > div > table > tbody > tr > td:nth-child(2)').text()
        })



        return {
            icoImgUrl: el.children[0].children[0].children[0].currentSrc,
            icoDate: el.children[2].innerText.split('\n').shift() === 'To be announced' ? null : new Date( el.children[2].innerText.split('\n').shift() ).toISOString(),
            icoName:el.children[1].children[0].innerText,
            link:el.children[1].children[0].children[0].href,
            description:el.children[3].innerText,
            assets :onlink
        }

    }))

    console.log(result)

    UpcomingIco.insertMany(result, function(error, docs) {})


    browser.close()
}

run()

1 answer:

Answer 0 (score: 5)

If you try to open a new tab for each ICO page in parallel, you may end up loading more than 100 pages at the same time.

So the best thing you can do is collect the URLs first and then visit them one by one.

This also keeps the code simple and readable.
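To make the difference concrete, here is a minimal sketch that does not use Puppeteer at all. The names `makeTracker`, `loadAllInParallel`, and `loadOneByOne` are hypothetical, and the 5 ms timer only stands in for a real page load. The tracker records how many loads overlap, which is roughly how many tabs would be open at once.

```javascript
// Hypothetical helper: simulates loading a page and tracks how many loads overlap.
function makeTracker() {
    const tracker = { inFlight: 0, peak: 0 };
    tracker.load = async (url) => {
        tracker.inFlight += 1;
        tracker.peak = Math.max(tracker.peak, tracker.inFlight);
        // Stand-in for a real page load such as page.goto(url).
        await new Promise(resolve => setTimeout(resolve, 5));
        tracker.inFlight -= 1;
        return url;
    };
    return tracker;
}

// Parallel: every load is started before any of them has finished.
async function loadAllInParallel(urls, load) {
    return Promise.all(urls.map(url => load(url)));
}

// Sequential: each load finishes before the next one begins.
async function loadOneByOne(urls, load) {
    const loaded = [];
    for (const url of urls) {
        loaded.push(await load(url));
    }
    return loaded;
}
```

With the parallel version the peak number of overlapping loads equals the number of URLs, while the sequential loop never has more than one load in flight.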

For example (see my comments):

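    // Assumes this snippet runs inside an async function, such as run() in the question.
    const puppeteer = require('puppeteer');

    const browser = await puppeteer.launch({ headless: false });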

    const page = await browser.newPage();

    await page.goto('https://tokenmarket.net/blockchain/');

    // Gather assets page urls for all the blockchains
    const assetUrls = await page.$$eval('.table-assets > tbody > tr .col-actions a:first-child', assetLinks => assetLinks.map(link => link.href));

    const results = [];

    // Visit each assets page one by one
    for (let assetsUrl of assetUrls) {
        await page.goto(assetsUrl);

        // Now collect all the ICO urls.
        const icoUrls = await page.$$eval('#page-wrapper > main > div.container > div > table > tbody > tr > td:nth-child(2) a', links => links.map(link => link.href));

        // Visit each ICO one by one and collect the data.
        for (let icoUrl of icoUrls) {
            await page.goto(icoUrl);

            const icoImgUrl = await page.$eval('#asset-logo-wrapper img', img => img.src);
            const icoName = await page.$eval('h1', h1 => h1.innerText.trim());
            // TODO: Gather all the needed info like description etc here.

            // Push a plain object (not an array) so that results stays a flat list.
            results.push({
                icoName,
                icoUrl,
                icoImgUrl
            });
        }
    }

    // Results are ready
    console.log(results);

    await browser.close();
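The question also asks how to attach the value scraped from the inner page to the row object it belongs to. One possible way to do that, as a sketch only: `addDetailToRows` and `makePuppeteerScraper` are hypothetical names, the `link` and `assets` keys and the selector are taken from the question's code, and the error handling (storing `null` when a detail page cannot be scraped) is an assumption about the desired behavior.

```javascript
// Visits each row's link in turn and stores the scraped value under a new key.
// `scrapeDetail` is a hypothetical async callback: (url) => value.
async function addDetailToRows(rows, scrapeDetail, key = 'assets') {
    const enriched = [];
    for (const row of rows) {
        let detail = null;
        try {
            detail = await scrapeDetail(row.link);
        } catch (err) {
            // Assumption: keep the row even if its detail page cannot be scraped.
            console.warn(`Could not scrape ${row.link}: ${err.message}`);
        }
        // Copy the row and add the new key; the original object is left untouched.
        enriched.push({ ...row, [key]: detail });
    }
    return enriched;
}

// Possible Puppeteer wiring; the selector is the one used in the question.
function makePuppeteerScraper(page) {
    const selector = '#page-wrapper > main > div.container > div > table > tbody > tr > td:nth-child(2)';
    return async (url) => {
        await page.goto(url);
        // page.$eval rejects if nothing matches; addDetailToRows catches that.
        return page.$eval(selector, td => td.innerText.trim());
    };
}
```

Inside an async function, this could be used as `const result = await addDetailToRows(rows, makePuppeteerScraper(page));`, where `rows` is the array produced by the first `page.$$eval` call. Because the loop reuses a single page, only one detail page is loaded at a time.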