Puppeteer / Node.js: function structure looks correct, but it doesn't seem to execute

Time: 2019-10-07 16:25:07

Tags: node.js puppeteer

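I am trying to learn Puppeteer / Node.js. I followed a few tutorials and got a basic scraper running that:
1. Goes to a website.
2. Scrolls to the bottom.
3. Creates a PDF.

The site I want to scrape has a comments feature (each post may or may not have comments).

I created a function called expandAllComments (code below), but it does not seem to do what I expect, which is to expand all the comments.

When learning, I am a big fan of console.log statements, and I added several inside the function (this is what I would do in Python; maybe it doesn't apply here).

The main scraper.js contains the following lines: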

async function processSite(browser) {
  console.log("Processing web site")
  const page = await browser.newPage()
  await page.goto('https://somewebsite.com', { waitUntil: ['networkidle2'] })

  //do some scrolling
  await delay(4000);

  try {
    await page.autoScroll()
  } catch (e) {
    console.log(`Error occurred while scrolling timeline: ${e}`)
    debug(`Error occurred while scrolling timeline: ${e}`)
  }

  await delay(1000)


  await delay(4000)

  //expand all the comments
  try {
    await page.expandAllComments()
  } catch (e) {
    console.log(`Error occurred while expanding comments: ${e}`)
    debug(`Error occurred while expanding comments: ${e}`)
  }
  await delay(2000)

  await page.scrollToTop()


  //save to pdf
  //await page.pdf({ format: 'letter', path: "test.pdf", printBackground: true });  // commented out so I can watch in non-headless mode; saving to PDF doesn't work in non-headless mode


}

The expandAllComments() function in the helper / plugin file (index.js) looks like this (it is heavy on console.log, sorry!).

I also verified the query selector in the browser console.

async expandAllComments () {

    console.log("This is a print statement from within the expandAllComments function")
    await this.evaluate(async () => {
      await new Promise((resolve) => {
        let link_clicker = () => {
          let lc = 0
          document
            .querySelectorAll('._commentToggle')
            .forEach(el => {
              {
                delay(10000);
                el.click()
                lc++
              }
            })
          return lc
        }

        let timer = setTimeout(function click_links () {
          let links_clicked = link_clicker()
          if (links_clicked) {
            timer = setTimeout(click_links, 1000)
          } else {
            clearTimeout(timer)
            resolve()
          }
        }, 10000)
      })
    })
  }

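In the resulting PDF (when I run it in headless mode), I expected to see the comments expanded, but that doesn't happen.

Also, when I watch the console.log output, this is what I see: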

Processing web site
Waiting before scrolling
Finished Waiting
Scrolling 
This is a print statement from within the autoscroll function
Finished the infinite scrollng.
Waiting before moving on

Waiting before moving on
about to expand comments
This is a print statement from within the expandAllComments function
all comments expanded
About to scroll to top
Scrolled to top

expandAllComments() prints only one of its console.log statements (maybe that is just how Node.js functions work, I'm not sure).
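A note on the missing log lines: code passed to page.evaluate runs inside the browser page, so any console.log there goes to the page's console rather than the Node terminal. A minimal sketch, assuming a Puppeteer `page` object, of relaying those messages through the page's `console` event. The helper name `forwardBrowserLogs` and the `[browser]` prefix are illustrative and not part of Puppeteer:

```javascript
// Hypothetical helper: relay console messages from the page to Node.
// `page` is assumed to be a Puppeteer Page; `sink` defaults to console.log.
function forwardBrowserLogs(page, sink = console.log) {
  page.on('console', (msg) => {
    // ConsoleMessage#text() returns the text that was logged in the page
    sink(`[browser] ${msg.text()}`)
  })
}

// Usage (before calling page.evaluate):
//   forwardBrowserLogs(page)
```

With this in place, log statements inside the evaluated function should show up in the terminal alongside the Node-side ones.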

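I guess I am trying to figure out how to make it click the comment toggle on every element the selector matches, so that the comments are included in the PDF.

Thanks in advance!

Update

I tried to simplify and clean up the code a bit, and then tried @md-abu-taher's suggestion.

New code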

//const puppeteer = require("puppeteer");
const chalk = require("chalk");
const fs = require('fs-extra');
// MY OCD of colorful console.logs for debugging... IT HELPS
const error = chalk.bold.red;
const success = chalk.keyword("green");

const puppeteer = require('puppeteer-extra')
puppeteer.use(require('puppeteer-extra-plugin-puppeteer-helper')())

function delay(time) {
    return new Promise(function(resolve) {
        setTimeout(resolve, time)
    });
 }

async function processSite(browser) {
    console.log("Processing web site")
    const page = await browser.newPage()
    await page.goto('somesite' , { waitUntil: ['networkidle2'] })

    await delay(10000)
    await autoScroll(page);
    await delay(10000)

    console.log("about to look for element")
    const comments = await page.$$('._commentToggleBtn')

    await delay(10000)
    for (let comment of comments) {
        await comment.click()
        console.log("Loop click")
        await page.waitFor(10000)
    }
    await page.scrollToTop()
    await page.pdf({ format: 'letter', path: "test.pdf", printBackground: true })
    await page.close()

}

async function autoScroll(page){
    await page.evaluate(async () => {
        await new Promise((resolve, reject) => {
            var totalHeight = 0
            var distance = 100
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight
                window.scrollBy(0, distance)
                totalHeight += distance

                if(totalHeight >= scrollHeight){
                    clearInterval(timer)
                    resolve()
                }
            }, 400)
        })
    })
}

(async () => {
    const browser = await puppeteer.launch({headless: true,
      defaultViewport: {
        width: 1366,
        height: 768,
      },
      args: [
        '--disable-dev-shm-usage', '--no-sandbox', '--disable-setuid-sandbox',
      ]
    })
    await processSite(browser)

    console.log("Finished processing site")

})()

Output

Processing web site
about to scroll entire site
scrolled entire site
about to look for element
Loop click
Loop click
Loop click
Loop click
Loop click
Loop click
Loop click
Loop click
Loop click
Loop click
Loop click
Loop click
Loop click
Finished clicking
Scrolling to top
Scrolled to top
saving to pdf
saved to pdf
Closing page
Finished processing site

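You can see that it found and clicked 12 elements, but there are more. When I watch the browser in GUI (non-headless) mode, those 12 elements are on the bottom page of the site. I can see that it really does click the elements and expand the comments, but it does not recursively go through all the pages.

Any help would be greatly appreciated. Thanks!

1 Answer:

Answer 0: (score: 0)

Problems: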

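  • Check whether delay is available inside page.evaluate. It is defined in the Node script, while the function passed to page.evaluate runs in the browser page.
  • .forEach and .click() do not work well together because of their asynchronous nature: forEach does not wait for asynchronous work started in its callback.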
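The second point can be seen without Puppeteer at all. The sketch below uses a stand-in `fakeClick` function (hypothetical, standing in for an asynchronous click such as ElementHandle#click) to compare the two loop styles:

```javascript
// Promise-based pause, same idea as the delay() helper in the question.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Stand-in for an asynchronous click; records when it completes.
async function fakeClick(name, log) {
  await delay(10)
  log.push(`clicked ${name}`)
}

async function withForEach(items) {
  const log = []
  // forEach ignores the promises returned by the async callback...
  items.forEach(async (item) => { await fakeClick(item, log) })
  // ...so this line runs before any click has finished.
  log.push('loop done')
  await delay(50) // demo only: let the pending clicks settle before returning
  return log
}

async function withForOf(items) {
  const log = []
  for (const item of items) {
    await fakeClick(item, log) // each click finishes before the next starts
  }
  log.push('loop done')
  return log
}

// Usage:
//   withForEach(['a', 'b']).then(console.log) // 'loop done' is logged first
//   withForOf(['a', 'b']).then(console.log)   // 'loop done' is logged last
```

Only the for...of version guarantees that the code after the loop runs once every click has completed.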

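Solution:

Use page.$$(selector) to get the elements you want to click, and let a for...of loop click them one by one. You would also use delay or page.waitFor in the Node context.

Below is sample code: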

const elements = await page.$$(selector);
for(let elementHandle of elements){
  await elementHandle.click()
  await page.waitFor(10000)
}
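The update in the question reports that only some of the toggles get clicked. One possible cause (an assumption, since the site is not shown) is that more toggles are added to the page after the single page.$$ call. A hedged sketch that re-queries after each pass and skips elements it has already clicked. The helper name `clickAllToggles`, the `data-clicked` marker attribute, and the option names are illustrative, not Puppeteer APIs. It uses a plain Promise-based delay because page.waitFor was deprecated in later Puppeteer releases:

```javascript
// Promise-based pause that runs in the Node context.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Hypothetical helper: click every match of `selector`, re-querying until a
// pass finds nothing new or `maxPasses` is reached (guards against looping
// forever on a page that keeps adding toggles).
async function clickAllToggles(page, selector, { pauseMs = 1000, maxPasses = 20 } = {}) {
  // Only match toggles that have not been marked as clicked yet.
  const pending = `${selector}:not([data-clicked])`
  let total = 0
  for (let pass = 0; pass < maxPasses; pass++) {
    const handles = await page.$$(pending)
    if (handles.length === 0) break
    for (const handle of handles) {
      // Mark the element first so the next query skips it.
      await page.evaluate((el) => el.setAttribute('data-clicked', '1'), handle)
      await handle.click()
      total++
      await delay(pauseMs) // give the comments time to load
    }
  }
  return total
}

// Usage (selector taken from the question's updated code):
//   const clicked = await clickAllToggles(page, '._commentToggleBtn')
//   console.log(`Clicked ${clicked} toggles`)
```

If the site removes or replaces a toggle once it is clicked, the marker attribute is unnecessary but harmless. If the page throws when a marked element has already been detached, wrapping the click in try/catch would be a reasonable addition.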