Question

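I have written some small scripts in Node using Puppeteer that cyclically click the links to the different posts on a website's landing page.

The site link used in my scripts is a placeholder. Also, the pages are not dynamic, so Puppeteer may be overkill. However, my goal is to learn the logic of clicking.

When I run my first script, it clicks once and then throws the following error, because the click takes it away from the source page.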

const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch({headless:false});
    const [page] = await browser.pages();
    await page.goto("https://stackoverflow.com/questions/tagged/web-scraping",{waitUntil:'networkidle2'});
    await page.waitFor(".summary");
    const sections = await page.$$(".summary");

    for (const section of sections) {
        await section.$eval(".question-hyperlink", el => el.click())
    }

    await browser.close();
})();

The error that the above script encounters:

(node:9944) UnhandledPromiseRejectionWarning: Error: Execution context was destroyed, most likely because of a navigation.

When I run the following script, it appears to click once (it actually does not) and then hits the same error as before.

const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch({headless:false});
    const [page] = await browser.pages();
    await page.goto("https://stackoverflow.com/questions/tagged/web-scraping");

    await page.waitFor(".summary .question-hyperlink");
    const sections = await page.$$(".summary .question-hyperlink");

    for (let i=0, lngth = sections.length; i < lngth; i++) {
        await sections[i].click();
    }

    await browser.close();
})();

The error that the above script throws:

(node:10128) UnhandledPromiseRejectionWarning: Error: Execution context was destroyed, most likely because of a navigation.

How can I make my script perform the clicks cyclically?

Answer 1

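Problem:

Execution context was destroyed, most likely because of a navigation.

The error means that you tried to click a link, or do something else, on a page that no longer exists, most likely because you navigated away from it. Element handles returned by page.$$ belong to the page they were found on, so they stop working as soon as a click loads a different page.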
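
To make that failure mode concrete, here is a minimal sketch that is not part of the original scripts. The helper names isContextDestroyed and clickIfAlive are hypothetical, and the check only matches the message text quoted above; other Puppeteer versions may word stale-handle errors differently.

```javascript
// Hypothetical helper: true when an error carries the message quoted above.
function isContextDestroyed(err) {
  return err instanceof Error &&
    err.message.includes("Execution context was destroyed");
}

// Hypothetical wrapper around ElementHandle.click(): resolves to false
// instead of throwing when the handle belongs to a page that is gone.
async function clickIfAlive(handle) {
  try {
    await handle.click();
    return true;
  } catch (err) {
    if (isContextDestroyed(err)) return false;
    throw err; // anything else is a real bug, so let it surface
  }
}
```

A loop that gets false back knows its remaining handles are stale and has to query the page again before it can continue.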

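Logic:

Think of the Puppeteer script as a real human browsing a real page.

First, we load the URL (https://stackoverflow.com/questions/tagged/web-scraping).

Next, we want to go through all the questions asked on that page. What would we normally do? We would do one of the following:

Open one link in a new tab. Focus on that new tab, finish our work, and come back to the original tab. Continue with the next link.
Click a link, do our work, go back to the previous page, and continue with the next one.

Both of them involve moving away from the current page and coming back to it.

If you do not follow this flow, you will get the error message above.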

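Solution

There are at least four ways to solve this problem. I will cover the simplest one and the most complex one.

Method: Link extraction

First, we extract all the links on the current page.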

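// $$eval passes an array of elements to the callback, so map over it.
// The selector is the one used in the question.
const links = await page.$$eval(".summary .question-hyperlink", elements => elements.map(el => el.href));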

This gives us a list of URLs. We can open a new tab for each link.

for(let link of links){
  const newTab = await browser.newPage();
  await newTab.goto(link);
  // do the stuff
  await newTab.close();
}

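This goes through the links one by one. We could improve on it with something like Promise.map (available in libraries such as Bluebird) or one of the various queue libraries, but you get the idea.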
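
As a rough illustration of that improvement without any extra library, the sketch below keeps a fixed number of tabs open at a time. The names mapWithLimit and visitAll, the default limit of 3, and the choice to return each page title are all assumptions made for the example; the browser argument is expected to be the object returned by puppeteer.launch().

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results come back in the same order as `items`.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  // Each runner keeps pulling the next unclaimed index until none are left.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}

// Hypothetical usage: open each link in its own tab, a few tabs at a time.
async function visitAll(browser, links, limit = 3) {
  return mapWithLimit(links, limit, async link => {
    const tab = await browser.newPage();
    try {
      await tab.goto(link);
      // do the stuff; here we just read the title
      return await tab.title();
    } finally {
      await tab.close(); // close the tab even if the visit failed
    }
  });
}
```

Because every link gets its own tab, the tab that holds the question list never navigates, so nothing on it goes stale.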

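Method: Going back to the main page

We need to store the state somehow so that we know which link we visited last. If we visited the third question and came back to the tag page, we need to visit the fourth question next, and so on.

Check the following code.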

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  await page.goto(
    `https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&pagesize=15`
  );

  const visitLink = async (index = 0) => {
    // waitForSelector replaces the deprecated page.waitFor
    await page.waitForSelector("div.summary > h3 > a");

    // extract the links to click, we need this every time
    // because the context will be destroyed once we navigate
    const links = await page.$$("div.summary > h3 > a");
    // assuming there are 15 questions on one page,
    // we will stop on 16th question, since that does not exist
    if (links[index]) {
      console.log("Clicking ", index);

      // no await inside the array: both promises must be created
      // before either one is awaited
      await Promise.all([
        // start waiting for the navigation before the click triggers it
        page.waitForNavigation(),

        // click the link at the current index
        page.evaluate(element => {
          element.click();
        }, links[index])
      ]);

      // then wait for the post body to make sure we are on the question page
      await page.waitForSelector(".post-text");

      const currentPage = await page.title();
      console.log(index, currentPage);

      // go back and visit next link
      await page.goBack({ waitUntil: "networkidle0" });
      return visitLink(index + 1);
    }
    console.log("No links left to click");
  };

  await visitLink();

  await browser.close();
})();
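
The same flow can also be written as a loop instead of a recursive function. The version below is only a sketch under some assumptions: the name visitEachByIndex and the onVisit callback are hypothetical, and it clicks through the element handle rather than through page.evaluate. The page argument is expected to be a Puppeteer Page.

```javascript
// Click every element matching `selector`, one per round trip:
// query, click, do the work, go back, query again.
async function visitEachByIndex(page, selector, onVisit) {
  let index = 0;
  while (true) {
    await page.waitForSelector(selector);
    // Re-query on every pass: handles from the previous pass died
    // together with the page they were found on.
    const links = await page.$$(selector);
    if (index >= links.length) break; // no links left to click

    await Promise.all([
      page.waitForNavigation(), // start listening before the click
      links[index].click(),
    ]);

    await onVisit(page, index); // do the work on the question page
    await page.goBack();
    index += 1;
  }
  return index; // number of links visited
}
```

The only state carried between passes is the index, which is what lets the loop pick up at the next question after each return to the tag page.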

EDIT: There are multiple questions similar to this one. I will reference them in case you want to learn more.

Answer 2

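Rather than cyclically clicking all the links, I find it better to parse all the links first and then navigate to each of them reusing the same browser. Give it a go: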

const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch({headless:false});
    const [page] = await browser.pages();
    const base = "https://stackoverflow.com";
    await page.goto("https://stackoverflow.com/questions/tagged/web-scraping");
    let links = [];
    // waitForSelector replaces the deprecated page.waitFor
    await page.waitForSelector(".summary .question-hyperlink");
    const sections = await page.$$(".summary .question-hyperlink");

    for (const section of sections) {
        const clink = await page.evaluate(el=>el.getAttribute("href"), section);
        links.push(`${base}${clink}`);
    }

    for (const link of links) {
        await page.goto(link);
        await page.waitForSelector('h1 > a');
    }
    await browser.close();
})();
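
One detail worth noting about the script above: joining base and the href attribute as strings assumes every href is a site-relative path beginning with "/". A slightly more defensive sketch uses the standard URL constructor, which handles relative and absolute values alike. The helper name toAbsolute is hypothetical.

```javascript
// Resolve an href attribute against a base URL.
// Relative paths are joined to the base; absolute URLs pass through unchanged.
function toAbsolute(href, base = "https://stackoverflow.com") {
  return new URL(href, base).href;
}
```

Inside the loop this would replace the template string, for example links.push(toAbsolute(clink)). Reading the el.href property instead of getAttribute("href") is another option, since the browser resolves that property to an absolute URL.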
