Question

I'm using Puppeteer to evaluate the HTML of JavaScript-based web pages in my test app.

This is the code I use to make sure all the data has loaded:

await page.setRequestInterception(true);
page.on("request", (request) => {
  if (request.resourceType() === "image" || request.resourceType() === "font" || request.resourceType() === "media") {
    console.log("Request intercepted! ", request.url(), request.resourceType());
    request.abort();
  } else {
    request.continue();
  }
});
try {
  await page.goto(url, { waitUntil: ['networkidle0', 'load'], timeout: requestCounterMaxWaitMs });
} catch (e) {
  // navigation errors (including timeouts) are silently ignored here
}

Is this the best way to wait for ajax requests to complete?

It feels off, but I'm not sure whether I should use networkidle0, networkidle1, etc.?

Answer 1

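XHRs, by their nature, can appear late in an app's lifetime. If the app sends an XHR after, say, 1 second and you want to wait for it, no networkidle0 will help you. I think that if you want to do this "properly", you should know which requests you are waiting for and await them.

Here is an example where the XHRs happen later in the app, and it waits for all of them: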

const puppeteer = require('puppeteer');

const html = `
<html>
  <body>
    <script>
      setTimeout(() => {
        fetch('https://swapi.co/api/people/1/');
      }, 1000);

      setTimeout(() => {
        fetch('https://www.metaweather.com/api/location/search/?query=san');
      }, 2000);

      setTimeout(() => {
        fetch('https://api.fda.gov/drug/event.json?limit=1');
      }, 3000);
    </script>
  </body>
</html>`;

// you can observe just a subset of the requests;
// in this example I'm waiting for all of them
const requests = [
    'https://swapi.co/api/people/1/',
    'https://www.metaweather.com/api/location/search/?query=san',
    'https://api.fda.gov/drug/event.json?limit=1'
];

const waitForRequests = (page, names) => {
  const requestsList = [...names];
  return new Promise(resolve =>
     page.on('request', request => {
       // fetch() calls are reported as "fetch" and XMLHttpRequest as "xhr"
       if (["xhr", "fetch"].includes(request.resourceType())) {
         // check if request is in observed list
         const index = requestsList.indexOf(request.url());
         if (index > -1) {
           requestsList.splice(index, 1);
         }

         // once every observed request has been issued
         if (!requestsList.length) {
           resolve();
         }
       }
       request.continue();
     })
  );
};


(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setRequestInterception(true);

  // register page.on('request') observables
  const observedRequests = waitForRequests(page, requests);

  // goto is deliberately not awaited here because only the XHR (ajax)
  // requests matter; awaiting it would work too
  page.goto(`data:text/html,${html}`);

  console.log('before xhr');
  // await for all observed requests
  await observedRequests;
  console.log('after all xhr');
  await browser.close();
})();
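
Note that the example above resolves as soon as the observed requests have been *issued*. If you need them to have completed, one variation is to listen for `requestfinished` and `requestfailed` instead, which also works without request interception. A minimal sketch, assuming `page` is a Puppeteer page (or anything with the same `on`/`off` event interface); `waitForRequestsToFinish` is a hypothetical helper name, not part of Puppeteer:

```javascript
// Hypothetical helper: resolves once every URL in `urls` has been requested
// and that request has completed (successfully or with a failure).
function waitForRequestsToFinish(page, urls, timeoutMs = 30000) {
  const remaining = new Set(urls);
  if (remaining.size === 0) return Promise.resolve();

  return new Promise((resolve, reject) => {
    // Remove listeners and the timer so nothing leaks after settling.
    const cleanup = () => {
      clearTimeout(timer);
      page.off('requestfinished', onDone);
      page.off('requestfailed', onDone);
    };

    const onDone = (request) => {
      remaining.delete(request.url());
      if (remaining.size === 0) {
        cleanup();
        resolve();
      }
    };

    const timer = setTimeout(() => {
      cleanup();
      reject(new Error(`Timed out waiting for: ${[...remaining].join(', ')}`));
    }, timeoutMs);

    page.on('requestfinished', onDone);
    page.on('requestfailed', onDone);
  });
}
```

As in the example above, create the promise before calling `page.goto(...)` and await it afterwards, so that no early request is missed.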

Answer 2

You can use pending-xhr-puppeteer, a library that exposes a promise which resolves once all pending XHR requests have finished.

Use it like this:

const puppeteer = require('puppeteer');
const { PendingXHR } = require('pending-xhr-puppeteer');

(async () => {
  const args = []; // placeholder: put your Chromium launch flags here
  const browser = await puppeteer.launch({
    headless: true,
    args,
  });

  const page = await browser.newPage();
  const pendingXHR = new PendingXHR(page);
  await page.goto(`http://page-with-xhr`);
  // At this point some XHR requests may still be pending
  await pendingXHR.waitForAllXhrFinished();
  // At this point all XHR requests have finished
  await browser.close();
})();
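
If you would rather not add a dependency, the same idea can be sketched by hand: count the XHR/fetch requests in flight and resolve once none have been pending for a short quiet window. This is only an illustration of the technique, not the library's actual implementation; `waitForXhrIdle`, `idleMs` and `timeoutMs` are hypothetical names, and `page` is assumed to be a Puppeteer page (or any object with the same `on`/`off` event interface):

```javascript
// Hypothetical helper: resolves once no XHR/fetch request has been in
// flight for `idleMs` milliseconds; rejects after `timeoutMs`.
function waitForXhrIdle(page, { idleMs = 500, timeoutMs = 30000 } = {}) {
  const isAjax = (request) => ['xhr', 'fetch'].includes(request.resourceType());
  const pending = new Set();

  return new Promise((resolve, reject) => {
    let idleTimer = null;

    const cleanup = () => {
      clearTimeout(idleTimer);
      clearTimeout(deadline);
      page.off('request', onStart);
      page.off('requestfinished', onEnd);
      page.off('requestfailed', onEnd);
    };

    // (Re)start the quiet window; it only completes if nothing interrupts it.
    const armIdleTimer = () => {
      clearTimeout(idleTimer);
      idleTimer = setTimeout(() => { cleanup(); resolve(); }, idleMs);
    };

    const onStart = (request) => {
      if (!isAjax(request)) return; // images, fonts, etc. are ignored
      pending.add(request);
      clearTimeout(idleTimer); // new activity cancels the quiet window
    };

    const onEnd = (request) => {
      if (!pending.delete(request)) return; // not a tracked request
      if (pending.size === 0) armIdleTimer();
    };

    const deadline = setTimeout(() => {
      cleanup();
      reject(new Error(`Timed out after ${timeoutMs} ms waiting for XHR idle`));
    }, timeoutMs);

    page.on('request', onStart);
    page.on('requestfinished', onEnd);
    page.on('requestfailed', onEnd);
    armIdleTimer(); // nothing is pending yet, so start the first window
  });
}
```

Create the promise before the action that triggers the requests (for example `const idle = waitForXhrIdle(page); await page.goto(url); await idle;`), and pick an `idleMs` longer than the largest gap you expect between requests, otherwise it can resolve before a late XHR is sent.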

Disclaimer: I am the maintainer of pending-xhr-puppeteer.

Answer 3

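I agree with the sentiment in this answer that waiting for all network activity to cease ("all data is loaded") is a rather ambiguous concept that depends entirely on the behavior of the website you're scraping.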

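Options for detecting a response include waiting for a fixed duration, for a fixed duration after network traffic goes idle, for a specific response (or set of responses), for an element to appear on the page, for a predicate to return true, and so on, all of which Puppeteer supports.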
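
As an illustration of the element and predicate options, here is a minimal sketch that combines `page.waitForSelector` and `page.waitForFunction`. The helper name `waitForRenderedResults`, the `#results li` selector and the item count are assumptions made up for this example; substitute whatever your page actually renders once its data has arrived:

```javascript
// Hypothetical helper: waits until the page has rendered at least
// `minItems` elements matching `selector`.
async function waitForRenderedResults(
  page,
  { selector = '#results li', minItems = 1, timeout = 10000 } = {}
) {
  // Wait until at least one matching element is present in the DOM.
  await page.waitForSelector(selector, { timeout });

  // Then wait until a predicate evaluated inside the page returns truthy.
  // Extra arguments after the options object are passed to the predicate.
  await page.waitForFunction(
    (sel, min) => document.querySelectorAll(sel).length >= min,
    { timeout },
    selector,
    minItems
  );
}
```

Waiting on what the page renders is often more robust than watching the network, since it does not matter how many requests were needed to produce the result.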

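With this in mind, the most typical scenario is that you're waiting for a particular response, or set of responses, from known (or partially known, via a pattern or prefix) resource URLs that will deliver a payload you want to read and/or trigger a DOM interaction you need to detect. Puppeteer offers page.waitForResponse for this.

Here's an example that builds on an existing post (and shows how to retrieve the data from the response while we're at it):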

{{1}}
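
For reference, a minimal sketch of the `page.waitForResponse` pattern, not necessarily the code this answer originally showed. `gotoAndCaptureJson`, `pageUrl` and `apiPrefix` are hypothetical names, and the sketch assumes the endpoint returns JSON:

```javascript
// Hypothetical helper: navigates to `pageUrl` and returns the parsed JSON
// body of the first successful response whose URL starts with `apiPrefix`.
async function gotoAndCaptureJson(page, pageUrl, apiPrefix, timeout = 30000) {
  const [response] = await Promise.all([
    // Start waiting BEFORE navigating so that a fast response is not missed.
    page.waitForResponse(
      (res) =>
        res.url().startsWith(apiPrefix) &&
        res.request().method() !== 'OPTIONS' && // skip CORS preflights
        res.ok(),
      { timeout }
    ),
    page.goto(pageUrl, { waitUntil: 'domcontentloaded' }),
  ]);

  return response.json();
}
```

The same shape works for responses triggered by a click or by typing: replace the `page.goto(...)` call with the interaction that fires the request.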

Is there a way to make Puppeteer's waitUntil "networkidle" only consider XHR (ajax) requests?
