Question

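I'm using Node.js and Puppeteer to scrape a Google page: the user enters a stock ticker, I open the Google search URL for it, and then I scrape that stock's variation. Sometimes it works, but sometimes I get this error: Error: Evaluation failed: TypeError: Cannot read property 'textContent' of null.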

I have already tried the waitForSelector function, which times out, and using waitUntil: "domcontentloaded" doesn't work either. What should I do?

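Here is a sample of my code that doesn't work (there are 3 possible elements, depending on whether the variation is up, down, or zero, which is why there are 2 conditionals):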

const browser = await puppeteer.launch({ args: ["--no-sandbox"] });
const page = await browser.newPage();
const ticker = fundParser(fund);
const url = "https://www.google.com/search?q=" + ticker.ticker; //Ticker value could be rztr11, arct11 or rzak11
await page.goto(url,{ waitUntil: "networkidle2"});
console.log("Visiting " + url);

 // scrapes variation text. If positive or zero, the first scrape will be null, so there's a conditional for changing its value to the correct one
var variation = await page.$(
    "#knowledge-finance-wholepage__entity-summary > div > g-card-section > div > g-card-section > div.wGt0Bc > div:nth-child(1) > span.WlRRw.IsqQVc.fw-price-dn > span:nth-child(1)"
);
if (variation == null) {
  variation = await page.$(
    "#knowledge-finance-wholepage__entity-summary > div > g-card-section > div > g-card-section > div.wGt0Bc > div:nth-child(1) > span.WlRRw.IsqQVc.fw-price-up > span:nth-child(1)"
  );
  if (variation == null) {
    variation = await page.$(
    "#knowledge-finance-wholepage__entity-summary > div > g-card-section > div > g-card-section > div.wGt0Bc > div:nth-child(1) > span.WlRRw.IsqQVc.fw-price-nc > span:nth-child(1)"
    );
}}
console.log("Extracting fund variation");
const variationText = await page.evaluate(
  (variation1) => variation1.textContent,
  variation
);
console.log("Extracted:" + variationText);

Answer 1

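A few things:

1. Your selector is too brittle: if Google changes anything along the path it describes, the whole thing breaks. You need to simplify the selector.
2. You don't need the repeated null checks. You can match several alternatives at once by joining the selectors with a comma (,).
3. Ultimately, if your selector returns nothing, you need to decide how your application handles that error state.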

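Item 1 - Brittle selectors

For the first item, you definitely don't want mangled class names anywhere in your selector. These are classes like .wGt0Bc, .WlRRw, and .IsqQVc. I don't know what technology Google uses under the hood, but it looks like they are using some CSS-in-JS solution, which means these odd class names are generated and will likely change over time. Using them as selectors therefore means your Puppeteer script will need constant updating. If you avoid them instead, your Puppeteer code will keep working for longer.

I would recommend the following selector: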

#knowledge-finance-wholepage__entity-summary .fw-price-dn > span:first-child,
#knowledge-finance-wholepage__entity-summary .fw-price-up > span:first-child,
#knowledge-finance-wholepage__entity-summary .fw-price-nc > span:first-child

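Since these class names don't look generated, I would guess they will stay the same for longer.

Item 2 - Use a single selector

As mentioned above, you don't need to call page.$() repeatedly. You can just write one selector that matches multiple elements, the same way you would in CSS.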
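As a minimal sketch of that idea, the three alternatives can be assembled into one comma-separated selector list. The helper name `buildVariationSelector` is hypothetical (it is not part of Puppeteer); the root id and class names are the ones used in this answer.

```javascript
// Hypothetical helper: builds one CSS selector list from a shared root
// and the state classes (down, up, no change) used in this answer.
function buildVariationSelector(rootSelector, stateClasses) {
  return stateClasses
    .map((cls) => `${rootSelector} .${cls} > span:first-child`)
    .join(', ');
}

const variationSelector = buildVariationSelector(
  '#knowledge-finance-wholepage__entity-summary',
  ['fw-price-dn', 'fw-price-up', 'fw-price-nc']
);

// With a Puppeteer page, a single call then replaces the nested null checks:
//   const variation = await page.$(variationSelector);
```

A selector list matches an element if any of its comma-separated parts matches, so whichever of the three states is present on the page is the one that gets returned.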

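Item 3 - Error handling

Ultimately, your code fails because it doesn't handle the error state properly. It is up to you to decide how to handle it. In your sample code you are just logging, so you probably just want to log that you could not get the price variation for this ticker.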
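One way to make the "nothing matched" case explicit is sketched below, assuming you keep using page.$(). The helper name `readTextOrNull` is hypothetical. It relies on two Puppeteer behaviours: page.$() resolves to null when no element matches, and an ElementHandle has an evaluate() method. The error in the question comes from passing that null on to page.evaluate().

```javascript
// Hypothetical helper (not part of Puppeteer): returns the element's text,
// or null when the selector matches nothing.
async function readTextOrNull(page, selector) {
  // page.$() resolves to null when no element matches the selector.
  const handle = await page.$(selector);
  if (handle === null) {
    return null; // let the caller decide how to treat "not found"
  }
  // Only read textContent once we know there is an element to read from.
  return handle.evaluate((el) => el.textContent);
}

// Possible usage inside an async function with a Puppeteer `page`:
//   const text = await readTextOrNull(page, variationSelector);
//   if (text === null) {
//     console.error('Could not get the price variation for this ticker.');
//   }
```

Returning null instead of throwing is just one design choice; throwing a descriptive error and catching it at the call site would work equally well.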

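Putting it all together

The page.waitForSelector() method returns an ElementHandle if it finds the element, and throws an error otherwise. So we can use it directly instead of page.$().

Here is some code that I was able to test locally and that appears to work.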

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
    const page = await browser.newPage();
    const ticker = fundParser(fund);
    const url = 'https://www.google.com/search?q=' + ticker.ticker; //Ticker value could be rztr11, arct11 or rzak11

    console.log('Visiting ' + url);
    await page.goto(url, { waitUntil: 'networkidle2' });

    const variation_selector = `#knowledge-finance-wholepage__entity-summary .fw-price-dn > span:first-child,
    #knowledge-finance-wholepage__entity-summary .fw-price-up > span:first-child,
    #knowledge-finance-wholepage__entity-summary .fw-price-nc > span:first-child`;

    try {
        console.log('Extracting fund variation');

        // The default timeout is 30 seconds (30000 ms). If no element is found by then, an error is thrown.
        const variation = await page.waitForSelector(variation_selector, { timeout: 30000 });

        const variationText = await page.evaluate(
            (variation1) => variation1.textContent,
            variation
        );
        console.log('Extracted: ' + variationText);
    } catch (err) {
        console.error('No variation element could be found.');
    }

    await browser.close();
})();

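Alternatively, you could grab the entire text of a section and parse it separately, rather than trying to parse individual parts of the DOM.

For example: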

const knowledge_summary_selector = '#knowledge-finance-wholepage__entity-summary > div > g-card-section';
let knowledge_summary_inner_text;
try {
    const knowledge_summary = await page.waitForSelector(knowledge_summary_selector);
    
    /**
     * Example value for `knowledge_summary_inner_text`:
     * "Market Summary > FI Imobiliario Riza Terrax unica\n106.05 BRL\n0.00 (0.00%)\nFeb 12, 6:06 PM GMT-3 ·Disclaimer\nBVMF: RZTR11\nFollow"
     */
    knowledge_summary_inner_text = await page.evaluate(
        (element) => element.innerText.toString().trim(),
        knowledge_summary
    );

    // Now, parse your `knowledge_summary_inner_text` via some means
    const knowledge_summary_pieces = knowledge_summary_inner_text.split('\n');
    // etc...
} catch (err) {
    console.error('...');
}

Here, knowledge_summary_inner_text looks like:

Market Summary > FI Imobiliario Riza Terrax unica
106.05 BRL
0.00 (0.00%)
Feb 12, 6:06 PM GMT-3 ·Disclaimer
BVMF: RZTR11
Follow

This content may now be easier to parse, for example after a .split('\n') and some further string matching.
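A rough sketch of such parsing follows. The function name `parseKnowledgeSummary`, the field names, and the regular expressions are all assumptions based only on the sample text above. Google's real output may be formatted differently (other number formats, arrows, extra lines), so treat this as a starting point rather than a reliable parser.

```javascript
// Hypothetical parser: field names and patterns are assumptions derived
// from the single sample string shown in this answer.
function parseKnowledgeSummary(innerText) {
  const lines = innerText
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  const result = { price: null, currency: null, change: null, percentChange: null };

  for (const line of lines) {
    // Price line, e.g. "106.05 BRL"
    const priceMatch = line.match(/^(\d[\d.,]*)\s+([A-Z]{3})$/);
    if (priceMatch && result.price === null) {
      result.price = priceMatch[1];
      result.currency = priceMatch[2];
      continue;
    }
    // Change line, e.g. "0.00 (0.00%)"; a leading sign is optional
    const changeMatch = line.match(/^([+\-−]?\d[\d.,]*)\s*\(([+\-−]?\d[\d.,]*)%\)/);
    if (changeMatch && result.change === null) {
      result.change = changeMatch[1];
      result.percentChange = changeMatch[2];
    }
  }
  return result;
}

const sample =
  'Market Summary > FI Imobiliario Riza Terrax unica\n106.05 BRL\n0.00 (0.00%)\n' +
  'Feb 12, 6:06 PM GMT-3 ·Disclaimer\nBVMF: RZTR11\nFollow';
const parsed = parseKnowledgeSummary(sample);
// parsed is { price: '106.05', currency: 'BRL', change: '0.00', percentChange: '0.00' }
```

The values are kept as strings on purpose, since decimal and thousands separators can vary with the locale of the results page. Fields that cannot be found stay null, so the caller can handle a missing value the same way as a missing element.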
