Question

I'm trying to select a NodeList of all the links on a website and console.log() it in the terminal. However, this fails for certain websites: google.com, facebook.com, instagram.com.

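I know the elements are there, because I can log them in the actual Chromium console that loads, using nothing but document.querySelectorAll('a'). But when I try to extract and log the links in my Node terminal, using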

const links = await page.evaluate(() => document.querySelectorAll('a'))
console.log(links)

I get undefined.

However, this is not the case for most websites, e.g. yahoo.com and linkedin.com, where my code works. Here it is:

const puppeteer = require('puppeteer');

const URL = 'https://instagram.com/';
const scrape = async () => {
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();
    await page.setViewport({
        width: 1240,
        height: 680
    });
    await page.goto(URL, { waitUntil: 'domcontentloaded' });
    await page.waitFor(6000);
    const links = await page.evaluate(() => document.querySelectorAll('a'));
    console.log(links);
    await page.screenshot({
        path: 'ig.png'
    });
    await browser.close();
};

scrape();

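I tried adding a bypassBotDetectionSystem() function as suggested in this article, but it did not help. I don't think bot detection is the problem, because, as I said, I can browse the content in Chromium without any trouble.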

Thanks for the help!

Answer 1

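You are trying to return DOM elements from page.evaluate, which is not possible: if the function passed to page.evaluate returns a non-Serializable value, page.evaluate resolves to undefined, which is what happens in your case.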
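To illustrate the point, here is a minimal sketch that keeps page.evaluate but returns plain strings instead of the NodeList, so the result can be serialized back to Node. The helper name collectHrefs is hypothetical, and page is assumed to be a Puppeteer Page that has already navigated to the target URL.

```javascript
// Sketch only: `collectHrefs` is a hypothetical helper name, and `page` is
// assumed to be a Puppeteer Page object that has already loaded the site.
async function collectHrefs(page) {
  return page.evaluate(() =>
    // This callback runs inside the browser page. Each <a> element is mapped
    // to its href string, so only serializable data crosses back to Node.
    Array.from(document.querySelectorAll('a'), a => a.href)
  );
}
```

In the question's script this would be called as `const links = await collectHrefs(page);` in place of the original page.evaluate line.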

If you want an array of ElementHandle objects, you can use the page.$$ method instead.

Example:

const links = await page.$$('a'); // returns <Promise<Array<ElementHandle>>>
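The handles returned by page.$$ still refer to elements living in the browser, so logging them will not print the link targets directly. As a rough sketch, assuming page is a Puppeteer Page and using the hypothetical helper name hrefsFromHandles, each handle's href can be read with getProperty followed by jsonValue:

```javascript
// Sketch only: `hrefsFromHandles` is a hypothetical helper name, and `page`
// is assumed to be a Puppeteer Page object that has already loaded the site.
async function hrefsFromHandles(page) {
  const handles = await page.$$('a'); // array of ElementHandle objects
  const hrefs = [];
  for (const handle of handles) {
    // getProperty returns a JSHandle for the element's `href` property.
    const property = await handle.getProperty('href');
    // jsonValue serializes that value (a string) back to Node.
    hrefs.push(await property.jsonValue());
  }
  return hrefs;
}
```

This makes one round trip per element, so it is mainly useful when the handles are needed for something else as well, such as clicking.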

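However, if you only want all the values of one attribute (for example href), you can use the page.$$eval method, which runs Array.from(document.querySelectorAll(selector)) within the page and passes the result as the first argument to pageFunction.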

Example:

const hrefs = await page.$$eval('a', links => links.map(link => link.href));
console.log(hrefs);

Answer 2

const hrefs = await page.$$eval('a', anchors => [].map.call(anchors, a => a.href));

Puppeteer: cannot extract elements on certain websites using page.evaluate(() => document.querySelectorAll())