Puppeteer error: Page crashed! Trying to get a page with a fully loaded body

Time: 2019-11-27 08:35:35

Tags: javascript node.js npm web-scraping puppeteer

When I scrape the page with request or axios, the body comes back with nothing in it.

<!DOCTYPE html><html>
<head>
<!--Deleted Head content -->
</head>
<body>
    <ui-root></ui-root>
<script type="text/javascript" src="https://d13fzx7h5ezopb.cloudfront.net/www/v479/product/inline.bundle.js"></script><script type="text/javascript" src="https://d13fzx7h5ezopb.cloudfront.net/www/v479/product/polyfills.bundle.js"></script><script type="text/javascript" src="https://d13fzx7h5ezopb.cloudfront.net/www/v479/product/vendor.bundle.js"></script><script type="text/javascript" src="https://d13fzx7h5ezopb.cloudfront.net/www/v479/product/main.bundle.js"></script>
</body>
</html>

I want to get the HTML of the fully loaded body, so I tried using Puppeteer.

I am running Puppeteer on Node v10.15.3. This is my Puppeteer code:

const browser = await puppeteer.launch({
  args: [
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--user-agent=Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
  ],
})
const page = await browser.newPage()
await page.goto(sourceUrl)
const htmlCode = await page.evaluate(() => document.body.innerHTML)
console.log(htmlCode)

I also tried each of the following:

await page.goto(sourceUrl, {waitUntil:"networkidle0"})

await page.waitForFunction('window.status==="ready"')

await page.waitFor(5000)

None of them seem to work. The result is an empty body, a timeout, or a page crash.

This is the error message:

(node:966) UnhandledPromiseRejectionWarning: Error: Page crashed!
    at Page._onTargetCrashed (/mnt/c/users/junsoo/desktop/pikk/node_modules/puppeteer/lib/Page.js:216:24)
    at CDPSession.Page.client.on.event (/mnt/c/users/junsoo/desktop/pikk/node_modules/puppeteer/lib/Page.js:124:56)
    at CDPSession.emit (events.js:189:13)
    at CDPSession.EventEmitter.emit (domain.js:441:20)
    at CDPSession._onMessage (/mnt/c/users/junsoo/desktop/pikk/node_modules/puppeteer/lib/Connection.js:200:12)
    at Connection._onMessage (/mnt/c/users/junsoo/desktop/pikk/node_modules/puppeteer/lib/Connection.js:112:17)
    at WebSocketTransport._ws.addEventListener.event (/mnt/c/users/junsoo/desktop/pikk/node_modules/puppeteer/lib/WebSocketTransport.js:44:24)
    at WebSocket.onMessage (/mnt/c/users/junsoo/desktop/pikk/node_modules/ws/lib/event-target.js:120:16)
    at WebSocket.emit (events.js:189:13)
    at WebSocket.EventEmitter.emit (domain.js:441:20)
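
The trace starts in Page._onTargetCrashed, which appears to emit an "error" event on the page object. When nothing listens for that event, Node's EventEmitter throws, which would explain why the crash shows up as an unhandled rejection. A minimal sketch of catching it instead follows; attachCrashLogging and runMain are hypothetical helper names, not part of Puppeteer, and the log format is an assumption.

```javascript
// Sketch only: report page crashes and rejected promises instead of
// leaving them unhandled. The helper names are hypothetical.

// `page` is expected to be a Puppeteer Page (or any EventEmitter).
function attachCrashLogging(page, log = console.error) {
  // Puppeteer emits "error" on the Page when the page crashes.
  page.on("error", (err) => log(`page crashed: ${err.message}`));
  // "pageerror" fires for uncaught exceptions thrown by the page's own scripts.
  page.on("pageerror", (err) => log(`uncaught exception in page: ${err.message}`));
  return page;
}

// Runs an async entry point and reports a rejection rather than letting it
// surface as an UnhandledPromiseRejectionWarning.
function runMain(main, log = console.error) {
  return main().catch((err) => {
    log(`scrape failed: ${err.message}`);
    process.exitCode = 1;
  });
}
```

With these in place, the scraping code would call attachCrashLogging(page) right after browser.newPage() and be started through runMain(async () => { ... }). This only makes the failure visible; it does not stop the page from crashing.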

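I am trying to scrape this page: https://www.29cm.co.kr/product/178591

1 answer:

Answer 0: (score: 0)

I spent several days debugging this error, and my solution was to launch Puppeteer with the following arguments: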

const launchOptions = {
  ignoreHTTPSErrors: true,
  args: [
    "--unlimited-storage",
    "--full-memory-crash-report",
    "--disable-gpu",
    "--ignore-certificate-errors",
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-dev-shm-usage",
    "--lang=en-US;q=0.9,en;q=0.8",
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
  ],
};
const browser = await puppeteer.launch(launchOptions);
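
For completeness, here is a rough sketch of how these launch options might be combined with the page code from the question. It is not a verified fix for this site. It assumes the app renders into the `<ui-root>` element shown in the question's HTML, it uses only a trimmed subset of the flags above (the answer does not establish which flag actually matters), and the names buildLaunchOptions, uiRootHasContent and fetchRenderedBody are made up for illustration.

```javascript
// Sketch only: launch flags from the answer plus a wait on <ui-root>.
// Assumes `npm install puppeteer`; all function names are hypothetical.

// Trimmed subset of the answer's options (which flags are needed is an assumption).
function buildLaunchOptions() {
  return {
    ignoreHTTPSErrors: true,
    args: [
      "--no-sandbox",
      "--disable-setuid-sandbox",
      "--disable-gpu",
      // Use a temporary directory instead of /dev/shm for shared memory,
      // which can be too small in some VM or container environments.
      "--disable-dev-shm-usage",
    ],
  };
}

// Runs inside the browser: true once <ui-root> has at least one child element.
// It must not reference anything from the Node side, because Puppeteer
// serializes the function and evaluates it in the page.
function uiRootHasContent() {
  const root = document.querySelector("ui-root");
  return Boolean(root && root.children.length > 0);
}

// `puppeteer` is passed in so the function can be exercised with a stub.
async function fetchRenderedBody(puppeteer, url) {
  const browser = await puppeteer.launch(buildLaunchOptions());
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle0" });
    // Wait for rendered content rather than a fixed delay.
    await page.waitForFunction(uiRootHasContent, { timeout: 30000 });
    return await page.evaluate(() => document.body.innerHTML);
  } finally {
    // Close the browser even if navigation or waiting fails.
    await browser.close();
  }
}

// Usage (requires the real puppeteer package):
// fetchRenderedBody(require("puppeteer"), "https://www.29cm.co.kr/product/178591")
//   .then(console.log)
//   .catch(console.error);
```

Waiting on `<ui-root>` replaces the `window.status==="ready"` check from the question, which only resolves if the page itself sets that value.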