When I scrape the page with request or axios, the body comes back empty:
<!DOCTYPE html><html>
<head>
<!--Deleted Head content -->
</head>
<body>
<ui-root></ui-root>
<script type="text/javascript" src="https://d13fzx7h5ezopb.cloudfront.net/www/v479/product/inline.bundle.js"></script>
<script type="text/javascript" src="https://d13fzx7h5ezopb.cloudfront.net/www/v479/product/polyfills.bundle.js"></script>
<script type="text/javascript" src="https://d13fzx7h5ezopb.cloudfront.net/www/v479/product/vendor.bundle.js"></script>
<script type="text/javascript" src="https://d13fzx7h5ezopb.cloudfront.net/www/v479/product/main.bundle.js"></script>
</body>
</html>
I want to get the HTML of the fully loaded body, so I tried using Puppeteer.
I am running Puppeteer on Node v10.15.3.
Here is my Puppeteer code:
const browser = await puppeteer.launch({
  args: [
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--user-agent=Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
  ],
})
const page = await browser.newPage()
await page.goto(sourceUrl)
htmlCode = await page.evaluate(() => document.body.innerHTML)
console.log(htmlCode)
I also tried each of the following:
await page.goto(sourceUrl, {waitUntil:"networkidle0"})
await page.waitForFunction('window.status==="ready"')
await page.waitFor(5000)
None of them seems to work. The result is an empty body, a timeout, or a page crash.
Here is the error message:
(node:966) UnhandledPromiseRejectionWarning: Error: Page crashed!
at Page._onTargetCrashed (/mnt/c/users/junsoo/desktop/pikk/node_modules/puppeteer/lib/Page.js:216:24)
at CDPSession.Page.client.on.event (/mnt/c/users/junsoo/desktop/pikk/node_modules/puppeteer/lib/Page.js:124:56)
at CDPSession.emit (events.js:189:13)
at CDPSession.EventEmitter.emit (domain.js:441:20)
at CDPSession._onMessage (/mnt/c/users/junsoo/desktop/pikk/node_modules/puppeteer/lib/Connection.js:200:12)
at Connection._onMessage (/mnt/c/users/junsoo/desktop/pikk/node_modules/puppeteer/lib/Connection.js:112:17)
at WebSocketTransport._ws.addEventListener.event (/mnt/c/users/junsoo/desktop/pikk/node_modules/puppeteer/lib/WebSocketTransport.js:44:24)
at WebSocket.onMessage (/mnt/c/users/junsoo/desktop/pikk/node_modules/ws/lib/event-target.js:120:16)
at WebSocket.emit (events.js:189:13)
at WebSocket.EventEmitter.emit (domain.js:441:20)
I am trying to scrape this page: https://www.29cm.co.kr/product/178591
Answer 0 (score: 0)
I spent several days debugging this error, and my solution was to launch Puppeteer with the following arguments:
const launchOptions = {
  ignoreHTTPSErrors: true,
  args: [
    "--unlimited-storage",
    "--full-memory-crash-report",
    "--disable-gpu",
    "--ignore-certificate-errors",
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-dev-shm-usage",
    "--lang=en-US;q=0.9,en;q=0.8",
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
  ],
};
const browser = await puppeteer.launch(launchOptions);
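For completeness, here is a rough sketch of how the launch options could be combined with an explicit wait for the client-side render. I have not tested it against this site. The helper names `buildLaunchOptions` and `scrapeRenderedBody` are made up for illustration, the flag list is a trimmed subset of the one above, and the `ui-root > *` selector assumes the app renders its content as children of the `<ui-root>` element shown in the question. My guess is that `--disable-dev-shm-usage` is the flag that matters for "Page crashed!", since Puppeteer's troubleshooting guide recommends it when `/dev/shm` is too small for Chrome, but that is an assumption about this environment. The option names follow the Puppeteer version used above.

```javascript
// Sketch only: helper names are hypothetical, not part of Puppeteer.

// Trimmed subset of the launch options from the answer above.
function buildLaunchOptions() {
  return {
    ignoreHTTPSErrors: true,
    args: [
      "--no-sandbox",
      "--disable-setuid-sandbox",
      // Keep Chrome's shared memory files out of /dev/shm, which can be
      // too small in containers and similar environments.
      "--disable-dev-shm-usage",
      "--disable-gpu",
    ],
  };
}

// `puppeteer` is passed in so the function can be exercised with a stub.
// `rootSelector` is an assumption based on the <ui-root> tag in the question.
async function scrapeRenderedBody(puppeteer, url, rootSelector = "ui-root") {
  const browser = await puppeteer.launch(buildLaunchOptions());
  try {
    const page = await browser.newPage();
    // The "error" event is emitted when the page crashes; log it so the
    // crash does not only show up as an unhandled rejection.
    page.on("error", (err) => console.error("Page crashed:", err.message));
    await page.goto(url, { waitUntil: "networkidle0", timeout: 60000 });
    // Wait until the app root has at least one rendered child element.
    await page.waitForSelector(`${rootSelector} > *`, { timeout: 60000 });
    return await page.evaluate(() => document.body.innerHTML);
  } finally {
    // Always release the browser, even if navigation or the wait fails.
    await browser.close();
  }
}
```

It would be called as `scrapeRenderedBody(require("puppeteer"), sourceUrl)` inside an async function, with a `try`/`catch` around the call so that a crash or timeout is handled instead of surfacing as an `UnhandledPromiseRejectionWarning`. Waiting on a selector is used here instead of `window.status === "ready"` because that condition only becomes true if the page itself sets `window.status`, which this page may never do.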