I want to collect all the reviews for a company from trustpilot.com using Puppeteer, but I have one small problem: I cannot get the content of the reviews!
Here is my code:
const puppeteer = require("puppeteer")
const getData = async () => {
    const browser = await puppeteer.launch({
        headless: true,
        args: ['--no-sandbox'],
        slowMo: 1000
    })
    const page = await browser.newPage()
    await page.goto("https://www.trustpilot.fr/review/ovh.com", {waitUntil: 'networkidle2'})
    await page.click("#onetrust-accept-btn-handler")
    const result = await page.evaluate(() => {
        window.scrollBy(0, window.innerHeight)
        let score = parseFloat(document.querySelector("body > main > div > div.placeholder > div > div.company-profile-header > section.company-summary > div > div.right-section > div > div.trustscore_container > p").innerText.replace(',', '.'))
        let reviewsElements = document.querySelectorAll("body > main > div > div.company-profile-body > section > div.review-list > div.review-card")
        let reviews = []
        reviewsElements.forEach(reviewElement => {
            reviews.push({
                title: reviewElement.querySelector("a.link").innerHTML,
                content: reviewElement.querySelector("p").innerHTML,
                user: reviewElement.querySelector("div.consumer-information__name").innerHTML.split("\n ")[1],
                stars: parseInt(reviewElement.innerHTML.match(/(?<=\"stars\"\:)\d+/)[0]),
                date: reviewElement.innerHTML.match(/(?<=datetime=")\S+(?=\")/)[0]
            })
        });
        return {
            score: score,
            reviews: reviews
        }
    })
    await browser.close()
    return result
}
getData().then(value => {
    console.log(value)
})
The problem is on this line:
content: reviewElement.querySelector("p").innerHTML,
Stacktrace:
[2020-08-20 13:17:49]: ERROR Unhandled rejection: Error: Evaluation failed: TypeError: Cannot read property 'innerHTML' of null
at reviewsElements.forEach.reviewElement (<anonymous>:9:58)
at NodeList.forEach (<anonymous>)
at <anonymous>:6:25
Error: Evaluation failed: TypeError: Cannot read property 'innerHTML' of null
at reviewsElements.forEach.reviewElement (<anonymous>:9:58)
at NodeList.forEach (<anonymous>)
at <anonymous>:6:25
at ExecutionContext.evaluateHandle (/home/container/node_modules/puppeteer/lib/ExecutionContext.js:88:13)
at async ExecutionContext.evaluate (/home/container/node_modules/puppeteer/lib/ExecutionContext.js:46:20)
at async getData (/home/container/index.js:95:20)
The strange thing is that the p element is there when I print reviewElement.innerHTML. Thank you for your help!
Regards, Arnaud L.
Answer 0 (score: 1)
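This happens because of the review dated 30 juil. 2020 (30 July 2020): it has no content, and therefore no <p> paragraph element.
If you run this loop inside page.evaluate in headful mode (headless: false):
reviewsElements.forEach(reviewElement => { console.log(reviewElement.querySelector('p')); });
the browser console shows the following output: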
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
null
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
...
So, clearly, you have to check for null before accessing the innerHTML of the <p> element:
reviewsElements.forEach(reviewElement => {
    reviews.push({
        title: reviewElement.querySelector("a.link").innerHTML,
        content: reviewElement.querySelector("p") ? reviewElement.querySelector("p").innerHTML : "",
        user: reviewElement.querySelector("div.consumer-information__name").innerHTML.split("\n ")[1],
        stars: parseInt(reviewElement.innerHTML.match(/(?<=\"stars\"\:)\d+/)[0]),
        date: reviewElement.innerHTML.match(/(?<=datetime=")\S+(?=\")/)[0]
    })
});
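The same null check can be factored into a small helper so that any field with a missing element falls back to a default instead of throwing. The sketch below is only an illustration: innerHTMLOrDefault is a hypothetical name, not part of Puppeteer or the DOM API, and the two review objects are hand-written stand-ins for DOM elements so that the snippet runs in plain Node without a browser. In the real scraper the helper would have to be declared inside the page.evaluate callback, because functions defined in the Node process are not visible in the page context.

```javascript
// Hypothetical helper: returns the innerHTML of the first match for
// `selector` under `root`, or `fallback` when nothing matches.
function innerHTMLOrDefault(root, selector, fallback = "") {
    const element = root.querySelector(selector);
    return element ? element.innerHTML : fallback;
}

// Stand-ins for two review cards (assumed shapes, not real DOM nodes):
// one with a <p> body and one without, like the review that printed null.
const reviewWithText = {
    querySelector: selector => (selector === "p" ? { innerHTML: "Great host" } : null)
};
const reviewWithoutText = {
    querySelector: () => null
};

const contents = [reviewWithText, reviewWithoutText].map(
    review => innerHTMLOrDefault(review, "p")
);
console.log(contents);
```

Inside page.evaluate the call would look like content: innerHTMLOrDefault(reviewElement, "p"), which also avoids querying the same selector twice.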