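How to scrape Trustpilot reviews?

Asked: 2020-08-20 11:36:05

Tags: web-scraping puppeteer

I want to collect all the reviews of a company from trustpilot.com (using Puppeteer), but I have a small problem: I can't get the content of the reviews!

Here is my code: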

const puppeteer = require("puppeteer")

const getData = async () => {
    const browser = await puppeteer.launch({
        headless: true,
        args: ['--no-sandbox'],
        slowMo: 1000
    })
    const page = await browser.newPage()

    await page.goto("https://www.trustpilot.fr/review/ovh.com", {waitUntil: 'networkidle2'})
    await page.click("#onetrust-accept-btn-handler")
    const result = await page.evaluate(() => {
        window.scrollBy(0, window.innerHeight)
        let score = parseFloat(document.querySelector("body > main > div > div.placeholder > div > div.company-profile-header > section.company-summary > div > div.right-section > div > div.trustscore_container > p").innerText.replace(',', '.'))
        let reviewsElements = document.querySelectorAll("body > main > div > div.company-profile-body > section > div.review-list > div.review-card")
        let reviews = []
        reviewsElements.forEach(reviewElement => {
            reviews.push({
                title: reviewElement.querySelector("a.link").innerHTML,
                content: reviewElement.querySelector("p").innerHTML,
                user: reviewElement.querySelector("div.consumer-information__name").innerHTML.split("\n ")[1],
                stars: parseInt(reviewElement.innerHTML.match(/(?<=\"stars\"\:)\d+/)[0]),
                date: reviewElement.innerHTML.match(/(?<=datetime=")\S+(?=\")/)[0]
            })
        });

        return {
            score: score,
            reviews: reviews
        }
    })

    await browser.close()
    return result
}

getData().then(value => {
    console.log(value)
})

The problem is on this line: content: reviewElement.querySelector("p").innerHTML,

Stack trace:

[2020-08-20 13:17:49]: ERROR Unhandled rejection: Error: Evaluation failed: TypeError: Cannot read property 'innerHTML' of null
at reviewsElements.forEach.reviewElement (<anonymous>:9:58)
at NodeList.forEach (<anonymous>)
at <anonymous>:6:25
Error: Evaluation failed: TypeError: Cannot read property 'innerHTML' of null
at reviewsElements.forEach.reviewElement (<anonymous>:9:58)
at NodeList.forEach (<anonymous>)
at <anonymous>:6:25
at ExecutionContext.evaluateHandle (/home/container/node_modules/puppeteer/lib/ExecutionContext.js:88:13)
at async ExecutionContext.evaluate (/home/container/node_modules/puppeteer/lib/ExecutionContext.js:46:20)
at async getData (/home/container/index.js:95:20)

The odd thing is that the p element is there when I print reviewElement.innerHTML. Thanks for your help!

Regards, Arnaud L.

1 answer:

Answer 0 (score: 1)

This is caused by the review dated 30 juil. 2020: it has no content, so its card does not contain a <p> paragraph element.

If you run in headful mode with reviewsElements.forEach(reviewElement => { console.log(reviewElement.querySelector('p')); }); enabled and check the browser console, you will get the following output:

<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
null
<p class="review-content__text">...</p>
<p class="review-content__text">...</p>
...

So you have to check for null before accessing the innerHTML of the <p>:

reviewsElements.forEach(reviewElement => {
    reviews.push({
        title: reviewElement.querySelector("a.link").innerHTML,
        content: !!reviewElement.querySelector("p") ? reviewElement.querySelector("p").innerHTML : "",
        user: reviewElement.querySelector("div.consumer-information__name").innerHTML.split("\n ")[1],
        stars: parseInt(reviewElement.innerHTML.match(/(?<=\"stars\"\:)\d+/)[0]),
        date: reviewElement.innerHTML.match(/(?<=datetime=")\S+(?=\")/)[0]
    })
});
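
The same null problem could show up for the other fields too, since querySelector returns null when nothing matches and String.prototype.match returns null when the pattern is not found. As a rough sketch only, the check can be pulled into two small helpers. The names innerHTMLOr and firstMatchOr are made up for this example and are not part of Puppeteer or the DOM API, and the stand-in objects below only mimic querySelector so the snippet runs without a browser:

```javascript
// Hypothetical helper: innerHTML of the first descendant matching `selector`,
// or `fallback` when no such element exists.
function innerHTMLOr(root, selector, fallback = "") {
    const element = root.querySelector(selector)
    return element ? element.innerHTML : fallback
}

// Hypothetical helper: first regex match in `text`, or `fallback` when
// the pattern does not match.
function firstMatchOr(text, pattern, fallback = null) {
    const match = text.match(pattern)
    return match ? match[0] : fallback
}

// Stand-in objects (assumptions for this demo, not real DOM elements).
const cardWithText = { querySelector: () => ({ innerHTML: "Great service" }) }
const cardWithoutText = { querySelector: () => null }

console.log(innerHTMLOr(cardWithText, "p"))
console.log(JSON.stringify(innerHTMLOr(cardWithoutText, "p")))
```

To use these in the scraper, they would have to be declared inside the page.evaluate callback, because functions defined in the Node.js scope are not available in the page context. The content field would then read content: innerHTMLOr(reviewElement, "p").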