Question

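I want to output the body text of every review from the site. I get the correct output for the first page, but if there are 4 pages of reviews, I get the text from the first page 4 times. How do I make sure the scraper moves on to the next page each time?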

import lxml.html as html
import math
import csv
import requests
import re
import time

# Trustpilot review page
basePage = 'http://www.trustpilot.com/review/'
reviewSite = 'www.boo-hoo.com'
reviewPage = basePage + reviewSite

# Data file to save to
datafile = 'datascrap.csv'

# Trustpilot default
resultsPerPage = 20

print('Scraper set for ' + reviewPage + ' - saving result to ' + datafile)

# Get page, skipping HTTPS as it gives certificate errors
page = requests.get(reviewPage, verify=False)
tree = html.fromstring(page.content)

# Total amount of ratings
ratingCount = tree.xpath('//h2[@class="header--inline"]')
ratingCount = ratingCount[0].text.replace(',','')
ratingCount = ratingCount.replace(u'\xa0', u'')
ratingCount = ratingCount.replace(u'\n', u'')
ratingCount = ratingCount.replace(u'Average', u'')
ratingCount = ratingCount.replace(u' ', '')
ratingCount = ratingCount.replace(u'•', '')
ratingCount = ratingCount.replace(u'Great', '')
ratingCount = int(ratingCount)

# Amount of chunks to consider for displaying processing output
# For ex. 10 means output progress for every 10th of the data
tot_chunks = 20

# Throttling to avoid spamming page with requests
# With sleepTime seconds between every page request
throttle = True
sleepTime = 2

# Total pages to scrape
pages = math.ceil(ratingCount / resultsPerPage)
print('Found total of ' + str(pages) + ' pages to scrape')

with open(datafile, 'w', newline='', encoding='utf8') as csvfile:
    # Tab delimited to allow for special characters
    datawriter = csv.writer(csvfile, delimiter='\t')
    print('Processing..')
    for i in range(1, pages + 1):

        if (throttle): time.sleep(sleepTime)

        page = requests.get(reviewPage + '?page=' + str(i))
        tree = html.fromstring(page.content)

        # The item below scrapes a review body.
        bodies = tree.xpath('//p[@class="review-content__text"]')

        for idx, e in enumerate(bodies):
            # Progress counting, outputs for every processed chunk
            reviewNumber = idx + resultsPerPage * (i - 1) + 1
            # Guard against a zero chunk size when there are fewer ratings than chunks
            chunk = max(1, ratingCount // tot_chunks)
            if reviewNumber % chunk == 0:
                print('Processed ' + str(reviewNumber) + '/' + str(ratingCount) + ' ratings')

            # Body of comment
            body = e.text_content().strip()
            datawriter.writerow([body])
    print('Processed ' + str(ratingCount) + '/' + str(ratingCount) + ' ratings.. Finished!')

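For example, if the site has 80 reviews, I get the first 20 reviews four times, yet when I print the page number on each iteration it shows 1, 2, 3, and so on.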

Answer 1

reviewSite is incorrect. Change reviewSite = 'www.boo-hoo.com' to reviewSite = 'boo-hoo.com'.

If you go to page 2 in your browser, you will see:

https://www.trustpilot.com/review/boo-hoo.com?page=2

But you are concatenating www.boo-hoo.com, so the script incorrectly tries to go to:

https://www.trustpilot.com/review/www.boo-hoo.com?page=2

That request then falls back to the first page, which is why every iteration returns the same 20 reviews.
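
As a rough sketch of how to guard against this, the helpers below build the page URL without the leading "www." and flag a page whose review texts are identical to the previous page's. The names build_page_url and is_repeated_page are made up for illustration, and stripping "www." is an assumption based on the URL shown above for this one site; other sites may be listed differently, so check the address in the browser first.

```python
from urllib.parse import urlencode

BASE_PAGE = "https://www.trustpilot.com/review/"


def build_page_url(site, page, base=BASE_PAGE):
    """Return the review URL for `site` at a 1-based page number."""
    # Assumption: the review path uses the bare domain, as in the
    # browser URL quoted in this answer, so a leading "www." is dropped.
    if site.startswith("www."):
        site = site[len("www."):]
    return base + site + "?" + urlencode({"page": page})


def is_repeated_page(previous_bodies, current_bodies):
    """True when a non-empty page has exactly the same texts as the previous one."""
    return bool(current_bodies) and list(previous_bodies) == list(current_bodies)


if __name__ == "__main__":
    print(build_page_url("www.boo-hoo.com", 2))
```

Inside the scraping loop, you could keep the list of review texts from the previous iteration and stop (or log a warning) as soon as is_repeated_page returns True, instead of silently writing the same 20 reviews again.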
