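I want to output the body of every review from a website. I get the correct output for the first page, but if there are 4 pages of reviews, I get the text from the first page 4 times. How can I make sure the scraper moves to the next page each time?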
import lxml.html as html
import math
import csv
import requests
import re
import time
# Trustpilot review page
basePage = 'http://www.trustpilot.com/review/'
reviewSite = 'www.boo-hoo.com'
reviewPage = basePage + reviewSite
# Data file to save to
datafile = 'datascrap.csv'
# Trustpilot default
resultsPerPage = 20
print('Scraper set for ' + reviewPage + ' - saving result to ' + datafile)
# Get page, skipping HTTPS as it gives certificate errors
page = requests.get(reviewPage, verify=False)
tree = html.fromstring(page.content)
# Total amount of ratings
ratingCount = tree.xpath('//h2[@class="header--inline"]')
ratingCount = ratingCount[0].text.replace(',','')
ratingCount = ratingCount.replace(u'\xa0', u'')
ratingCount = ratingCount.replace(u'\n', u'')
ratingCount = ratingCount.replace(u'Average', u'')
ratingCount = ratingCount.replace(u' ', '')
ratingCount = ratingCount.replace(u'•', '')
ratingCount = ratingCount.replace(u'Great', '')
ratingCount = int(ratingCount)
# Amount of chunks to consider for displaying processing output
# For ex. 10 means output progress for every 10th of the data
tot_chunks = 20
# Throttling to avoid spamming page with requests
# With sleepTime seconds between every page request
throttle = True
sleepTime = 2
# Total pages to scrape
pages = math.ceil(ratingCount / resultsPerPage)
print('Found total of ' + str(pages) + ' pages to scrape')
with open(datafile, 'w', newline='', encoding='utf8') as csvfile:
    # Tab delimited to allow for special characters
    datawriter = csv.writer(csvfile, delimiter='\t')
    print('Processing..')
    for i in range(1, pages + 1):
        if (throttle): time.sleep(sleepTime)
        page = requests.get(reviewPage + '?page=' + str(i))
        tree = html.fromstring(page.content)
        # The item below scrapes a review body.
        bodies = tree.xpath('//p[@class="review-content__text"]')
        for idx, e in enumerate(bodies):
            # Progress counting, outputs for every processed chunk
            reviewNumber = idx + 20 * (i - 1) + 1
            chunk = int(ratingCount / tot_chunks)
            if reviewNumber % chunk == 0:
                print('Processed ' + str(reviewNumber) + '/' + str(ratingCount) + ' ratings')
            # Body of comment
            body = e.text_content().strip()
            datawriter.writerow([body])
    print('Processed ' + str(ratingCount) + '/' + str(ratingCount) + ' ratings.. Finished!')
For example, if the site has 80 reviews, I get the first 20 four times, yet when I print the page number on each iteration it shows 1, 2, 3, and so on.
Answer 0 (score: 2)
The value of reviewSite is incorrect. Change reviewSite = 'www.boo-hoo.com' to reviewSite = 'boo-hoo.com'.
If you go to page 2 in your browser, you will see:
https://www.trustpilot.com/review/boo-hoo.com?page=2
But you are concatenating www.boo-hoo.com, so the scraper incorrectly tries to go to:
https://www.trustpilot.com/review/www.boo-hoo.com?page=2
which then defaults to the first page.
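One way to catch this kind of silent fallback early is to compare the URL you asked for with the URL you actually ended up on. The sketch below is only an illustration, not part of the original script. The helper names build_page_url and landed_on_requested_page are hypothetical, the https base URL is taken from the browser address above, and it assumes the site answers a wrong slug with a redirect rather than serving first-page content under the requested URL.

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: base URL as seen in the browser address bar.
BASE_PAGE = 'https://www.trustpilot.com/review/'


def build_page_url(review_site, page_number):
    """Build the review URL for one page (hypothetical helper)."""
    return BASE_PAGE + review_site + '?page=' + str(page_number)


def landed_on_requested_page(requested_url, final_url):
    """Return True if the final URL keeps the requested path and page number.

    final_url is meant to be the URL after any redirects,
    for example response.url from requests.
    """
    req = urlsplit(requested_url)
    fin = urlsplit(final_url)
    same_path = req.path == fin.path
    same_page = parse_qs(req.query).get('page') == parse_qs(fin.query).get('page')
    return same_path and same_page
```

Inside the scraping loop this could be used as `url = build_page_url(reviewSite, i)`, then `page = requests.get(url)`, followed by a check such as `if not landed_on_requested_page(url, page.url): print('Unexpected redirect to ' + page.url)`. If the site does not redirect at all, this check will not fire, and comparing the first review text of consecutive pages would be a more reliable signal.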