I am crawling the site "https://www.imdb.com/title/tt4695012/reviews?ref_=tt_ql_3". The data I need is the reviews and ratings from that site. I can only scrape 2 pages, but I want the reviews and ratings from all pages of the site.
Below is the code I tried.
I have included the site in start_urls.
    import json

    import scrapy
    from scrapy import Spider


    class RatingSpider(Spider):
        name = "rate"
        start_urls = ["https://www.imdb.com/title/tt4695012/reviews?ref_=tt_ql_3"]

        def parse(self, response):
            ratings = response.xpath("//div[@class='ipl-ratings-bar']//span[@class='rating-other-user-rating']//span[not(contains(@class, 'point-scale'))]/text()").getall()
            texts = response.xpath("//div[@class='text show-more__control']/text()").getall()
            result_data = []
            for i in range(len(ratings)):
                row = {}
                row["ratings"] = int(ratings[i])
                row["review_text"] = texts[i]
                result_data.append(row)
                print(json.dumps(row))
            next_page = response.xpath("//div[@class='load-more-data']").xpath("@data-key").extract()
            next_url = response.urljoin("reviews/_ajax?ref_=undefined&paginationKey=")
            next_url = next_url + next_page[0]
            if next_page is not None and len(next_page) != 0:
                yield scrapy.Request(next_url, callback=self.parse)
Please help me scrape all pages of the site.
Answer 0 (score: 1)
The problem is with your next_page URL. If you keep the original start URL and use it to build the URL for every following page, you will get all of the review data. Check this solution:
    import json

    import scrapy
    from urllib.parse import urljoin  # Python 3; on Python 2 use `from urlparse import urljoin`


    class RatingSpider(scrapy.Spider):
        name = "rate"
        start_urls = ["https://www.imdb.com/title/tt4695012/reviews?ref_=tt_ql_3"]

        def parse(self, response):
            ratings = response.xpath("//div[@class='ipl-ratings-bar']//span[@class='rating-other-user-rating']//span[not(contains(@class, 'point-scale'))]/text()").getall()
            texts = response.xpath("//div[@class='text show-more__control']/text()").getall()
            result_data = []
            for i in range(len(ratings)):
                row = {
                    "ratings": int(ratings[i]),
                    "review_text": texts[i]
                }
                result_data.append(row)
                print(json.dumps(row))
            # The pagination key for the next batch of reviews sits on the "load more" div.
            key = response.css("div.load-more-data::attr(data-key)").get()
            # Always join against the original reviews URL (carried in meta), not
            # against response.url: the AJAX responses live under a different path.
            orig_url = response.meta.get('orig_url', response.url)
            if key:
                next_url = urljoin(orig_url, "reviews/_ajax?paginationKey={}".format(key))
                yield scrapy.Request(next_url, meta={'orig_url': orig_url})
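To see why carrying the original URL in `meta` matters, here is a small standalone sketch (plain `urllib.parse`, with made-up `paginationKey` values) showing how `urljoin` resolves the relative AJAX path against the original reviews URL versus against an `_ajax` response URL:

```python
from urllib.parse import urljoin

orig_url = "https://www.imdb.com/title/tt4695012/reviews?ref_=tt_ql_3"

# Joining against the original reviews URL yields the correct endpoint.
good = urljoin(orig_url, "reviews/_ajax?paginationKey=abc123")
print(good)  # https://www.imdb.com/title/tt4695012/reviews/_ajax?paginationKey=abc123

# Joining against an _ajax response URL (what response.url would be from
# page 2 onward) nests the path one level too deep.
ajax_url = "https://www.imdb.com/title/tt4695012/reviews/_ajax?paginationKey=abc123"
bad = urljoin(ajax_url, "reviews/_ajax?paginationKey=def456")
print(bad)  # https://www.imdb.com/title/tt4695012/reviews/reviews/_ajax?paginationKey=def456
```

This is why the original question's spider stopped after two pages: the second request was built relative to the wrong base URL.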