Scrapy: trying to scrape the following pages on Amazon

Time: 2020-04-13 00:16:07

Tags: python web-scraping scrapy

I'm trying to scrape all the reviews for this Amazon product (the link is here). However, it only returns the results from the first page.

Snapshot of the first page result

Below is my code, written with the Scrapy framework.

import scrapy
from ..items import AmazonItem


class SpideramazonSpider(scrapy.Spider):
    name = 'spideramazon'
    allowed_domains = ['amazon.co.uk, amazon.com']
    start_urls = ['https://www.amazon.com/Apple-MacBook-MC700LL-13-3-Inch-VERSION/product-reviews/B002QQ8H8I/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews']

    def parse(self, response):
        items = AmazonItem()
        getpage = response.css('div[data-hook=review]')
        for get_data in getpage:
            id = get_data.xpath('@id').extract()
            title = get_data.xpath('.//a[@data-hook="review-title"]/span/text()').extract()
            author_name = get_data.css('span.a-profile-name::text').extract()
            review_text = '\n'.join(get_data.xpath('.//span[@data-hook="review-body"]/span/text()').extract())
            stars = self.extract_stars(get_data)
            review_date = get_data.css('span.review-date::text').extract_first()

            items['id'] = id
            items['title'] = title
            items['author_name'] = author_name
            items['review_text'] = review_text
            items['stars'] = stars
            items['review_date'] = review_date
            yield items
        next_page = response.css('li.a-last a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield response.follow(url=next_page, callback=self.parse)

    def extract_stars(self, get_data):
        stars = None
        star_classes = get_data.css('i.a-icon-star::attr(class)').extract_first().split(' ')
        for i in star_classes:
            if i.startswith('a-star-'):
                stars = int(i[7:])
                break
        return stars

I'm very new to this, so any help would be greatly appreciated!

1 answer:

Answer 0 (score: 0)

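You need to get the links to the following pages, add them to the queue of links to be parsed, and keep checking whether that queue still holds more pages to scrape.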
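To make the queue idea concrete, here is a minimal standard-library sketch. It does not use Scrapy or touch Amazon: the `FAKE_PAGES` dictionary, the `example.com` URLs, and the `crawl` function are all hypothetical stand-ins invented for illustration.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> (reviews on that page, relative next link or None).
FAKE_PAGES = {
    "https://example.com/reviews?page=1": (["review A", "review B"], "/reviews?page=2"),
    "https://example.com/reviews?page=2": (["review C"], "/reviews?page=3"),
    "https://example.com/reviews?page=3": (["review D"], None),
}


def crawl(start_url, pages):
    """Collect reviews from every page reachable through next-page links."""
    queue = deque([start_url])  # links still waiting to be parsed
    seen = {start_url}          # guards against queueing the same page twice
    reviews = []
    while queue:                # keep going until no pages are left
        url = queue.popleft()
        page_reviews, next_href = pages[url]
        reviews.extend(page_reviews)
        if next_href is not None:
            next_url = urljoin(url, next_href)  # make the relative link absolute
            if next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)
    return reviews


if __name__ == "__main__":
    print(crawl("https://example.com/reviews?page=1", FAKE_PAGES))
```

In Scrapy you do not manage this queue yourself: the scheduler plays that role. Yielding `response.follow(next_page, callback=self.parse)` from `parse`, as the code in the question already does, is what puts the next page on the queue.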

You can read more about this technique here: https://medium.com/quick-code/python-scrapy-tutorial-for-beginners-03-how-to-go-to-the-next-page-d29827e0544b
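One more thing that may be worth checking, although it is only an assumption about a possible cause and not something confirmed here: in the question, `allowed_domains` holds a single string, `'amazon.co.uk, amazon.com'`, not two separate strings. If Scrapy's offsite filtering compares request hosts against that list, the follow-up request to `www.amazon.com` could be dropped even though the start URL was fetched. The sketch below is a simplified approximation of such a host check, not Scrapy's actual implementation; `is_allowed` and the `next_page` URL are hypothetical.

```python
import re
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain."""
    host = urlparse(url).hostname or ""
    # Build one pattern: optional subdomain prefix followed by any allowed domain.
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.search(pattern, host) is not None


# Hypothetical next-page URL, used only to exercise the check.
next_page = "https://www.amazon.com/product-reviews/B002QQ8H8I?pageNumber=2"

# One string holding two names: the host matches neither.
print(is_allowed(next_page, ["amazon.co.uk, amazon.com"]))
# Two separate strings: the host matches "amazon.com".
print(is_allowed(next_page, ["amazon.co.uk", "amazon.com"]))
```

If the crawl log shows a "Filtered offsite request" message for the second page, splitting the entry into `['amazon.co.uk', 'amazon.com']` would be the first thing to try.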