How to scrape data from multiple pages using the yield function

Date: 2019-04-11 09:34:49

Tags: scrapy scrapy-splash

I am trying to scrape data from the Amazon India website. I am not able to collect the response and parse the elements using the yield() method when:

1) I have to move from the product page to the review page

2) I have to move from one review page to the next review page

[Screenshot: Product page]

[Screenshot: Review page]

Code flow:

1) customerReviewData() calls getCustomerRatingsAndComments(response)

2) getCustomerRatingsAndComments(response) finds the URL of the review page and yields a Request for that URL, with getCrrFromReviewPage() as the callback method

3) getCrrFromReviewPage() receives the new response for the first review page, scrapes all the elements from that (loaded) page, and adds them to customerReviewDataList[]

4) If a next-page URL exists, getCrrFromReviewPage() is called recursively and the elements on the next page are scraped, until all review pages have been scraped

5) All the reviews are added to customerReviewDataList[]

I have tried changing the arguments to yield(), and I have also looked up yield() and request/response yielding in the Scrapy documentation. Here is my code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
import logging

customerReviewDataList = []
customerReviewData = {}

#Get product name in <H1>
def getProductTitleH1(response):
    titleH1 =  response.xpath('normalize-space(//*[@id="productTitle"]/text())').extract()
    return titleH1

def getCustomerRatingsAndComments(response):
    #Fetches the relative url
    reviewRelativePageUrl = response.css('#reviews-medley-footer a::attr(href)').extract_first()
    if reviewRelativePageUrl:
        #get absolute URL
        reviewPageAbsoluteUrl = response.urljoin(reviewRelativePageUrl)
        yield Request(url=reviewPageAbsoluteUrl, callback=getCrrFromReviewPage)
        logging.info("yield request complete")

    return len(customerReviewDataList)

def getCrrFromReviewPage(response):

    userReviewsAndRatings = response.xpath('//div[@id="cm_cr-review_list"]/div[@data-hook="review"]')


    for userReviewAndRating in userReviewsAndRatings:
        customerReviewData = {}  # build a fresh dict for each review so earlier entries are not overwritten
        customerReviewData['reviewTitle'] = userReviewAndRating.css('.review-title span::text').extract()
        customerReviewData['reviewDescription'] = userReviewAndRating.css('.review-text span::text').extract()
        customerReviewDataList.append(customerReviewData)

    reviewNextPageRelativeUrl = response.css('#cm_cr-pagination_bar .a-pagination .a-last a::attr(href)').extract_first()

    if reviewNextPageRelativeUrl:
        reviewNextPageAbsoluteUrl = response.urljoin(reviewNextPageRelativeUrl)
        yield Request(url=reviewNextPageAbsoluteUrl, callback=getCrrFromReviewPage)


class UsAmazonSpider(scrapy.Spider):
    name = 'Test_Crawler'
    allowed_domains = ['amazon.in']
    start_urls = ['https://www.amazon.in/Philips-Trimmer-Cordless-Corded-QT4011/dp/B00JJIDBIC/ref=sr_1_3?keywords=philips&qid=1554266853&s=gateway&sr=8-3']

    def parse(self, response):
        titleH1 = getProductTitleH1(response),
        customerReviewData = getCustomerRatingsAndComments(response)

        yield {
            'Title_H1': titleH1,
            'customer_Review_Data': customerReviewData
        }


I get the following response:

{'Title_H1': (['Philips Beard Trimmer Cordless and Corded for Men QT4011/15'],), 'customer_Review_Data': <generator object getCustomerRatingsAndComments at 0x048AC630>}

'customer_Review_Data' should be a list of dictionaries of review titles and review comments.

I cannot figure out what mistake I am making here.

When I use log() or print() to see what data has been captured in customerReviewDataList[], I cannot see any data in the console either.

When all the reviews were present on the product page itself, I was able to scrape them into customerReviewDataList[].

In this case, where I have to use the yield function, I get the output described above [https://ibb.co/kq8w6cf].

This is the output I am looking for:

[{'customerReviewTitle': ['Difficult to find a charger adapter'], 'customerReviewComment': ['I already have a phillips trimmer which was only cordless.']}, {'customerReviewTitle': ['Good Product'], 'customerReviewComment': ['Solves my need perfectly HK']}]

Any help is appreciated. Thanks in advance.

1 Answer:

Answer 0 (score: 1)

You should work through the Scrapy tutorial. The Following links section should be especially helpful to you.
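
For reference, the pattern that section teaches looks roughly like this (a minimal sketch based on the tutorial's quotes example, not on your Amazon spider):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # Scrape the items on the current page.
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}
        # Then follow the "next" link and parse it with this same method.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)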

Here is a simplified version of your code:

from scrapy import Request, Spider

def data_request_iterator():
    yield Request('https://example.org')

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {
            'title': response.css('title::text').get(),
            'data': data_request_iterator(),
        }

Scrapy never consumes that generator; it simply serializes the generator object into your item, which is exactly the <generator object ...> you can see in your output. It should look like this instead:

from scrapy import Request, Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        item = {
            'title': response.css('title::text').get(),
        }
        yield Request('https://example.org', meta={'item': item}, callback=self.parse_data)

    def parse_data(self, response):
        item = response.meta['item']
        # TODO: Extend item with data from this second response as needed.
        yield item
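
Applied to your spider, the same pattern might look something like the sketch below. The selectors are copied from your question and the field names from your desired output; Amazon's markup changes frequently, so treat them as assumptions to verify rather than working selectors:

import scrapy

class UsAmazonSpider(scrapy.Spider):
    name = 'Test_Crawler'
    allowed_domains = ['amazon.in']
    start_urls = ['https://www.amazon.in/Philips-Trimmer-Cordless-Corded-QT4011/dp/B00JJIDBIC/ref=sr_1_3?keywords=philips&qid=1554266853&s=gateway&sr=8-3']

    def parse(self, response):
        # Start the item on the product page and carry it through meta.
        item = {
            'Title_H1': response.xpath('normalize-space(//*[@id="productTitle"]/text())').get(),
            'customer_Review_Data': [],
        }
        review_page_url = response.css('#reviews-medley-footer a::attr(href)').get()
        if review_page_url:
            yield response.follow(review_page_url, meta={'item': item}, callback=self.parse_reviews)
        else:
            # No separate review page: yield what we have.
            yield item

    def parse_reviews(self, response):
        item = response.meta['item']
        for review in response.xpath('//div[@id="cm_cr-review_list"]/div[@data-hook="review"]'):
            item['customer_Review_Data'].append({
                'customerReviewTitle': review.css('.review-title span::text').get(),
                'customerReviewComment': review.css('.review-text span::text').get(),
            })
        next_page = response.css('#cm_cr-pagination_bar .a-pagination .a-last a::attr(href)').get()
        if next_page:
            # More review pages: keep following them, carrying the same item along.
            yield response.follow(next_page, meta={'item': item}, callback=self.parse_reviews)
        else:
            # Last review page reached: the item is complete, yield it once.
            yield item

The key point is that the item is yielded exactly once, from whichever callback sees the last page; until then it travels from response to response through meta.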