Scrapy item loader skipping values

Time: 2019-01-21 19:32:51

Tags: python scrapy

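I am using an item loader to scrape multiple pages. For some pages it returns an empty dict, but when I parse only those pages with the same rules, it returns values. Does anyone know why?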

Spider code:

import scrapy

# AmazonDataLoader is the asker's own ItemLoader subclass; its definition is not shown in the question.


class AllDataSpider(scrapy.Spider):

    name = 'all_data'  # spider name
    allowed_domains = ['amazon.com']

    # start URL: a category listing page
    start_urls = ["https://www.amazon.com/s?bbn=2619533011&rh=n%3A2619533011%2Cp_n_availability%3A2661601011&ie=UTF8&qid=1541604856&ref=lp_2619533011_nr_p_n_availability_1"]

    custom_settings = {'FEED_URI': 'pets_.csv'}  # CSV output file name

    def parse(self, response):
        '''
        Parse a category page: request every product page and follow pagination.
        '''

        # Note: the category is stored on the spider instance, so every listing page overwrites it.
        self.category = response.xpath(
            '//span[contains(@class, "nav-a-content")]//text()').extract_first()

        # Each data-asin attribute is a product ID, not a full URL.
        asins = response.xpath('//*[@data-asin]//@data-asin').extract()

        for asin in asins:
            base = f"https://www.amazon.com/dp/{asin}"
            yield scrapy.Request(base, callback=self.parse_item)

        next_page = response.xpath('//*[text()="Next"]//@href').extract_first()
        if next_page is not None:
            # No callback given, so the next listing page is handled by parse() again.
            yield scrapy.Request(response.urljoin(next_page),
                                 dont_filter=True)

    def parse_item(self, response):
        loader = AmazonDataLoader(selector=response)
        loader.add_xpath("Availability",
                         '//div[contains(@id, "availability")]//span//text()')
        loader.add_xpath("NAME", '//h1[@id="title"]//text()')
        loader.add_xpath("ASIN", '//*[@data-asin]//@data-asin')
        loader.add_xpath("REVIEWS", '//span[contains(@id, "Review")]//text()')

        # Prefer the SalesRank block; fall back to the first span containing "#".
        rank_check = response.xpath('//*[@id="SalesRank"]//text()')
        if rank_check:
            loader.add_xpath("RANKING", '//*[@id="SalesRank"]//text()')
        else:
            loader.add_xpath("RANKING",
                             '//span//span[contains(text(), "#")][1]//text()')

        loader.add_value("CATEGORY", self.category)

        return loader.load_item()

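For some pages it returns all values, for some it returns only the category, and for others (pages that do return values when I parse them on their own with the same rules) it returns nothing. The spider also closes before it has finished, and there are no errors.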

 DEBUG: Scraped from <200 https://www.amazon.com/dp/B0009X29WK>
{'ASIN': 'B0009X29WK',
 'Availability': 'In Stock.',
 'NAME': " Dr. Elsey's Cat Ultra Premium Clumping Cat Litter, 40 pound bag ( "
         'Pack May Vary ) ',
 'RANKING': '#1',
 'REVIEWS': '13,612'}
2019-01-21 21:13:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/dp/B01N9KSITZ> (referer: https://www.amazon.com/s?i=pets&bbn=2619533011&rh=n%3A2619533011%2Cp_n_availability%3A2661601011&lo=grid&page=2&ie=UTF8&qid=1548097190&ref=sr_pg_1)
2019-01-21 21:13:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/dp/B01N9KSITZ>
{}
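
One way to narrow this down is to log, for every product page, which fields came back empty and whether the response body looks like something other than a product page. The sketch below is only a diagnostic aid under stated assumptions: the helper names (`missing_fields`, `looks_blocked`, `diagnose`) are hypothetical, and the marker strings in `BLOCK_MARKERS` are guesses at what an anti-bot or captcha page might contain, not values confirmed by the question. It does not establish that blocking is the actual cause.

```python
# Field names taken from the loader calls in parse_item.
REQUIRED_FIELDS = ("ASIN", "NAME", "Availability", "REVIEWS", "RANKING")

# ASSUMPTION: illustrative marker strings only; adjust after inspecting a saved response body.
BLOCK_MARKERS = ("Robot Check", "captcha")


def missing_fields(item, required=REQUIRED_FIELDS):
    """Return the required field names that are absent or empty in the item."""
    return [name for name in required if not item.get(name)]


def looks_blocked(html, markers=BLOCK_MARKERS):
    """Return True if the page body contains any of the (assumed) marker strings."""
    lowered = html.lower()
    return any(marker.lower() in lowered for marker in markers)


def diagnose(url, html, item):
    """Build a one-line summary describing what a scraped item is missing and why."""
    missing = missing_fields(item)
    if not missing:
        return f"{url}: ok"
    if looks_blocked(html):
        reason = "possible block page"
    else:
        reason = "selectors matched nothing"
    return f"{url}: missing {missing} ({reason})"
```

Inside `parse_item`, this could be called as `self.logger.warning(diagnose(response.url, response.text, dict(item)))` after `item = loader.load_item()`. Comparing the logged reason for a page that yields `{}` in the full crawl with the same page parsed on its own would show whether the two runs received the same HTML.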

0 answers:

No answers