Scrapy crawl not extracting data

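Time: 2020-02-29 01:46:54

Tags: web-scraping scrapy scrapy-pipeline

I am trying to scrape reviews from BestBuy. The data is extracted fine when I run the code line by line in the shell, but not when I run it as a script. What is wrong?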

import scrapy


class BestbuybotSpider(scrapy.Spider):
    name = 'bestbuybot'
    allowed_domains = ['https://www.bestbuy.com/site/amazon-echo-dot-3rd-gen-smart-speaker-with-alexa-charcoal/6287974.p?skuId=6287974']
    start_urls = ['http://https://www.bestbuy.com/site/amazon-echo-dot-3rd-gen-smart-speaker-with-alexa-charcoal/6287974.p?skuId=6287974/']

    def parse(self, response):
        # Extract the ratings and review titles using CSS selectors
        rating = response.css("div.c-ratings-reviews-v2.v-small p::text").extract()
        title = response.css(".review-title.c-section-title.heading-5.v-fw-medium ::text").extract()

        # Pair each rating with its title, row by row
        for item in zip(rating, title):
            # Build a dictionary holding the scraped info
            scraped_info = {
                'rating': item[0],
                'title': item[1],
            }

            # Yield the scraped info to Scrapy
            yield scraped_info

Console Image

1 answer:

Answer 0 (score: 0)

Your code has a few problems:

  1. allowed_domains should contain domains, not URLs.
  2. The URL scheme of your start URL is wrong, i.e. the start URL begins with 'http://https:
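To make the second point concrete, here is a minimal sketch using only Python's standard `urllib.parse` to show how a URL parser reads the start URL from the question. The `good_url` value is an assumption: it is the same URL with the extra `http://` prefix and the trailing slash removed, which is not something the question or answer spells out.

```python
from urllib.parse import urlsplit

# Start URL exactly as written in the question (doubled scheme).
bad_url = (
    "http://https://www.bestbuy.com/site/"
    "amazon-echo-dot-3rd-gen-smart-speaker-with-alexa-charcoal/"
    "6287974.p?skuId=6287974/"
)

# Assumed correction: one scheme only, no trailing slash after the query.
good_url = (
    "https://www.bestbuy.com/site/"
    "amazon-echo-dot-3rd-gen-smart-speaker-with-alexa-charcoal/"
    "6287974.p?skuId=6287974"
)

bad = urlsplit(bad_url)
good = urlsplit(good_url)

# With the doubled scheme, the parser takes "https:" as the network location,
# so the host part is not www.bestbuy.com at all.
print(bad.scheme, bad.netloc)     # http https:
print(good.scheme, good.hostname)  # https www.bestbuy.com
```

With the malformed URL the request is not addressed to www.bestbuy.com, which fits the spider ending up on a different site.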

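As you can see in the image, the spider is redirected to finder.cox.net, so it never reaches the product page; it gets a country selection page instead, i.e. a redirect.

You should first fix the start URL so that it points to the correct page location; after that the spider appears to work.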
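As a rough sketch of what the fixed values might look like, the snippet below defines candidate replacements for the spider's `allowed_domains` and `start_urls` attributes and sanity-checks them with the standard library. The constants `ALLOWED_DOMAINS` and `START_URLS` and the helpers `looks_like_domain` and `host_is_allowed` are hypothetical names introduced here for illustration; the matching rule (host equals the domain or is a subdomain of it) only approximates how Scrapy's offsite filtering treats hosts.

```python
from urllib.parse import urlsplit

# Assumed corrected values for the spider's class attributes.
ALLOWED_DOMAINS = ["bestbuy.com"]
START_URLS = [
    "https://www.bestbuy.com/site/"
    "amazon-echo-dot-3rd-gen-smart-speaker-with-alexa-charcoal/"
    "6287974.p?skuId=6287974"
]


def looks_like_domain(value):
    """Return True if value is a bare domain (no scheme, no path)."""
    return "://" not in value and "/" not in value


def host_is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlsplit(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Every allowed_domains entry should be a bare domain,
# and every start URL should fall under one of them.
for domain in ALLOWED_DOMAINS:
    print(domain, looks_like_domain(domain))
for url in START_URLS:
    print(urlsplit(url).hostname, host_is_allowed(url, ALLOWED_DOMAINS))
```

Running the same checks on the values from the question flags both problems: the `allowed_domains` entry contains a scheme and a path, and the host of the start URL does not fall under bestbuy.com.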