Unable to download images with Scrapy

Date: 2017-04-29 19:18:04

Tags: python web-scraping scrapy

I can't download the images. I have several questions (I have tried many variations). Here is my code (I guess it has many errors)

The goal is to crawl the start URL, save all product images, and rename each one with its SKU number. The spider must also click the "next" button to repeat the same task on every page (there are about 24,000 products).

The problems I have noticed are:

  1. I don't know the exact configuration of the item pipeline
  2. The images are not downloaded into the folder defined in settings.py
  3. I want to filter images by resolution and use thumbnails. Which configuration is recommended? (see the settings sketch below)
  4. The images are hosted on a different server. Is that a problem?

    SETTINGS.PY

    BOT_NAME = 'soarimages'
    
    SPIDER_MODULES = ['soarimages.spiders']
    NEWSPIDER_MODULE = 'soarimages.spiders'
    DEFAULT_ITEM_CLASS = 'soarimages.items'
    ITEM_PIPELINES = {'soarimages.pipelines.soarimagesPipeline': 1}
    IMAGES_STORE = '/soarimages/images'
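
For context on questions 1-3 above: with Scrapy's built-in ImagesPipeline, the storage location, size filtering and thumbnail generation are all configured through settings. The following is only a sketch, not the poster's configuration; the size thresholds and thumbnail names are placeholder values:

    # settings.py sketch, assuming the stock ImagesPipeline is used
    ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
    IMAGES_STORE = '/soarimages/images'   # must point to a writable directory

    # drop images smaller than these dimensions (placeholder values)
    IMAGES_MIN_HEIGHT = 110
    IMAGES_MIN_WIDTH = 110

    # generate thumbnails alongside the full-size download
    IMAGES_THUMBS = {
        'small': (50, 50),
        'big': (270, 270),
    }

With IMAGES_THUMBS set, Scrapy stores the originals under full/ and the thumbnails under thumbs/<name>/ inside IMAGES_STORE. The pipeline fetches image URLs like any other request, so hosting the images on a different server is generally not a problem by itself.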
    

    ITEMS.PY

    import scrapy
    
    class soarimagesItem(scrapy.Item):
        title = scrapy.Field()
        image_urls = scrapy.Field()
        images = scrapy.Field()
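
A note on the fields above: the names are part of the ImagesPipeline contract. By default the pipeline reads the URLs to fetch from image_urls and writes the download results back into images; both names can be remapped through settings if non-default field names are ever needed, for example:

    # optional remapping; only needed when non-default field names are used
    IMAGES_URLS_FIELD = 'image_urls'
    IMAGES_RESULT_FIELD = 'images'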
    

    PIPELINES.PY

    import scrapy
    from scrapy.contrib.pipeline.images import ImagesPipeline
    
    class soarimagesPipeline(ImagesPipeline):

        def set_filename(self, response):
            # add a regex here to check the title is valid for a filename
            return 'full/{0}.jpg'.format(response.meta['title'][0])

        def get_media_requests(self, item, info):
            for image_url in item['image_urls']:
                yield scrapy.Request(image_url, meta={'title': item['title']})

        def get_images(self, response, request, info):
            for key, image, buf in super(soarimagesPipeline, self).get_images(response, request, info):
                key = self.set_filename(response)
                yield key, image, buf
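
As an aside, a common way to name stored files by SKU is to override file_path() instead of get_images(). The sketch below is illustrative only: the class name SkuNamedImagesPipeline is made up, it assumes item['title'] holds the plain SKU string, and it would have to replace soarimagesPipeline in ITEM_PIPELINES to take effect:

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline


    class SkuNamedImagesPipeline(ImagesPipeline):
        """Sketch: store every image as full/<SKU>.jpg inside IMAGES_STORE."""

        def get_media_requests(self, item, info):
            for image_url in item['image_urls']:
                # pass the SKU along so file_path() can use it later
                yield scrapy.Request(image_url, meta={'title': item['title']})

        def file_path(self, request, response=None, info=None):
            # file_path() decides the relative path of the stored file
            return 'full/{0}.jpg'.format(request.meta['title'])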
    

    Productphotos.PY (spider)

    # import the necessary packages
    import scrapy
    from scrapy.spiders import Rule, CrawlSpider
    from scrapy.linkextractors import LinkExtractor
    from soarimages.items import soarimagesItem
    
    class soarimagesSpider(scrapy.Spider):
        name = 'productphotos'
        allowed_domains = ['http://sodimac.com.ar','http://sodimacar.scene7.com']
        start_urls = ['http://www.sodimac.com.ar/sodimac-ar/search/']
        rules = [Rule(LinkExtractor(allow=['http://sodimacar.scene7.com/is/image//SodimacArgentina/.*']), 'parse')]

        def parse(self, response):
            SECTION_SELECTOR = '.one-prod'
            for soarimages in response.css(SECTION_SELECTOR):
                image = soarimagesItem()
                image['title'] = response.xpath('.//p[@class="sku"]/text()').re_first(r'SKU:\s*(.*)').strip(),
                rel = response.xpath('//div/a/img/@data-original').extract_first()
                image['image_urls'] = ['http:'+rel[0]]
                yield image

            NEXT_PAGE_SELECTOR = 'a.next ::attr(href)'
            next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
            if next_page:
                yield scrapy.Request(
                    response.urljoin(next_page),
                    callback=self.parse
                )
    

1 Answer:

Answer 0 (score: 0)

> Here is my code (I guess it has many errors)

Indeed, I can spot at least one error: allowed_domains should list domain names only. You must not include any http:// prefix:

    allowed_domains = ['sodimac.com.ar', 'sodimacar.scene7.com']

You may want to fix this and then test your spider. If new problems come up, please create a separate, specific question for each one; that makes it easier to help you. See also how to ask.