Unable to download images with Scrapy

Date: 2017-04-29 19:18:04

Tags: python web-scraping scrapy

I can't download the images. I have several questions (I have tried many variations). Here is my code (I guess it has many errors)

The goal is to crawl the start URL, save all product images, and rename each one with its SKU number. The spider must also click the "next" button to repeat the same task on every page (there are about 24,000 products).

The problems I have noticed are:

  1. I don't know the exact configuration of the item pipeline
  2. The images are not downloaded into the folder defined in settings.py
  3. I want to filter images by resolution and use thumbnails. Which configuration is recommended? (see the settings sketch below)
  4. The images are hosted on a different server. Is that a problem?

    SETTINGS.PY

    BOT_NAME = 'soarimages'
    
    SPIDER_MODULES = ['soarimages.spiders']
    NEWSPIDER_MODULE = 'soarimages.spiders'
    DEFAULT_ITEM_CLASS = 'soarimages.items'
    ITEM_PIPELINES = {'soarimages.pipelines.soarimagesPipeline': 1}
    IMAGES_STORE = '/soarimages/images'
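
For context on questions 1-3 above: with Scrapy's built-in ImagesPipeline, the storage location, size filtering and thumbnail generation are all configured through settings. The following is only a sketch, not the poster's configuration; the size thresholds and thumbnail names are placeholder values:

    # settings.py sketch, assuming the stock ImagesPipeline is used
    ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
    IMAGES_STORE = '/soarimages/images'   # must point to a writable directory

    # drop images smaller than these dimensions (placeholder values)
    IMAGES_MIN_HEIGHT = 110
    IMAGES_MIN_WIDTH = 110

    # generate thumbnails alongside the full-size download
    IMAGES_THUMBS = {
        'small': (50, 50),
        'big': (270, 270),
    }

With IMAGES_THUMBS set, Scrapy stores the originals under full/ and the thumbnails under thumbs/<name>/ inside IMAGES_STORE. The pipeline fetches image URLs like any other request, so hosting the images on a different server is generally not a problem by itself.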
    

    ITEMS.PY

    import scrapy
    
    class soarimagesItem(scrapy.Item):
        title = scrapy.Field()
        image_urls = scrapy.Field()
        images = scrapy.Field()
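
A note on the fields above: the names are part of the ImagesPipeline contract. By default the pipeline reads the URLs to fetch from image_urls and writes the download results back into images; both names can be remapped through settings if non-default field names are ever needed, for example:

    # optional remapping; only needed when non-default field names are used
    IMAGES_URLS_FIELD = 'image_urls'
    IMAGES_RESULT_FIELD = 'images'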
    

    PIPELINES.PY

    import scrapy
    from scrapy.contrib.pipeline.images import ImagesPipeline
    
    class soarimagesPipeline(ImagesPipeline):

        def set_filename(self, response):
            # add a regex here to check the title is valid for a filename
            return 'full/{0}.jpg'.format(response.meta['title'][0])

        def get_media_requests(self, item, info):
            for image_url in item['image_urls']:
                yield scrapy.Request(image_url, meta={'title': item['title']})

        def get_images(self, response, request, info):
            for key, image, buf in super(soarimagesPipeline, self).get_images(response, request, info):
                key = self.set_filename(response)
                yield key, image, buf
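
As an aside, a common way to name stored files by SKU is to override file_path() instead of get_images(). The sketch below is illustrative only: the class name SkuNamedImagesPipeline is made up, it assumes item['title'] holds the plain SKU string, and it would have to replace soarimagesPipeline in ITEM_PIPELINES to take effect:

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline


    class SkuNamedImagesPipeline(ImagesPipeline):
        """Sketch: store every image as full/<SKU>.jpg inside IMAGES_STORE."""

        def get_media_requests(self, item, info):
            for image_url in item['image_urls']:
                # pass the SKU along so file_path() can use it later
                yield scrapy.Request(image_url, meta={'title': item['title']})

        def file_path(self, request, response=None, info=None):
            # file_path() decides the relative path of the stored file
            return 'full/{0}.jpg'.format(request.meta['title'])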
    

    Productphotos.PY (spider)

    # import the necessary packages
    import scrapy
    from scrapy.spiders import Rule, CrawlSpider
    from scrapy.linkextractors import LinkExtractor
    from soarimages.items import soarimagesItem
    
    class soarimagesSpider(scrapy.Spider):
        name = 'productphotos'
        allowed_domains = ['http://sodimac.com.ar','http://sodimacar.scene7.com']
        start_urls = ['http://www.sodimac.com.ar/sodimac-ar/search/']
        rules = [Rule(LinkExtractor(allow=['http://sodimacar.scene7.com/is/image//SodimacArgentina/.*']), 'parse')]

        def parse(self, response):
            SECTION_SELECTOR = '.one-prod'
            for soarimages in response.css(SECTION_SELECTOR):
                image = soarimagesItem()
                image['title'] = response.xpath('.//p[@class="sku"]/text()').re_first(r'SKU:\s*(.*)').strip(),
                rel = response.xpath('//div/a/img/@data-original').extract_first()
                image['image_urls'] = ['http:'+rel[0]]
                yield image

            NEXT_PAGE_SELECTOR = 'a.next ::attr(href)'
            next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
            if next_page:
                yield scrapy.Request(
                    response.urljoin(next_page),
                    callback=self.parse
                )
    

1 Answer:

Answer 0 (score: 0)

> Here is my code (I guess it has many errors)

Indeed, I can spot at least one error: allowed_domains should list domain names only. You must not include any http:// prefix:

    allowed_domains = ['sodimac.com.ar', 'sodimacar.scene7.com']

You may want to fix this and then test your spider. If new problems come up, please create a separate, specific question for each one; that makes it easier to help you. See also how to ask.