I can't download the images. I have several questions (I've tried many variations). This is my code (I guess it has many errors).
The goal is to scrape the start URL, save all the product images, and rename each file after its SKU number. The spider must also click the "next" button and repeat the same task on every page (there are about 24,000 products).
The problems I have noticed are:
SETTINGS.PY
BOT_NAME = 'soarimages'
SPIDER_MODULES = ['soarimages.spiders']
NEWSPIDER_MODULE = 'soarimages.spiders'
DEFAULT_ITEM_CLASS = 'soarimages.items'
ITEM_PIPELINES = {'soarimages.pipelines.soarimagesPipeline': 1}
IMAGES_STORE = '/soarimages/images'
ITEMS.PY
import scrapy
class soarimagesItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
PIPELINES.PY
import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
class soarimagesPipeline(ImagesPipeline):

    def set_filename(self, response):
        # add a regex here to check the title is valid for a filename.
        return 'full/{0}.jpg'.format(response.meta['title'][0])

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta={'title': item['title']})

    def get_images(self, response, request, info):
        for key, image, buf in super(soarimagesPipeline, self).get_images(response, request, info):
            key = self.set_filename(response)
            yield key, image, buf
Productphotos.PY (the spider)
# import the necessary packages
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from soarimages.items import soarimagesItem
class soarimagesSpider(scrapy.Spider):
    name = 'productphotos'
    allowed_domains = ['http://sodimac.com.ar','http://sodimacar.scene7.com']
    start_urls = ['http://www.sodimac.com.ar/sodimac-ar/search/']
    rules = [Rule(LinkExtractor(allow=['http://sodimacar.scene7.com/is/image//SodimacArgentina/.*']), 'parse')]

    def parse(self, response):
        SECTION_SELECTOR = '.one-prod'
        for soarimages in response.css(SECTION_SELECTOR):
            image = soarimagesItem()
            image['title'] = response.xpath('.//p[@class="sku"]/text()').re_first(r'SKU:\s*(.*)').strip(),
            rel = response.xpath('//div/a/img/@data-original').extract_first()
            image['image_urls'] = ['http:'+rel[0]]
            yield image

        NEXT_PAGE_SELECTOR = 'a.next ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
Answer 0 (score: 0)
"This is my code (I guess it has many errors)"
In fact, I found at least one error: allowed_domains should list bare domain names only. You must not include any http:// prefix. Scrapy builds its off-site filter from these strings and matches them against request hostnames, so an entry that includes a scheme never matches anything and the requests get dropped:
allowed_domains = ['sodimac.com.ar', 'sodimacar.scene7.com']
You may want to fix this and then test your spider. If new problems come up, ask a specific question for each particular problem; that makes it easier to help you. See also: how to ask.
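One more thing worth knowing for the renaming part: set_filename is not a method the images pipeline ever calls, so that override has no effect on its own. The supported hook for choosing output filenames is ImagesPipeline.file_path, and in current Scrapy the pipeline lives in scrapy.pipelines.images (the scrapy.contrib path was removed long ago). Below is a minimal sketch, not your original code, assuming Scrapy ≥ 2.4 (where file_path receives the item keyword argument) and assuming item['title'] holds the SKU as a plain string (note the trailing comma on the title assignment in parse currently makes it a one-element tuple):

PIPELINES.PY (sketch)

import re

from scrapy.pipelines.images import ImagesPipeline  # current import path


class soarimagesPipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None, *, item=None):
        # Name each downloaded image after its SKU, replacing
        # characters that are unsafe in filenames.
        sku = re.sub(r'[^\w\-]', '_', item['title'])
        return 'full/{0}.jpg'.format(sku)

With this, the get_media_requests and get_images overrides can be dropped entirely: the default get_media_requests already downloads everything in item['image_urls']. While testing, also double-check the spider: extract_first() returns a string, so rel[0] is only its first character; 'http:' + rel is probably what you meant.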