Can't download images from a website with Scrapy

Asked: 2015-12-07 02:49:53

Tags: python scrapy scrapy-spider scrapy-pipeline

I'm starting out with Scrapy in order to automate downloading files from websites. As a test, I want to download the jpg files from this website. My code is based on the intro tutorial and the Files and Images Pipeline tutorial on the Scrapy website.

My code is:

In settings.py, I added these lines:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

IMAGES_STORE = '/home/lucho/Scrapy/jpg/'

My items.py file is:

import scrapy

class JpgItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    pass

My pipeline file is:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class JpgPipeline(object):
    def process_item(self, item, spider):
        return item
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

Ideally, I'd like to download all the jpg files without specifying the exact URL of each file I need.

Finally, my spider file is:

import scrapy
from .. items import JpgItem

class JpgSpider(scrapy.Spider):
    name = "jpg"
    allowed_domains = ["http://www.kevinsmedia.com"]
    start_urls = [
        "http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/"
    ]

def init_request(self):
    #"""This function is called before crawling starts."""
    return Request(url=self.login_page, callback=self.parse)

def parse(self, response):
    item = JpgItem()
    return item

The output of "scrapy crawl jpg" shows no apparent errors, but the program does not retrieve any jpg files. In case it matters, I'm running Ubuntu.

1 Answer:

Answer 0 (score: 0)

You have not defined parse() inside the JpgSpider class.
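
For reference, a minimal sketch of the spider with parse() indented inside the class (this is the question's own code, re-indented; as a side note, allowed_domains is normally a bare domain name rather than a full URL):

import scrapy
from ..items import JpgItem

class JpgSpider(scrapy.Spider):
    name = "jpg"
    # bare domain name so Scrapy's offsite filtering works as intended
    allowed_domains = ["kevinsmedia.com"]
    start_urls = [
        "http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/"
    ]

    def parse(self, response):
        # defined inside the class so Scrapy can find and call it
        item = JpgItem()
        return item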

UPDATE: Now that I can see the URL in your update, this doesn't look like a problem you should be attacking with scrapy. wget may be more appropriate; have a look at the answers here. In particular, see the first comment on the top answer for how to use a file extension to limit which files get downloaded (-A jpg).
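
For example, a recursive wget limited to jpg files might look like this (a sketch using standard wget flags: -r to recurse, -np to avoid climbing to the parent directory, -nd to save files flat without recreating the directory tree, -A jpg to accept only jpg files):

wget -r -np -nd -A jpg http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/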

UPDATE 2: The parse() routine can get the album-art URLs from the <a> tags with this code:

part_urls = response.xpath('//a[contains(., "AlbumArt")]/@href')

This returns a list of partial URLs; you need to prepend the root URL of the page you are parsing, taken from response.url. There are a few % codes in the URLs I looked at, which may cause problems, but try it anyway. Once you have the list of full URLs, add it to the item:

item['image_urls'] = full_urls
yield item
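
Putting it together, parse() might look roughly like this (a sketch: .extract() turns the selectors into plain strings, and response.urljoin(), available since Scrapy 1.0, resolves each partial URL against the page URL):

def parse(self, response):
    item = JpgItem()
    # partial hrefs from the album-art links
    part_urls = response.xpath('//a[contains(., "AlbumArt")]/@href').extract()
    # resolve each partial URL against the page's own URL
    item['image_urls'] = [response.urljoin(u) for u in part_urls]
    yield item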

This should get scrapy to download the images automatically, letting you delete your middleware and have scrapy do the heavy lifting.