Question

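I am getting started with Scrapy in order to automate downloading files from websites. As a test, I want to download the jpg files from this website. My code is based on the intro tutorial and the Files and Images Pipeline tutorial on the Scrapy website.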

My code is:

In settings.py, I added these lines:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

IMAGES_STORE = '/home/lucho/Scrapy/jpg/'

My items.py file is:

import scrapy

class JpgItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    pass

My pipeline file is:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class JpgPipeline(object):
    def process_item(self, item, spider):
        return item
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

Finally, my spider file is as follows (ideally, I would like to download every jpg without specifying the exact URL of each file):

import scrapy
from .. items import JpgItem

class JpgSpider(scrapy.Spider):
    name = "jpg"
    allowed_domains = ["http://www.kevinsmedia.com"]
    start_urls = [
        "http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/"
    ]

def init_request(self):
    #"""This function is called before crawling starts."""
    return Request(url=self.login_page, callback=self.parse)

def parse(self, response):
    item = JpgItem()
    return item

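Running scrapy crawl jpg does not appear to produce any errors, but the program does not retrieve the jpg files. In case it matters, I am using Ubuntu.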

Answer 1

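You have not defined parse() inside your JpgSpider class. As posted, init_request() and parse() start at column zero, so Python treats them as module-level functions rather than as methods of the spider.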
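To illustrate why the indentation matters, here is a minimal sketch in plain Python. BaseSpider is a hypothetical stand-in for scrapy.Spider (its default parse() behaviour is an assumption made for illustration), and has_own_parse() is a helper invented for this example; none of it is Scrapy's own code.

```python
class BaseSpider:
    """Hypothetical stand-in for scrapy.Spider (an assumption for illustration)."""

    def parse(self, response):
        # Assumed default behaviour: complain that no callback was defined.
        raise NotImplementedError("parse callback is not defined")


class BrokenSpider(BaseSpider):
    name = "jpg"


# Dedented by mistake: this is a plain module-level function,
# not a method of BrokenSpider.
def parse(self, response):
    return "parsed " + response


class FixedSpider(BaseSpider):
    name = "jpg"

    # Indented inside the class body, so it overrides BaseSpider.parse.
    def parse(self, response):
        return "parsed " + response


def has_own_parse(spider_cls):
    """Return True if the class body itself defines parse()."""
    return "parse" in vars(spider_cls)
```

In this sketch, has_own_parse(BrokenSpider) is False and has_own_parse(FixedSpider) is True: only the indented version belongs to the class.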

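Update: Now that I can see the URL in your update, this does not look like a problem you should attack with Scrapy. wget is probably a better fit; see the answers here. In particular, look at the first comment on the top answer to see how to restrict the download to a given file extension (-A jpg).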

Update 2: Your parse() routine can use this code to get the album art URLs from the <a> tags:

# .extract() turns the selector results into a plain list of href strings
part_urls = response.xpath('//a[contains(., "AlbumArt")]/@href').extract()

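This returns a list of partial URLs; you need to prepend the root URL of the page being parsed, which you can take from response.url. Several of the URLs I looked at contain % codes, which may cause problems, but try it anyway. Once you have the list of full URLs, put it into the item: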

item['image_urls'] = full_urls
yield item
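The URL-joining step described above could be sketched as follows, using urljoin from the Python standard library. The helper name build_full_urls and the two sample hrefs are hypothetical, made up for illustration; the real hrefs would come from the XPath query.

```python
from urllib.parse import urljoin


def build_full_urls(page_url, part_urls):
    """Resolve each partial href against the URL of the page it came from."""
    return [urljoin(page_url, part) for part in part_urls]


# In the spider this would be response.url.
page_url = "http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/"
# Hypothetical hrefs, invented for illustration only.
part_urls = ["AlbumArt%20Large.jpg", "AlbumArt_Small.jpg"]

full_urls = build_full_urls(page_url, part_urls)
```

urljoin leaves the % codes untouched, so an already-encoded href stays encoded. Inside parse(), response.urljoin(href) should give the same result without the extra import.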

This should make Scrapy download the images automatically, so you can remove your middleware and let Scrapy do the heavy lifting.
