Spider doesn't seem to download images, even though I have installed PIL

Time: 2013-01-18 05:21:20

Tags: scrapy

I am new to Scrapy. I am writing a spider to download images. Is installing PIL enough to use the images pipeline? My PIL is located at
/usr/lib/python2.7/dist-packages/PIL

How do I include it in my Scrapy project?

Settings file:

BOT_NAME = 'paulsmith'
BOT_VERSION = '1.0'

ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGE_STORE = '/home/jay/Scrapy/paulsmith/images'


SPIDER_MODULES = ['paulsmith.spiders']
NEWSPIDER_MODULE = 'paulsmith.spiders'
DEFAULT_ITEM_CLASS = 'paulsmith.items.PaulsmithItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

Items file:

from scrapy.item import Item, Field

class PaulsmithItem(Item):

    image_urls=Field()  
    image = Field()
    pass

Spider code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from paulsmith.items import PaulsmithItem

class PaulSmithSpider(BaseSpider):
    name="Paul"
    allowed_domains=["http://www.paulsmith.co.uk/uk-en/shop/mens"]
    start_urls=["http://www.paulsmith.co.uk/uk-en/shop/mens/jeans"]

    def parse(self,response):
        item= PaulsmithItem()
        #open('paulsmith.html','wb').write(response.body)
        hxs=HtmlXPathSelector(response)
        #sites=hxs.select('//div[@class="category-products"]')
        item['image_urls']=hxs.select("//div[@class='category-products']//a/img/@src").extract()
        #for site in sites:
            #print site.extract()
            #image = site.select('//a/img/@src').extract()
        return item


SPIDER = PaulSmithSpider()

1 answer:

Answer 0: (score: 0)

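You may not have set IMAGES_STORE = '/path/to/valid/dir'. The settings file in the question defines IMAGE_STORE (without the S), which the images pipeline does not read.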
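A minimal sketch of what the corrected setting could look like, together with a quick sanity check on the directory. The path is the one from the question; `store_is_usable` is a hypothetical helper written for illustration and is not part of Scrapy. It only checks that the directory already exists and is writable by the current user.

```python
import os

# Plural setting name: IMAGES_STORE (path taken from the question).
IMAGES_STORE = '/home/jay/Scrapy/paulsmith/images'


def store_is_usable(path):
    """Hypothetical helper (not a Scrapy API).

    Returns True if `path` is an existing directory that the current
    user can write to.
    """
    return os.path.isdir(path) and os.access(path, os.W_OK)


if __name__ == '__main__':
    print(IMAGES_STORE, 'usable:', store_is_usable(IMAGES_STORE))
```

Running this once from the project directory should be enough to rule out a missing directory or a permissions problem before looking at the spider itself.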

In addition, try using a custom images pipeline like this:

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

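class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # One download request per URL collected by the spider.
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples, one per request.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        # The item class needs an image_paths field for this assignment.
        item['image_paths'] = image_paths
        return item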

You can check inside the get_media_requests method whether the image_urls are actually being requested.
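One way to act on this check is to look at the values in item['image_urls'] before they are turned into requests. The sketch below is plain Python with no Scrapy dependency. `split_image_urls` is a hypothetical helper, and the idea that relative or scheme-less src values might be the cause is an assumption, not something the question confirms. Scrapy's Request does reject URLs that lack a scheme, so such values would never be downloaded.

```python
# On Python 2.7 (as used in the question) the import is:
#     from urlparse import urlparse
from urllib.parse import urlparse


def split_image_urls(image_urls):
    """Hypothetical helper: separate absolute http(s) URLs from the rest.

    Returns a (usable, rejected) pair of lists, preserving input order.
    """
    usable, rejected = [], []
    for url in image_urls:
        parts = urlparse(url)
        if parts.scheme in ('http', 'https') and parts.netloc:
            usable.append(url)
        else:
            rejected.append(url)
    return usable, rejected
```

Calling it at the top of get_media_requests and logging the rejected list would show at a glance whether the XPath is returning anything and whether the results are fetchable as they stand.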

Reference: http://doc.scrapy.org/en/latest/topics/images.html