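I am new to Scrapy. I am writing a spider to download images. Is installing PIL enough to use the images pipeline? My PIL is located at
/usr/lib/python2.7/dist-packages/PIL
How do I include it in my Scrapy project?
Settings file: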
BOT_NAME = 'paulsmith'
BOT_VERSION = '1.0'
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGE_STORE = '/home/jay/Scrapy/paulsmith/images'
SPIDER_MODULES = ['paulsmith.spiders']
NEWSPIDER_MODULE = 'paulsmith.spiders'
DEFAULT_ITEM_CLASS = 'paulsmith.items.PaulsmithItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
Items file:
from scrapy.item import Item, Field

class PaulsmithItem(Item):
    image_urls = Field()
    image = Field()
    pass
Spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from paulsmith.items import PaulsmithItem

class PaulSmithSpider(BaseSpider):
    name = "Paul"
    allowed_domains = ["http://www.paulsmith.co.uk/uk-en/shop/mens"]
    start_urls = ["http://www.paulsmith.co.uk/uk-en/shop/mens/jeans"]

    def parse(self, response):
        item = PaulsmithItem()
        #open('paulsmith.html','wb').write(response.body)
        hxs = HtmlXPathSelector(response)
        #sites=hxs.select('//div[@class="category-products"]')
        item['image_urls'] = hxs.select("//div[@class='category-products']//a/img/@src").extract()
        #for site in sites:
            #print site.extract()
            #image = site.select('//a/img/@src').extract()
        return item

SPIDER = PaulSmithSpider()
Answer 0 (score: 0)
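You probably have not set IMAGES_STORE = '/path/to/valid/dir'. Note that your settings file defines IMAGE_STORE, without the S, which is a different name.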
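For reference, here is a minimal sketch of the image-related settings with the name corrected. It reuses the pipeline entry and the path from the question; that the directory is writable by the user running the crawl is an assumption you should verify.

```python
# settings.py (sketch): only the image-related settings are shown.
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']

# The images pipeline reads IMAGES_STORE (plural); IMAGE_STORE is not read.
# Path copied from the question; assumed to be writable.
IMAGES_STORE = '/home/jay/Scrapy/paulsmith/images'
```

As far as I know, the images pipeline is disabled at startup when IMAGES_STORE is missing or empty, which would explain a crawl that runs without downloading anything.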
Beyond that, try using a custom images pipeline like this:
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Schedule one download request per image URL on the item.
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # Keep the storage paths of the downloads that succeeded.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
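To make the list comprehension in item_completed concrete: results is a list of (success, info) pairs, one per request. For a successful download, info is a dict with keys such as url and path; otherwise it is a failure object. The sketch below uses made-up values (the URLs, the paths, and the plain Exception are placeholders, not real Scrapy output) to show what ends up in image_paths.

```python
# Made-up stand-in for the `results` argument of item_completed.
results = [
    (True, {'url': 'http://example.com/a.jpg', 'path': 'full/a.jpg'}),
    (False, Exception('download failed')),  # placeholder for a failure object
    (True, {'url': 'http://example.com/b.jpg', 'path': 'full/b.jpg'}),
]

# Same expression as in the pipeline: keep only the successful downloads.
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)  # ['full/a.jpg', 'full/b.jpg']
```

Also note that item['image_paths'] = image_paths only works if the item class declares an image_paths field. PaulsmithItem in the question declares only image_urls and image, so that assignment would raise a KeyError until the field is added.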
You can check from the get_media_requests method whether the image_urls are actually being requested.
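One thing worth checking there is whether the extracted src values are absolute URLs, since Request needs a full URL including the scheme. Below is a minimal sketch, assuming the page might use relative src attributes (this has not been checked against the actual site). absolutize is a hypothetical helper and the sample src values are invented.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in the question

def absolutize(base_url, srcs):
    """Resolve each extracted src against the URL of the page it came from."""
    return [urljoin(base_url, src) for src in srcs]

# Hypothetical src values: one relative, one already absolute.
page = "http://www.paulsmith.co.uk/uk-en/shop/mens/jeans"
urls = absolutize(page, ["/media/a.jpg", "http://cdn.example.com/b.jpg"])
print(urls)
# ['http://www.paulsmith.co.uk/media/a.jpg', 'http://cdn.example.com/b.jpg']
```

In the spider, the same helper could wrap the extract() result, with response.url as the base, before the list is stored in item['image_urls'].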