I am trying to scrape some images from a website using Python Scrapy.
Everything works except that the process_item method in my pipeline is never called.
Here are my files:
Settings.py:
BOT_NAME = 'dealspider'
SPIDER_MODULES = ['dealspider.spiders']
NEWSPIDER_MODULE = 'dealspider.spiders'
DEFAULT_ITEM_CLASS = 'dealspider.items.DealspiderItem'
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline', 'dealspider.ImgPipeline.MyImagesPipeline']
IMAGES_STORE = '/Users/Comp/Desktop/projects/ndailydeals/dimages/full'
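(Side note, an assumption about the Scrapy version in use: in later Scrapy releases `ITEM_PIPELINES` is expected to be a dict mapping the pipeline path to an order number, lower numbers running first, rather than a plain list. The equivalent of the list above would be:)

```python
# Dict form of ITEM_PIPELINES used by later Scrapy versions;
# the integer is the run order (lower runs first).
ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 1,
    'dealspider.ImgPipeline.MyImagesPipeline': 2,
}
```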
ImgPipeline:
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        print "inside get_media_requests"
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        print "inside item_completed"
        return item

    def process_item(self, item, spider):
        if spider.name == 'SgsnapDeal':
            print "inside process_item"
            # some code not relevant to the qn
            deal = DailyDeals(source_website_url=source_website_url,
                              source_website_logo=source_website_logo,
                              description=description, price=price, url=url,
                              image_urls=image_urls, city=city,
                              currency=currency)
            deal.save()
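For reference, the `results` argument Scrapy passes to `item_completed` is a list of `(success, value)` 2-tuples; when `success` is true, `value` is a dict with `url`, `path`, and `checksum` keys (on failure it is a Twisted Failure). A standalone sketch of the filtering expression above, with made-up values:

```python
# Sketch of the data shape item_completed receives (values are made up).
results = [
    (True, {'url': 'http://www.snapdeal.com/a.jpg',
            'path': 'full/0a1b2c3d.jpg',
            'checksum': 'b9628c4ab9b595f72f280b90c4fd093d'}),
    (False, Exception('download failed')),  # really a twisted.python.failure.Failure
]

# The same filtering expression used in item_completed:
# keep only the storage paths of the successful downloads.
image_paths = [x['path'] for ok, x in results if ok]
```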
When I run the crawler, "inside process_item" is never printed. I also tried adding a process_item function directly to the scrapy.contrib.pipeline.images.py file, but that did not work either:
def process_item(self, item, info):
    print "inside process"
    pass
Question: everything else works fine. The images are downloaded, image_paths is set, and so on, and I know get_media_requests and item_completed run in MyImagesPipeline because of the print statements, but process_item is never called! Any help would be appreciated.
Edit: here are the other relevant files:
Spider:
from scrapy.spider import BaseSpider
from dealspider.items import DealspiderItem
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.pipeline.images import ImagesPipeline

class SG_snapDeal_Spider(BaseSpider):
    name = 'SgsnapDeal'
    allowed_domains = ['snapdeal.com']
    start_urls = [
        'http://www.snapdeal.com',
    ]

    def parse(self, response):
        item = DealspiderItem()
        hxs = HtmlXPathSelector(response)
        description = hxs.select('/html/body/div/div/div/div/div/div/div/div/div/a/div/div/text()').extract()
        price = hxs.select('/html/body/div/div/div/div/div/div/div/div/div/a/div/div/div/span/text()').extract()
        url = hxs.select('/html/body/div/div/div/div/div/div/div/div/div/a/@href').extract()
        image_urls = hxs.select('/html/body/div/div/div/div/div/div/div/div/div/a/div/div/img/@src').extract()
        item['description'] = description
        item['price'] = price
        item['url'] = url
        item['image_urls'] = image_urls
        # works fine!!
        return item

SPIDER = SG_snapDeal_Spider()
Items.py:
from scrapy.item import Item, Field

class DealspiderItem(Item):
    description = Field()
    price = Field()
    url = Field()
    image_urls = Field()
    images = Field()
    image_paths = Field()
Answer (score: 1):
You need to put process_item in a separate pipeline to save your item in the database, not in the images pipeline.

Make a separate pipeline:

class OtherPipeline(object):

    def process_item(self, item, spider):
        print "inside process"
        return item
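To see why this works, note that Scrapy's pipeline manager calls process_item on every enabled pipeline component in order, threading the item through the chain; ImagesPipeline already implements process_item internally to drive the downloads, which is why the asker's hooks get_media_requests and item_completed fire. A minimal pure-Python sketch of that chaining (stub classes, not Scrapy's actual code):

```python
class ImagesPipelineStub(object):
    # Stand-in for ImagesPipeline: its built-in process_item drives the
    # image downloads and calls item_completed when they finish.
    def process_item(self, item, spider):
        item['image_paths'] = ['full/abc.jpg']  # pretend downloads finished
        return item

class OtherPipelineStub(object):
    # Stand-in for the separate database pipeline from the answer.
    def process_item(self, item, spider):
        item['saved'] = True  # pretend the item was written to the database
        return item

def run_item_through(pipelines, item, spider=None):
    # What the pipeline manager does conceptually: call process_item on
    # each component in order, passing each one's return value onward.
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item

item = run_item_through([ImagesPipelineStub(), OtherPipelineStub()],
                        {'image_urls': ['http://www.snapdeal.com/x.jpg']})
```

This also shows why each process_item must return the item: the next pipeline in the chain receives whatever the previous one returned.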
and include that pipeline in your settings file.
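Assuming OtherPipeline lives in dealspider/pipelines.py (a hypothetical path; adjust it to wherever you define the class), the registration in settings.py would look like:

```python
# Both pipelines run, in list order: images are downloaded first,
# then OtherPipeline.process_item saves the item.
# 'dealspider.pipelines.OtherPipeline' is an assumed module path.
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline',
                  'dealspider.pipelines.OtherPipeline']
```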