I have a Scrapy project that loads its pipelines but does not pass items to them. Any help is appreciated.
A stripped-down version of the spider:
#imports
class MySpider(CrawlSpider):
    #RULES AND STUFF
    def parse_item(self, response):
        '''Takes HTML response and turns it into an item ready for database. I hope.
        '''
        #A LOT OF CODE
        return item
Printing the item at this point produces the expected result, and settings.py is simple enough:
ITEM_PIPELINES = [
    'mySpider.pipelines.MySpiderPipeline',
    'mySpider.pipelines.PipeCleaner',
    'mySpider.pipelines.DBWriter',
]
And the pipelines look correct (imports omitted):
class MySpiderPipeline(object):
    def process_item(self, item, spider):
        print 'PIPELINE: got ', item['name']
        return item

class DBWriter(object):
    """Writes each item to a DB. I hope.
    """
    def __init__(self):
        self.dbpool = adbapi.ConnectionPool('MySQLdb'
            , host=settings['HOST']
            , port=int(settings['PORT'])
            , user=settings['USER']
            , passwd=settings['PASS']
            , db=settings['BASE']
            , cursorclass=MySQLdb.cursors.DictCursor
            , charset='utf8'
            , use_unicode=True
            )
        print('init DBWriter')

    def process_item(self, item, spider):
        print 'DBWriter process_item'
        query = self.dbpool.runInteraction(self._insert, item)
        query.addErrback(self.handle_error)
        return item

    def _insert(self, tx, item):
        print 'DBWriter _insert'
        # A LOT OF UNRELATED CODE HERE
        return item

class PipeCleaner(object):
    def __init__(self):
        print 'Cleaning these pipes.'

    def process_item(self, item, spider):
        print item['name'], ' is cleeeeaaaaannn!!'
        return item
When I run the spider, I get this output at startup:
Cleaning these pipes.
init DBWriter
2012-10-23 15:30:04-0400 [scrapy] DEBUG: Enabled item pipelines: MySpiderPipeline, PipeCleaner, DBWriter
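Unlike the __init__ methods, which print to the screen when the crawler starts, the process_item methods never print (or process) anything. I'm crossing my fingers that I have forgotten something very simple.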
Answer 0 (score: 1)
2012-10-23 15:30:04-0400 [scrapy] DEBUG: Enabled item pipelines: MySpiderPipeline, PipeCleaner, DBWriter
This line shows that your pipelines are being initialized, so they are fine.
The problem is in your spider class:
class MySpider(CrawlSpider):
    #RULES AND STUFF
    def parse_item(self, response):
        '''Takes HTML response and turns it into an item ready for database. I hope.
        '''
        #A LOT OF CODE
        # before returning item , print it
        return item
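I think you should print the item just before returning it from MySpider, to confirm that parse_item is actually being called and produces something.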
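To show why that print is a useful check, here is a toy model in plain Python 3 (no Scrapy import) of how a callback's output might reach the pipelines. Everything in it is hypothetical and only for illustration: run_callback, PrintingPipeline, parse_nothing and the dict-based "response" are my own stand-ins, not Scrapy's real engine or API. The point is just that if the callback is never reached, or ends up returning nothing, process_item has nothing to run on even though every pipeline was constructed.

```python
def run_callback(callback, response, pipelines):
    """Toy stand-in for the engine: pass callback output through each pipeline."""
    result = callback(response)
    if result is None:
        items = []                # callback produced nothing
    elif isinstance(result, dict):
        items = [result]          # a single item was returned
    else:
        items = list(result)      # a generator or list of items
    processed = []
    for item in items:
        for pipeline in pipelines:
            item = pipeline.process_item(item)
        processed.append(item)
    return processed


class PrintingPipeline:
    """Hypothetical pipeline that records the names it sees."""

    def __init__(self):
        self.seen = []

    def process_item(self, item):
        print('PIPELINE: got', item['name'])
        self.seen.append(item['name'])
        return item


def parse_item(response):
    item = {'name': response['title']}
    print('parse_item produced:', item)   # the suggested debug print
    return item


def parse_nothing(response):
    item = {'name': response['title']}
    # Falls off the end without returning: the pipelines stay silent.


if __name__ == '__main__':
    pipe = PrintingPipeline()
    run_callback(parse_item, {'title': 'widget'}, [pipe])
    run_callback(parse_nothing, {'title': 'widget'}, [pipe])
```

If the debug print never shows up in a real crawl, the first thing I would check is whether the rules really route responses to parse_item, since the pipelines themselves look fine.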
Answer 1 (score: 1)
"Better late than never."
#imports
class MySpider(CrawlSpider):
    #RULES AND STUFF
    def parse_item(self, response):
        '''Takes HTML response and turns it into an item ready for database. I hope.
        '''
        #A LOT OF CODE
        yield item  # <------- yield instead of return
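For anyone unsure what the change does, here is a minimal plain Python 3 sketch of the difference, with no Scrapy involved. The function names and the list of rows are made up for illustration. Whether this is what bit the asker depends on the elided code, so take it as a sketch of the language behaviour rather than a diagnosis.

```python
def parse_with_return(rows):
    """Returns on the first row, so later rows are never turned into items."""
    for row in rows:
        return {'name': row}


def parse_with_yield(rows):
    """Generator: hands back one item per row, lazily."""
    for row in rows:
        yield {'name': row}


if __name__ == '__main__':
    rows = ['a', 'b', 'c']
    print(parse_with_return(rows))
    print(list(parse_with_yield(rows)))
```

With return inside a loop only the first item ever leaves the callback, while the generator version emits every item it builds. Note that a function containing yield returns a generator object, so nothing in its body runs until something iterates over it.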