Can anyone explain how Scrapy invokes a Request's callback function and processes its result?
I understand that Scrapy accepts either a single object (Request, BaseItem, or None) or an iterable of such objects as the callback's result. For example:
1. Returning a single object (Request, BaseItem, or None):
def parse(self, response):
    ...
    return scrapy.Request(...)
2. Returning an iterable of objects:
def parse(self, response):
    ...
    for url in self.urls:
        yield scrapy.Request(...)
I assumed Scrapy handled this somewhere in its code along these lines:
# Assume process_callback_result is a function that is called after
# a Request's callback function has been executed.
# The "result" parameter is the callback's return value.
def process_callback_result(self, result):
    if isinstance(result, scrapy.Request):
        self.process_request(result)
    elif isinstance(result, scrapy.BaseItem):
        self.process_item(result)
    elif result is None:
        pass
    elif isinstance(result, collections.Iterable):
        for obj in result:
            self.process_callback_result(obj)
    else:
        # show error message
        # ...
I found the corresponding code in the _process_spidermw_output function in <PYTHON_HOME>/Lib/site-packages/scrapy/core/scraper.py:
def _process_spidermw_output(self, output, request, response, spider):
    """Process each Request/Item (given in the output parameter) returned
    from the given spider
    """
    if isinstance(output, Request):
        self.crawler.engine.crawl(request=output, spider=spider)
    elif isinstance(output, BaseItem):
        self.slot.itemproc_size += 1
        dfd = self.itemproc.process_item(output, spider)
        dfd.addBoth(self._itemproc_finished, output, response, spider)
        return dfd
    elif output is None:
        pass
    else:
        typename = type(output).__name__
        log.msg(format='Spider must return Request, BaseItem or None, '
                       'got %(typename)r in %(request)s',
                level=log.ERROR, spider=spider, request=request, typename=typename)
But I cannot find the part corresponding to the elif isinstance(result, collections.Iterable): branch of that logic.
Answer 0 (score: 6):
That is because _process_spidermw_output is only the handler for a single item/object. It is called from scrapy.utils.defer.parallel. Here is the function that handles the spider output:
def handle_spider_output(self, result, request, response, spider):
    if not result:
        return defer_succeed(None)
    it = iter_errback(result, self.handle_spider_error, request, response, spider)
    dfd = parallel(it, self.concurrent_items,
                   self._process_spidermw_output, request, response, spider)
    return dfd
Source: https://github.com/scrapy/scrapy/blob/master/scrapy/core/scraper.py#L163-L169
As you can see, it calls parallel, passing a handle to the _process_spidermw_output function as an argument. That parameter is named callable, and it is invoked for every element of the iterable that holds the spider results. The parallel function is:
def parallel(iterable, count, callable, *args, **named):
    """Execute a callable over the objects in the given iterable, in parallel,
    using no more than ``count`` concurrent calls.

    Taken from: http://jcalderone.livejournal.com/24285.html
    """
    coop = task.Cooperator()
    work = (callable(elem, *args, **named) for elem in iterable)
    return defer.DeferredList([coop.coiterate(work) for i in xrange(count)])
Source: https://github.com/scrapy/scrapy/blob/master/scrapy/utils/defer.py#L50-L58
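If it helps to see that mechanism in isolation, here is a minimal, self-contained sketch of the same recipe outside of Scrapy. The name handle_item and the use of range(10) are made up for illustration; handle_item simply stands in for _process_spidermw_output.

from twisted.internet import defer, reactor, task

def parallel(iterable, count, callable, *args, **named):
    # Same recipe as the Scrapy helper above: every cooperative task pulls
    # from one shared generator, so no more than ``count`` elements are
    # in flight at any time.
    coop = task.Cooperator()
    work = (callable(elem, *args, **named) for elem in iterable)
    return defer.DeferredList([coop.coiterate(work) for _ in range(count)])

def handle_item(item):
    # Stand-in for _process_spidermw_output: handles one element of the output.
    print('processing', item)

dfd = parallel(range(10), 3, handle_item)
dfd.addBoth(lambda _: reactor.stop())
reactor.run()

Note that the concurrency limit only matters when the callable returns a Deferred (as _process_spidermw_output does for items going through the item pipeline); plain return values are consumed immediately.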
Basically, the process goes like this:
When enqueue_scrape is called, the request and response are added to slot.queue by calling slot.add_response_request. The queue is then processed by _scrape_next, which calls self._scrape. The _scrape function registers handle_spider_output as the callback that will process the items in the iterator. That iterator is created when _scrape2 is called: it invokes call_spider, which registers scrapy.utils.spider.iterate_spider_output as a callback:

def iterate_spider_output(result):
    return [result] if isinstance(result, BaseItem) else arg_to_iter(result)

Finally, the function that actually converts a single item, None, or an iterable into an iterator is scrapy.utils.misc.arg_to_iter():
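Its behaviour is roughly the following. This is a paraphrased sketch, not the exact source; check scrapy/utils/misc.py in your installed version for the real implementation.

def arg_to_iter(arg):
    # Paraphrased sketch of scrapy.utils.misc.arg_to_iter:
    # None becomes an empty list, an iterable (other than dicts and strings,
    # which count as single values) is returned unchanged, and anything else
    # is wrapped in a one-element list.
    if arg is None:
        return []
    elif not isinstance(arg, (dict, str)) and hasattr(arg, '__iter__'):
        return arg
    else:
        return [arg]

This is why a spider callback may return a single Request, an item, None, or any iterable of them: by the time handle_spider_output runs, the result has already been normalized into an iterable, which is the collections.Iterable handling the question was looking for.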