How does Scrapy process a Request's callback result?

Date: 2014-12-27 07:46:52

Tags: python callback scrapy iterable

Can anyone explain how Scrapy calls a Request's callback function and handles its result?

I understand that Scrapy accepts either a single object (a Request, a BaseItem, or None) or an iterable of such objects as the callback's result. For example:

1. Return a single object (a Request, a BaseItem, or None)

def parse(self, response):
    ...
    return scrapy.Request(...)

2. Return an iterable of objects

def parse(self, response):
    ...
    for url in self.urls:
        yield scrapy.Request(...)

I assumed Scrapy handles this somewhere in its code roughly like this:

# Assume process_callback_result is a function that is called after
# a Request's callback function has been executed.
# The "result" parameter is the callback's returned value

def process_callback_result(self, result):

    if isinstance(result, scrapy.Request):
        self.process_request(result)

    elif isinstance(result, scrapy.BaseItem):
        self.process_item(result)

    elif result is None:
        pass

    elif isinstance(result, collections.Iterable):
        for obj in result:
            self.process_callback_result(obj)
    else:
        # show error message
        # ...

I found the corresponding code in the function _process_spidermw_output in <PYTHON_HOME>/Lib/site-packages/scrapy/core/scraper.py:

def _process_spidermw_output(self, output, request, response, spider):
    """Process each Request/Item (given in the output parameter) returned
    from the given spider
    """
    if isinstance(output, Request):
        self.crawler.engine.crawl(request=output, spider=spider)
    elif isinstance(output, BaseItem):
        self.slot.itemproc_size += 1
        dfd = self.itemproc.process_item(output, spider)
        dfd.addBoth(self._itemproc_finished, output, response, spider)
        return dfd
    elif output is None:
        pass
    else:
        typename = type(output).__name__
        log.msg(format='Spider must return Request, BaseItem or None, '
                       'got %(typename)r in %(request)s',
                level=log.ERROR, spider=spider, request=request, typename=typename)

But I cannot find the part that corresponds to the elif isinstance(result, collections.Iterable): branch.

1 Answer:

Answer 0 (score: 6)

That is because _process_spidermw_output is only the handler for a single item/object. It is called from scrapy.utils.defer.parallel. This is the function that handles the spider output:

def handle_spider_output(self, result, request, response, spider):
    if not result:
        return defer_succeed(None)
    it = iter_errback(result, self.handle_spider_error, request, response, spider)
    dfd = parallel(it, self.concurrent_items,
        self._process_spidermw_output, request, response, spider)
    return dfd

Source: https://github.com/scrapy/scrapy/blob/master/scrapy/core/scraper.py#L163-L169
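For context, the iter_errback call above wraps the spider's result iterable so that an exception raised while iterating it is routed to handle_spider_error instead of aborting the whole loop. A rough sketch of what it does (paraphrased; see scrapy/utils/iterators.py for the exact code):

from twisted.python import failure

def iter_errback(iterable, errback, *a, **kw):
    # Yield items from the iterable; if iterating raises, pass the Failure
    # to the errback and keep pulling from the iterator.
    it = iter(iterable)
    while True:
        try:
            yield next(it)
        except StopIteration:
            break
        except Exception:
            errback(failure.Failure(), *a, **kw)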

As you can see, it calls parallel and passes a handle to the _process_spidermw_output function as an argument. That parameter is named callable, and it is invoked for every element of the iterable containing the spider's results. The parallel function is:

def parallel(iterable, count, callable, *args, **named):
    """Execute a callable over the objects in the given iterable, in parallel,
    using no more than ``count`` concurrent calls.
    Taken from: http://jcalderone.livejournal.com/24285.html
    """
    coop = task.Cooperator()
    work = (callable(elem, *args, **named) for elem in iterable)
    return defer.DeferredList([coop.coiterate(work) for i in xrange(count)])

Source: https://github.com/scrapy/scrapy/blob/master/scrapy/utils/defer.py#L50-L58
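To see what this buys you, here is a small self-contained toy version of the same pattern (assuming only that Twisted is installed; handle is a stand-in for _process_spidermw_output, not Scrapy code). Because work is a single generator shared by every coiterate task, each element is handed to the callable exactly once, with at most count calls in flight:

from twisted.internet import defer, reactor, task

def parallel(iterable, count, callable, *args, **named):
    coop = task.Cooperator()
    work = (callable(elem, *args, **named) for elem in iterable)
    return defer.DeferredList([coop.coiterate(work) for _ in range(count)])

def handle(item):
    # Stand-in for _process_spidermw_output: process one spider result.
    print("processing", item)

d = parallel(range(5), 2, handle)        # at most 2 concurrent calls
d.addCallback(lambda _: reactor.stop())
reactor.run()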

Basically, the process goes like this: when enqueue_scrape is called, the request and response are added to slot.queue through a call to slot.add_response_request. The queue is then processed by _scrape, which is invoked from _scrape_next. The _scrape function registers handle_spider_output as a callback, and that callback processes the items in the iterator. The iterator is created when call_spider is invoked (from _scrape2), because call_spider chains the callback scrapy.utils.spider.iterate_spider_output onto the spider's result:

def iterate_spider_output(result):
    return [result] if isinstance(result, BaseItem) else arg_to_iter(result)

Finally, the function that actually turns a single item, None, or an iterable into an iterator is scrapy.utils.misc.arg_to_iter().
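For completeness, the last two pieces look roughly like this; these are paraphrased sketches of call_spider (scrapy/core/scraper.py) and arg_to_iter (scrapy/utils/misc.py) from that era, not exact copies. call_spider is where the Request's callback is actually invoked, and arg_to_iter is where the single-object/None/iterable distinction from the question is made:

# Sketch: Scraper.call_spider chains the Request's callback (or spider.parse)
# and then iterate_spider_output onto the downloaded response.
def call_spider(self, result, request, spider):
    result.request = request
    dfd = defer_result(result)
    dfd.addCallbacks(request.callback or spider.parse, request.errback)
    return dfd.addCallback(iterate_spider_output)

# Sketch: arg_to_iter normalises any callback result into an iterable.
# In the real code, items, dicts and strings count as single values.
def arg_to_iter(arg):
    if arg is None:
        return []                    # None -> empty iterable
    elif isinstance(arg, (dict, str)) or not hasattr(arg, '__iter__'):
        return [arg]                 # single value -> one-element list
    else:
        return arg                   # already an iterable: pass it through

So the collections.Iterable check the question was looking for never happens explicitly: every callback result is normalised into an iterable by arg_to_iter, and _process_spidermw_output then only ever sees one element at a time.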