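Scrapy - close method body is not executed

Time: 2017-11-23 10:17:39

Tags: python web-scraping scrapy scrapy-spider

I can't figure out why my close method is not being executed. I have to process two lists of URLs. One list must be processed and exported first, and only then the second list.

The problem is that the close method is called (a breakpoint stops at the def line), but its body is never executed. Do you know why?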

# coding=utf-8
from bot.items import TestItem
from scrapy import Spider, Request, signals
from scrapy.exceptions import DontCloseSpider
from scrapy.xlib.pydispatch import dispatcher

class IndexSpider(Spider):
    name = 'index_spider'
    allowed_domains = ['www.doman.org']

    def start_requests(self):

        for url in ["https://www.doman.org/eshop"]:

            yield Request(url, callback=self.parse_main_page)

    def parse_main_page(self, response):
        self.categories = [some tuples]
        self.subcategories = [some tuples]

    def close(self, spider): # Execution ends here
        pass # This is not being executed
        if self.categories:
            for cat in self.categories:
                url = "https://www.doman.org/search/getAjaxResult?categoryId={}".format(cat[0])
                yield Request(url, meta={'tup': cat, 'priority': 0}, priority=0, callback=self.parse_category)
            self.categories = []
            raise DontCloseSpider


2 answers:

Answer 0 (score: 0)

The close method is a static method: https://github.com/scrapy/scrapy/blob/master/scrapy/spiders/__init__.py#L101, so the signature of your close method does not match it.
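To illustrate the signature point, here is a minimal sketch in plain Python. BaseSpider is a simplified stand-in written for this example, not Scrapy's actual class; it only mimics the idea that the base close is static, takes (spider, reason), and forwards to an optional closed(reason) method on the spider. Under that assumption, defining closed(reason) is the less surprising hook to use than overriding close itself.

```python
class BaseSpider:
    """Simplified stand-in for scrapy.Spider (an assumption for illustration)."""

    @staticmethod
    def close(spider, reason):
        # Static: the spider is passed explicitly, together with the reason.
        closed = getattr(spider, "closed", None)
        if callable(closed):
            return closed(reason)


class IndexSpider(BaseSpider):
    def __init__(self):
        self.close_reason = None

    def closed(self, reason):
        # Cleanup hook: an ordinary method taking only the close reason.
        self.close_reason = reason


spider = IndexSpider()
BaseSpider.close(spider, "finished")
print(spider.close_reason)
```

Note that an override written as close(self, spider) takes different arguments than the static close(spider, reason), which is the mismatch this answer refers to.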

Answer 1 (score: 0)

I think you need to register the function like this:

class IndexSpider(Spider):

    def __init__(self, *args, **kwargs):
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        super(IndexSpider, self).__init__(*args, **kwargs)

    def spider_closed(self, spider): 
        pass # This is not being executed
        if self.categories:
            for cat in self.categories:
                url = "https://www.doman.org/search/getAjaxResult?categoryId={}".format(cat[0])
                yield Request(url, meta={'tup': cat, 'priority': 0}, priority=0, callback=self.parse_category)
            self.categories = []
            raise DontCloseSpider

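Also, I'm not sure whether you can still send requests from inside the spider_closed function, because the Spider has already been closed by that point.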

In your case, I suggest removing all the code from the spider_closed method and leaving only a log message like this:

import logging

def spider_closed(self, spider):
    logging.info("spider_closed() called")

That way you know that spider_closed was called, and then you can try sending Requests from that method.
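One thing to watch for when trying this: any function whose body contains yield is a generator function in Python, so calling it only creates a generator object and runs none of the body until something iterates it. That would match the symptom in the question (the breakpoint on the def line is hit, the body is not). The sketch below uses plain Python only; send_signal is a hypothetical dispatcher written for this example that, like a typical signal dispatcher, calls the handler and does not iterate the result.

```python
import inspect

executed = []


def handler_with_yield(spider):
    # Generator function: nothing here runs until the result is iterated.
    executed.append("body ran")
    yield "request"


def send_signal(handler, spider):
    # Hypothetical dispatcher: calls the handler and never iterates the result.
    return handler(spider)


result = send_signal(handler_with_yield, spider="index_spider")
is_generator = inspect.isgenerator(result)
ran_before_iteration = list(executed)  # snapshot: body has not run yet

list(result)  # only explicit iteration executes the body
ran_after_iteration = list(executed)

print(is_generator, ran_before_iteration, ran_after_iteration)
```

If this is what happens in the spider, removing yield from the handler should make the log line appear. Scrapy documents DontCloseSpider for handlers of the spider_idle signal, so that signal is probably a better place than spider_closed to schedule the second list of URLs.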