Possibly incorrect spider / exporter example code in the Scrapy docs

Date: 2012-01-31 15:14:40

Tags: python scrapy

Can someone check whether the code below is correct? It comes from http://readthedocs.org/docs/scrapy/en/0.14/topics/exporters.html

I believe it is incorrect, because:

  • the class keeps track of multiple simultaneously open files, one per spider, but:
  • the exporter (which wraps the file) is overwritten every time a new spider is opened.

Thanks for your help.

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter


class XmlExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        # a single shared attribute: each newly opened spider overwrites it
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
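To see the suspected overwrite concretely, here is a minimal, Scrapy-free sketch. The `FakeSpider`, `FakeExporter`, and `BuggyPipeline` classes are invented stand-ins for illustration; they only mimic the attribute handling of the docs example:

```python
class FakeSpider(object):
    def __init__(self, name):
        self.name = name


class FakeExporter(object):
    """Stands in for XmlItemExporter; records which items it receives."""
    def __init__(self, name):
        self.name = name
        self.items = []

    def export_item(self, item):
        self.items.append(item)


class BuggyPipeline(object):
    """Mirrors the docs example: one shared ``exporter`` attribute."""
    def spider_opened(self, spider):
        # overwrites whatever exporter a previously opened spider set up
        self.exporter = FakeExporter(spider.name)

    def process_item(self, item, spider):
        self.exporter.export_item(item)


pipeline = BuggyPipeline()
a, b = FakeSpider('a'), FakeSpider('b')

pipeline.spider_opened(a)
exporter_a = pipeline.exporter
pipeline.spider_opened(b)            # second open clobbers spider a's exporter
pipeline.process_item('item-from-a', a)

print(pipeline.exporter.name)        # b
print(exporter_a.items)              # [] -- spider a's exporter never sees the item
```

The item meant for spider `a` ends up in spider `b`'s exporter, which is exactly the problem described above.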

1 answer:

Answer 0 (score: 1)

I think this question would be better raised in the scrapy-users group.

AFAIK, Scrapy as of v0.14 does not support running more than one spider in the same process (related discussion), so the code works as-is. The obvious fix to support multiple spiders is to keep a dict of exporters keyed by spider:

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter


class XmlExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}
        self.exporters = {}

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        # one exporter per spider, keyed by the spider instance
        self.exporters[spider] = XmlItemExporter(file)
        self.exporters[spider].start_exporting()

    def spider_closed(self, spider):
        self.exporters[spider].finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporters[spider].export_item(item)
        return item
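For completeness, either version of the pipeline still has to be enabled in the project's settings. In Scrapy 0.14, `ITEM_PIPELINES` was a plain list of class paths; the `myproject.pipelines` module path below is a hypothetical example:

```python
# settings.py -- module path is hypothetical, adjust to your project
ITEM_PIPELINES = ['myproject.pipelines.XmlExportPipeline']
```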