Filter duplicate entries when exporting items in append mode with the CSV export in Scrapy

Date: 2018-03-23 06:03:43

Tags: python scrapy export-to-csv scrapy-pipeline

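I am trying to work out how to check beforehand whether an item already exists among the rows of the CSV file I am exporting to. If the item does not exist, it should be appended; otherwise it should be dropped. So far I have done the following in the item pipeline, but it does not work: the item is appended to the CSV file regardless.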

My pipelines.py:

from scrapy import signals
from scrapy.exporters import CsvItemExporter  # scrapy.contrib.exporter is the old, deprecated path
import csv

class BlogscrapePipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):

        file = open('%s_items.csv' % spider.name, 'a+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['Title','Link','Comments','Words']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):

        with open('%s_items.csv' % spider.name, 'rt',encoding='utf-8') as file:
            reader=csv.reader(file)
            for row in reader:
                if item not in row:
                    self.exporter.export_item(item)
                    return item

items.py:

import scrapy

class BlogscrapeItem(scrapy.Item):

    Title=scrapy.Field()
    Link=scrapy.Field()
    Comments=scrapy.Field()
    Words=scrapy.Field()

2 Answers:

Answer 0 (score: 1)

Using an item pipeline is the best way to filter duplicate items:

from scrapy.exceptions import DropItem

class FilterDuplicateItemsPipeline(object):

    items = set()
    configured = False

    def process_item(self, item, spider):
        if not self.configured:
            # TODO:
            # Extract items from previous csv
            # Add each item to the self.items
            self.configured = True

        if item not in self.items:
            self.items.add(item)
            return item
        else:
            raise DropItem('Duplicate item %s' % item)
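
The TODO in the pipeline above could be filled in along these lines. This is a minimal sketch rather than the answerer's code: the helper name `load_seen_links` is hypothetical, and it assumes the CSV columns follow the `fields_to_export` order from the question (`Title`, `Link`, `Comments`, `Words`), so that `Link` sits in column 1 and can serve as the unique key.

```python
import csv
import os


def load_seen_links(path, link_column=1):
    """Return the set of Link values already present in the CSV at `path`.

    Hypothetical helper: assumes `Link` is stored in column `link_column`
    and uniquely identifies an item.
    """
    seen = set()
    if not os.path.exists(path):
        # First run: nothing has been exported yet.
        return seen
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f):
            # Skip blank or short rows; a header row only adds the
            # harmless literal "Link" to the set.
            if len(row) > link_column:
                seen.add(row[link_column])
    return seen
```

Inside `process_item`, a call such as `self.items = load_seen_links('%s_items.csv' % spider.name)` would take the place of the TODO, and the membership test would become `item['Link'] not in self.items`. Comparing on a single field avoids depending on how `Item` objects hash and compare.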

You also have to add it to the item pipeline list in settings.py:

ITEM_PIPELINES = {
    '{your_path}.FilterDuplicateItemsPipeline': 500,
}
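
Ordering matters if the filter is combined with the exporting pipeline from the question. Scrapy runs pipelines from the lowest number to the highest, and an item that raises `DropItem` is not passed on to later pipelines, so the filter needs the lower value. A sketch, assuming both classes live in a hypothetical `blogscrape.pipelines` module:

```python
# settings.py (sketch; the module path `blogscrape.pipelines` is an assumption)
ITEM_PIPELINES = {
    # Lower numbers run first: drop duplicates before anything is exported.
    'blogscrape.pipelines.FilterDuplicateItemsPipeline': 300,
    # Items that survive the filter are appended to the CSV here.
    'blogscrape.pipelines.BlogscrapePipeline': 500,
}
```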

Edit: this answer is not a good solution. Please read the comments below.

Answer 1 (score: 0)

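Your test looks incorrect: for example, if it sees that the item is not in the first row, it exports the item even when the item is in the second row. You must export only after checking all the rows. Also consider saving the elements in a set before exporting and checking new items against that set, which I think is more efficient than going through the file.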
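
Both points can be sketched as follows. This is one possible reading of the suggestion and has not been run against a real crawl: `row_has_link` and `DuplicateLinkFilter` are hypothetical names, `Link` is assumed to be the unique key stored in column 1, and the `DropItem` fallback exists only so the snippet runs without Scrapy installed.

```python
import csv

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass


def row_has_link(path, link):
    """File-based check: scan every row before deciding."""
    with open(path, newline='', encoding='utf-8') as f:
        return any(len(row) > 1 and row[1] == link for row in csv.reader(f))


class DuplicateLinkFilter:
    """Set-based check: remember each Link seen so far and drop repeats."""

    def __init__(self, seen=()):
        # `seen` can be seeded with links read from the previous CSV.
        self.seen = set(seen)

    def process_item(self, item, spider):
        link = item['Link']
        if link in self.seen:
            raise DropItem('Duplicate item %s' % link)
        self.seen.add(link)
        return item
```

The file-based version rereads the whole CSV for every scraped item, whereas the set lookup takes constant time on average, which is the efficiency gain mentioned above.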