I'm trying to figure out how to check beforehand whether an item already exists in the rows of the CSV file being exported. If the item does not exist, it should be appended; otherwise it should be discarded. So far I have implemented the check in my item pipeline, but it doesn't work: items get appended to the CSV file regardless.
My pipelines.py:
from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter
import csv


class BlogscrapePipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'a+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['Title', 'Link', 'Comments', 'Words']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        with open('%s_items.csv' % spider.name, 'rt', encoding='utf-8') as file:
            reader = csv.reader(file)
            for row in reader:
                if item not in row:
                    self.exporter.export_item(item)
        return item
items.py:
import scrapy


class BlogscrapeItem(scrapy.Item):
    Title = scrapy.Field()
    Link = scrapy.Field()
    Comments = scrapy.Field()
    Words = scrapy.Field()
Answer 0 (score: 1)
An item pipeline is the best way to filter out duplicate items:
from scrapy.exceptions import DropItem


class FilterDuplicateItemsPipeline(object):
    items = set()
    configured = False

    def process_item(self, item, spider):
        if not self.configured:
            # TODO:
            # Extract items from previous csv
            # Add each item to the self.items
            self.configured = True
        if item not in self.items:
            self.items.add(item)
            return item
        else:
            raise DropItem('Duplicate item %s' % item)
You also have to add it to the item pipelines list in settings.py:
ITEM_PIPELINES = {
    '{your_path}.FilterDuplicateItemsPipeline': 500,
}
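One possible way to fill in the TODO above is to seed the set from the previously exported CSV. This is only a sketch under assumptions: the file path and field names are taken from the question's exporter, and rows are keyed as tuples because Scrapy items (dict-like objects) are not hashable and cannot go into a set directly:

```python
import csv
import os

# Assumed field names, matching the question's fields_to_export.
FIELDS = ['Title', 'Link', 'Comments', 'Words']


def load_seen_keys(csv_path):
    """Read an existing export and return a set of hashable row keys."""
    seen = set()
    if not os.path.exists(csv_path):
        return seen  # no previous export yet
    with open(csv_path, newline='', encoding='utf-8') as f:
        # DictReader consumes the header line written by CsvItemExporter.
        for row in csv.DictReader(f):
            # A tuple of field values is hashable, unlike the item itself.
            seen.add(tuple(row.get(field) for field in FIELDS))
    return seen
```

In process_item you would then compare `tuple(item.get(f) for f in FIELDS)` against this set rather than the item object itself.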
Edit: this is not a good solution. Please read the comments below.
Answer 1 (score: 0)
Your test looks incorrect: if the item is not in the first row, it gets exported even when it does appear in a later row. You should only export after all rows have been checked. Also consider saving elements in a set before exporting and checking new items against the set, which I think is much more efficient than re-reading the file every time.
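Putting that suggestion together, a set-based deduplicating pipeline could be sketched as follows. This is an illustration, not the answerer's code: the field list is assumed from the question, items are keyed as tuples of field values (Scrapy items are not hashable), and a stand-in DropItem class replaces `scrapy.exceptions.DropItem` so the sketch runs without Scrapy installed:

```python
# Assumed field names, matching the question's fields_to_export.
FIELDS = ['Title', 'Link', 'Comments', 'Words']


class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem; import the real one in a project."""


class SetBasedDedupPipeline(object):
    def __init__(self):
        # Seen keys: tuples of field values, which are hashable.
        self.seen = set()

    def process_item(self, item, spider):
        key = tuple(item.get(field) for field in FIELDS)
        if key in self.seen:
            raise DropItem('Duplicate item %s' % item)
        self.seen.add(key)
        return item
```

Because membership tests on a set are O(1) on average, this avoids scanning the whole CSV file for every scraped item.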