I store some results in a .json file in this format:
(one item per line)
{"category": ["ctg1"], "pages": 3, "websites": ["x1.com","x2.com","x5.com"]}
{"category": ["ctg2"], "pages": 2, "websites": ["x1.com", "d4.com"]}
...
I tried to remove the duplicate values without removing the whole item, but without success.
Code:
import scrapy
import json
import codecs
from scrapy.exceptions import DropItem

class ResultPipeline(object):

    def __init__(self):
        self.ids_seen = set()
        self.file = codecs.open('results.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        for sites in item['websites']:
            if sites in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % sites)
            else:
                self.ids_seen.add(sites)
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.file.close()
Answer 0: (score 1)
Instead of dropping the duplicate item, just rebuild the list of websites that are not already in ids_seen. The sample code below should work, although it is not in your class structure.
import json

line1 = '{"category": ["ctg1"], "pages": 3, "websites": ["x1.com","x2.com","x5.com"]}'
line2 = '{"category": ["ctg2"], "pages": 2, "websites": ["x1.com", "d4.com"]}'
lines = (line1, line2)

ids_seen = set()

def process_item(item):
    item_unique_sites = []
    for site in item['websites']:
        if site not in ids_seen:
            ids_seen.add(site)
            item_unique_sites.append(site)
    # Replace the websites list with the duplicates removed
    item['websites'] = item_unique_sites
    line = json.dumps(dict(item), ensure_ascii=False) + "\n"
    print(line)
    #self.file.write(line)
    return item

for line in lines:
    json_data = json.loads(line)
    process_item(json_data)
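For completeness, here is a minimal sketch of how that logic could fit back into your ResultPipeline class. It uses Scrapy's standard open_spider/close_spider pipeline hooks instead of spider_closed (which is only called if you connect it to the spider_closed signal yourself). Dropping an item only when every one of its websites has already been seen is an assumption about the behavior you want, not something from the answer above.

import json
import codecs
from scrapy.exceptions import DropItem

class ResultPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def open_spider(self, spider):
        # Called automatically by Scrapy when the spider starts
        self.file = codecs.open('results.json', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called automatically by Scrapy when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Keep only the websites that have not been seen before
        unique_sites = []
        for site in item['websites']:
            if site not in self.ids_seen:
                self.ids_seen.add(site)
                unique_sites.append(site)
        item['websites'] = unique_sites
        # Assumption: drop the item entirely only when no websites remain
        if not unique_sites:
            raise DropItem("All websites already seen: %s" % item)
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

Enable the pipeline through ITEM_PIPELINES in settings.py as usual.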