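How do I remove duplicates from multiple JSON files?

Asked: 2017-10-17 10:47:48

Tags: json python-3.x

I have multiple JSON files containing capitals and countries. How do I remove the duplicate key-value pairs from all of the files?

Here is one of the JSON files: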

{
    "data": [
    {
        "Capital": "Berlin",
        "Country": "Germany"
    },
    {
        "Capital": "New Delhi",
        "Country": "India"
    },
    {
        "Capital": "Canberra",
        "Country": "Australia"
    },
    {
        "Capital": "Beijing.",
        "Country": "China"
    },
    {
        "Capital": "Tokyo",
        "Country": "Japan"
    },
    {
        "Capital": "Tokyo",
        "Country": "Japan"
    },
    {
        "Capital": "Berlin",
        "Country": "Germany"
    },
    {
        "Capital": "Moscow",
        "Country": "Russia"
    },
    {
        "Capital": "New Delhi",
        "Country": "India"
    },
    {
        "Capital": "Ottawa",
        "Country": "Canada"
    }
    ]

}

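There are many JSON files like this one that contain duplicate items. How can I remove the duplicates and keep only the first occurrence of each? I tried the following, but it did not work: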

dupes = []
for f in json_files:
    with open(f) as json_data:
        nations = json.load(json_data)['data']
        #takes care of duplicates and stores it in dupes
        dupes.append(x for x in nations if x['Capital'] in seen or seen.add(x['Capital']))
        nations = [x for x in nations if x not in dupes] #want to keep the first occurrence of the item present in dupes

    with open(f, 'w') as json_data:
        json.dump({'data': nations}, json_data)

3 Answers:

Answer 0 (score: 2)

You might not get to use a cool list comprehension, but a regular loop should work:

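used_nations = set()   # {} would create a dict, which has no .add()
unique_nations = []
for nation in nations:
    # Build a new list instead of calling nations.remove() while iterating,
    # which would skip elements and could drop the first occurrence.
    if nation['Capital'] not in used_nations:
        used_nations.add(nation['Capital'])
        unique_nations.append(nation)
nations = unique_nations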
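
For comparison, a comprehension can be made to work as well, at some cost in readability. The sketch below is my reading of what the question's attempt was aiming for, not the asker's actual intent; it relies on `set.add()` returning `None`, and the shortened `nations` list is sample data taken from the question.

```python
nations = [
    {"Capital": "Berlin", "Country": "Germany"},
    {"Capital": "Tokyo", "Country": "Japan"},
    {"Capital": "Tokyo", "Country": "Japan"},
    {"Capital": "Berlin", "Country": "Germany"},
    {"Capital": "Moscow", "Country": "Russia"},
]

seen = set()
# seen.add() returns None (falsy), so for a new capital the `or` clause
# records it and the item passes; an already-seen capital is filtered out.
unique = [
    n for n in nations
    if not (n["Capital"] in seen or seen.add(n["Capital"]))
]

print([n["Capital"] for n in unique])  # ['Berlin', 'Tokyo', 'Moscow']
```

The side effect hidden inside the condition is easy to misread, which is a fair argument for the plain loop.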

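Answer 1 (score: 1)

List comprehensions are great! But... they make the code harder to follow once if statements are involved.

That is by no means a rule of thumb. On the contrary, I encourage you to use list comprehensions often. In this particular case, though, a more spread-out solution is more readable.

My suggestion is: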

import json

seen = set()   # capitals encountered so far (a set gives fast lookups)
result = []

with open('data.json') as json_data:
    nations = json.load(json_data)['data']
    # keep only the first item found for each capital
    for item in nations:
        if item['Capital'] not in seen:
            seen.add(item['Capital'])
            result.append(item)

with open('data.no_dup.json', 'w') as json_data:
    json.dump({'data': result}, json_data)

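Tested and working on Python 3.5.2.

Note that I removed the outer loop over the files for convenience.

Answer 2 (score: 0)

Here is sample code showing how to do this for the JSON you posted: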
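
If the outer loop is needed again, one way to put it back is sketched below. This is only a sketch under a few assumptions: `dedupe_files` and its `across_files` flag are hypothetical names, every file is assumed to have the `{"data": [...]}` layout from the question, and each file is overwritten in place as in the question's attempt rather than written to a separate output file.

```python
import json


def dedupe_files(paths, across_files=False):
    """Rewrite each JSON file in place, keeping the first item per capital.

    Hypothetical helper: with across_files=True a capital seen in an
    earlier file is also dropped from every later file.
    """
    seen = set()
    for path in paths:
        if not across_files:
            seen = set()  # start fresh so each file is handled on its own
        with open(path) as fp:
            nations = json.load(fp)["data"]
        result = []
        for item in nations:
            if item["Capital"] not in seen:
                seen.add(item["Capital"])
                result.append(item)
        # The file is reopened for writing only after reading has finished.
        with open(path, "w") as fp:
            json.dump({"data": result}, fp, indent=4)
```

Calling `dedupe_files(json_files)` cleans each file separately, while `dedupe_files(json_files, across_files=True)` treats all files as one pool, which may be closer to "from all files" in the question.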

import json

files = ['countries.json']

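for f in files:
    with open(f, 'r') as fp:
        nations = json.load(fp)
    # Turn each dict into a hashable tuple of sorted (key, value) pairs so a
    # set can drop exact duplicates; note that a set does not preserve order.
    result = [dict(tupleized) for tupleized in
              set(tuple(sorted(item.items())) for item in nations['data'])]
# print() with a single argument works on both Python 2 and Python 3
print(result)
print(len(result))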

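Output (from Python 2, hence the u'' prefixes; the order is arbitrary because the items pass through a set):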

[{u'Country': u'Russia', u'Capital': u'Moscow'}, {u'Country': u'Japan', u'Capital': u'Tokyo'}, {u'Country': u'Canada', u'Capital': u'Ottawa'}, {u'Country': u'India', u'Capital': u'New Delhi'}, {u'Country': u'Germany', u'Capital': u'Berlin'}, {u'Country': u'Australia', u'Capital': u'Canberra'}, {u'Country': u'China', u'Capital': u'Beijing.'}]
7
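
Because a set has no defined order, the approach above does not keep the items in their original sequence. A minimal order-preserving variant, assuming every value in each record is hashable (plain strings here) and using a shortened sample list from the question, might look like this:

```python
nations = [
    {"Capital": "Berlin", "Country": "Germany"},
    {"Capital": "Tokyo", "Country": "Japan"},
    {"Capital": "Tokyo", "Country": "Japan"},
    {"Capital": "Berlin", "Country": "Germany"},
    {"Capital": "Ottawa", "Country": "Canada"},
]

seen = set()
result = []
for item in nations:
    # Sorted (key, value) pairs give a hashable fingerprint of the whole
    # record that does not depend on the order of keys inside the dict.
    key = tuple(sorted(item.items()))
    if key not in seen:
        seen.add(key)
        result.append(item)

print([n["Capital"] for n in result])  # ['Berlin', 'Tokyo', 'Ottawa']
print(len(result))                     # 3
```

Unlike deduplicating on `Capital` alone, this compares the complete key-value pairs, so two records with the same capital but different countries would both be kept.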