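Removing duplicate JSON objects from a file

Date: 2015-03-09 23:10:05

Tags: python

I have a JSON file from which I am trying to remove duplicate JSON objects. A sample of the file and my approach are shown below.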

{"published": "Tue, 03 Mar 2015 11:39:11 GMT", "title": "Goat Goat"}
{"published": "Tue, 03 Mar 2015 11:39:11 GMT", "title": "Goat Goat"}
{"published": "Tue, 03 Mar 2015 11:24:15 GMT", "title": "Cat cat"}
{"published": "Tue, 03 Mar 2015 11:19:29 GMT", "title": "Chicken Chicken"}
{"published": "Tue, 03 Mar 2015 11:19:29 GMT", "title": "Chicken Chicken"}
{"published": "Tue, 03 Mar 2015 10:50:15 GMT", "title": "Dog Dog"}
{"published": "Tue, 03 Mar 2015 10:34:45 GMT", "title": "Cat cat"}

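My approach is a script that initializes an empty list, reads each line (object) of the file, checks whether its title has been seen before, and writes the objects with unique titles to a new JSON file.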

from sys import argv

script, input_file, output_file  = argv

input_file = open(input_file)

output_file = open(output_file, 'a')

unique = []

while True:
    A = input_file.readline()
    if A['title'] not in unique:
        unique.append(A['title'])
        output_file.write(A)

However, I get the following error message:

Traceback (most recent call last):
  File "test_run.py", line 13, in <module>
    if A['title'] not in unique:
TypeError: string indices must be integers, not str

I'm new to Python, so I would appreciate any ideas.

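2 answers:

Answer 0 (score: 1)

You can use the title as the key of a dict and rely on the fact that dictionary keys are unique, like the members of a set: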

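#!/usr/bin/env python
import json

with open('your_json.json') as f:
    # parse every line into a dict while the file is still open
    # (a lazy map() would be consumed after the file is closed on Python 3)
    jsons = [json.loads(line) for line in f]

# keyed by title: a later object replaces an earlier one with the same title
uniques = {x['title']: x for x in jsons}

# write to new json file
with open('new_file.json', 'w') as nf:
    # list(...) is needed on Python 3, where values() returns a view
    json.dump(list(uniques.values()), nf)

print(list(uniques.values()))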
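One detail of the dict approach is worth spelling out: a dict comprehension keeps the last object seen for each title, not the first. The sketch below uses the two "Cat cat" lines from the question to show the difference; the names `last_wins` and `first_wins` are only illustrative.

```python
import json

# The two "Cat cat" lines from the sample file; they differ in "published".
lines = [
    '{"published": "Tue, 03 Mar 2015 11:24:15 GMT", "title": "Cat cat"}',
    '{"published": "Tue, 03 Mar 2015 10:34:45 GMT", "title": "Cat cat"}',
]
records = [json.loads(line) for line in lines]

# Dict comprehension: each later record overwrites the earlier one.
last_wins = {r['title']: r for r in records}

# setdefault() only stores a record if the title is not present yet,
# so the first record for each title is the one that survives.
first_wins = {}
for r in records:
    first_wins.setdefault(r['title'], r)
```

If it matters which duplicate is kept, the `setdefault` variant (or an explicit set of seen titles) keeps the earliest one.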

Or you can do it more directly with a set of the titles seen so far:

#!/usr/bin/env python
import json

with open('your_json.json') as f:
    # parse every line into a dict while the file is still open
    jsons = [json.loads(line) for line in f]

result = []
items_set = set()

for js in jsons:
    # only add unseen items (referring to 'title' as key)
    if js['title'] not in items_set:
        # mark as seen
        items_set.add(js['title'])
        # add to results; the first object for each title is kept
        result.append(js)

# write to new json file
with open('new_file.json', 'w') as nf:
    json.dump(result, nf)

print(result)

Output:

[{u'title': u'Goat Goat', u'published': u'Tue, 03 Mar 2015 11:39:11 GMT'}, {u'title': u'Cat cat', u'published': u'Tue, 03 Mar 2015 11:24:15 GMT'}, {u'title': u'Chicken Chicken', u'published': u'Tue, 03 Mar 2015 11:19:29 GMT'}, {u'title': u'Dog Dog', u'published': u'Tue, 03 Mar 2015 10:50:15 GMT'}]

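Note: this serializes the result as a single JSON list rather than one object per line as in the original file. To write one object per line instead, you can use: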

# write to new json file
with open('new_file.json' ,'w') as nf:
    for js in uniques.values():
        nf.write(json.dumps(js))
        nf.write('\n')
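Putting the pieces together, here is a minimal sketch of the whole task as one function: read the file line by line, keep the first object for each title, and write one object per line. The function name `dedupe_json_lines` and its `key` parameter are hypothetical, chosen for illustration, and the sketch assumes every non-blank line is a complete JSON object that contains the key.

```python
import json

def dedupe_json_lines(in_path, out_path, key='title'):
    """Copy in_path to out_path, keeping the first object seen per key value.

    Returns the number of objects written. (Hypothetical helper.)
    """
    seen = set()
    kept = 0
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            obj = json.loads(line)
            if obj[key] in seen:
                continue  # a duplicate of something already written
            seen.add(obj[key])
            dst.write(json.dumps(obj) + '\n')
            kept += 1
    return kept
```

Called from a script like the one in the question, this could be `dedupe_json_lines(argv[1], argv[2])`. Processing one line at a time also means only the set of titles is held in memory, not the whole file.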

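Answer 1 (score: 1)

You need to use the json library. A line read from the file is a plain string, which is why indexing it with 'title' fails. Instead of simply reading the file, use: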

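import json

with open(input_file, 'r') as infile:
    # the file holds one JSON object per line, so parse it line by line;
    # json.load(infile) would reject it as more than one JSON document
    A = [json.loads(line) for line in infile]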

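That should solve this problem. However, your code has a few other issues.

Why are you using while True? The loop never ends on its own: readline() returns an empty string at the end of the file instead of raising, so the script only stops when processing that empty string throws an exception. Iterate over the file object instead (for line in input_file).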
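As a small illustration of the difference, the sketch below reads two lines with a for loop and then shows what readline() returns at the end of the file. It uses `io.StringIO` as a stand-in for an open file so that it runs without any file on disk; the variable names are illustrative.

```python
import io
import json

# Stand-in for an open text file; io.StringIO iterates line by line too.
fake_file = io.StringIO(
    '{"title": "Goat Goat"}\n'
    '{"title": "Cat cat"}\n'
)

titles = []
for line in fake_file:              # the loop stops by itself at end of file
    record = json.loads(line)       # line is a str, record is a dict
    titles.append(record['title'])

# At end of file, readline() returns '' rather than raising an exception.
at_eof = fake_file.readline()
```

With `while True` and `readline()`, the script would have to test for that empty string and `break` explicitly.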

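Rather than tracking the seen titles by hand, you can convert the list to a set to guarantee uniqueness. Note that this gives uniqueness across all fields, not just the title. Set members must be hashable and dicts are not, so build the set from a hashable form of each object (this works while every value is itself hashable, as the strings here are): unique = set(tuple(sorted(obj.items())) for obj in A)

Finally, you have to use the json library to write the result to the output file: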
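To make "uniqueness across all fields" concrete, here is one possible sketch that deduplicates whole objects while preserving their order. It uses `json.dumps(..., sort_keys=True)` as a hashable fingerprint of each object; that choice is an assumption of this sketch rather than something the answer prescribes, and the sample records are adapted from the question.

```python
import json

records = [
    {"published": "Tue, 03 Mar 2015 11:39:11 GMT", "title": "Goat Goat"},
    # same content as above, keys in a different order
    {"title": "Goat Goat", "published": "Tue, 03 Mar 2015 11:39:11 GMT"},
    {"published": "Tue, 03 Mar 2015 11:24:15 GMT", "title": "Cat cat"},
    # same title as above, but a different "published" value
    {"published": "Tue, 03 Mar 2015 10:34:45 GMT", "title": "Cat cat"},
]

seen = set()
unique_records = []
for rec in records:
    # sort_keys=True makes the string independent of key order
    fingerprint = json.dumps(rec, sort_keys=True)
    if fingerprint not in seen:
        seen.add(fingerprint)
        unique_records.append(rec)
```

Both "Cat cat" objects survive here because they differ in one field, which is the difference from deduplicating on the title alone.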

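with open(output_file, 'w') as outfile:
    # result is the list of deduplicated objects; a set has to be turned
    # into a list of dicts first, because json.dump cannot serialize sets
    json.dump(result, outfile)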