Question

I have a JSON file from which I'm trying to remove duplicate JSON objects. A sample of the file and my approach are below.

{"published": "Tue, 03 Mar 2015 11:39:11 GMT", "title": "Goat Goat"}
{"published": "Tue, 03 Mar 2015 11:39:11 GMT", "title": "Goat Goat"}
{"published": "Tue, 03 Mar 2015 11:24:15 GMT", "title": "Cat cat"}
{"published": "Tue, 03 Mar 2015 11:19:29 GMT", "title": "Chicken Chicken"}
{"published": "Tue, 03 Mar 2015 11:19:29 GMT", "title": "Chicken Chicken"}
{"published": "Tue, 03 Mar 2015 10:50:15 GMT", "title": "Dog Dog"}
{"published": "Tue, 03 Mar 2015 10:34:45 GMT", "title": "Cat cat"}

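My approach is a script that initializes an empty list, reads each line (object) of the file, checks for a unique title, and writes the unique objects to a new JSON file.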

from sys import argv

script, input_file, output_file  = argv

input_file = open(input_file)

output_file = open(output_file, 'a')

unique = []

while True:
    A = input_file.readline()
    if A['title'] not in unique:
        unique.append(A['title'])
        output_file.write(A)

However, I get the following error message:

Traceback (most recent call last):
  File "test_run.py", line 13, in <module>
    if A['title'] not in unique:
TypeError: string indices must be integers, not str

I'm new to Python, so I'd appreciate any ideas.

Answer 1

You can use the title as the key in a dict, relying on the fact that dictionary keys are unique, like a set:

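#!/usr/bin/env python
import json

with open('your_json.json') as f:
    # parse each line into a dict while the file is still open
    jsons = [json.loads(line) for line in f]

# keys are unique, so only one object per title survives (the last one seen)
uniques = {x['title']: x for x in jsons}

# write to new json file
with open('new_file.json', 'w') as nf:
    # list(...) is needed on Python 3, where values() returns a view
    json.dump(list(uniques.values()), nf)

print(list(uniques.values()))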
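One detail of the dict approach is easy to miss: when two objects share a title, the later one overwrites the earlier one, so the last occurrence is the one that is kept. A minimal sketch, using a hypothetical `lines` list built from the question's sample data instead of reading a file:

```python
import json

# Sample lines taken from the question's data (one JSON object per line).
lines = [
    '{"published": "Tue, 03 Mar 2015 11:24:15 GMT", "title": "Cat cat"}',
    '{"published": "Tue, 03 Mar 2015 10:50:15 GMT", "title": "Dog Dog"}',
    '{"published": "Tue, 03 Mar 2015 10:34:45 GMT", "title": "Cat cat"}',
]

jsons = [json.loads(line) for line in lines]

# Later objects overwrite earlier ones that have the same title.
uniques = {x['title']: x for x in jsons}

print(uniques['Cat cat']['published'])  # Tue, 03 Mar 2015 10:34:45 GMT
print(len(uniques))                     # 2
```

If the first occurrence should win instead, a seen-set loop is the more natural fit.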

Or you can do it more directly with json and a set:

#!/usr/bin/env python
import json

with open('your_json.json') as f:
    # parse each line into a dict while the file is still open
    jsons = [json.loads(line) for line in f]

result = []
items_set = set()

for js in jsons:
    # only add unseen items (referring to 'title' as key)
    if js['title'] not in items_set:
        # mark as seen
        items_set.add(js['title'])
        # add to results
        result.append(js)

# write to new json file
with open('new_file.json', 'w') as nf:
    json.dump(result, nf)

print(result)

Output (as printed by Python 2):

[{u'title': u'Goat Goat', u'published': u'Tue, 03 Mar 2015 11:39:11 GMT'}, {u'title': u'Cat cat', u'published': u'Tue, 03 Mar 2015 11:24:15 GMT'}, {u'title': u'Chicken Chicken', u'published': u'Tue, 03 Mar 2015 11:19:29 GMT'}, {u'title': u'Dog Dog', u'published': u'Tue, 03 Mar 2015 10:50:15 GMT'}]

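Note: this serializes the result as a single JSON array, rather than one object per line as in the original file. To keep the line-by-line format, you can use: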

# write to new json file, one object per line
with open('new_file.json', 'w') as nf:
    for js in uniques.values():
        nf.write(json.dumps(js))
        nf.write('\n')
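The reading, filtering, and line-by-line writing steps can also be combined into one pass. The following is only a sketch: the function name `dedupe_by_title` is made up for this example, and it assumes every line holds one JSON object with a `title` key. It keeps the first object seen for each title and preserves the input order.

```python
import json


def dedupe_by_title(in_path, out_path):
    """Copy in_path to out_path, keeping the first object seen for each title.

    Returns the number of objects written.
    """
    seen = set()
    kept = 0
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            obj = json.loads(line)
            if obj['title'] not in seen:
                seen.add(obj['title'])
                dst.write(json.dumps(obj) + '\n')
                kept += 1
    return kept
```

It could be called as `dedupe_by_title('your_json.json', 'new_file.json')`. Because it never holds more than the set of titles in memory, it also suits files too large to load at once.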

Answer 2

You need to use the json library. Instead of simply reading the file, use:

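import json

with open(input_file, 'r') as infile:
    # the file holds one JSON object per line, so parse it line by line
    # (json.load on the whole file would fail after the first object)
    A = [json.loads(line) for line in infile]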

That should fix the error. However, your code has some other problems.

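Why are you using while True? The loop has no exit condition: once input_file runs out of lines, readline() returns an empty string and the code raises an exception instead of terminating cleanly. Iterating over the file directly (for line in input_file) stops at the end of the file.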
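To see the difference, here is a small sketch that uses `io.StringIO` as a stand-in for the opened input file (an assumption made purely for illustration). `readline()` returns an empty string at end of file rather than raising, while a `for` loop over the file ends on its own.

```python
import io

# io.StringIO stands in for the opened input file (illustration only).
f = io.StringIO('{"title": "Goat Goat"}\n{"title": "Cat cat"}\n')

f.readline()           # first line
f.readline()           # second line
at_eof = f.readline()  # no lines left: returns '' instead of raising

f.seek(0)
lines = [line for line in f]  # iteration stops by itself at end of file

print(repr(at_eof))  # ''
print(len(lines))    # 2
```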

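Instead, you can guarantee uniqueness by converting the list to a set. Note that this gives uniqueness across all fields, not just the title. Dicts are not hashable, so the set has to hold a hashable form of each object, for example its serialized string: unique = set(json.dumps(obj, sort_keys=True) for obj in A)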

Finally, you have to use the json library to write the result to the output file:

with open(output_file, 'w') as outfile:
    # result is the list of de-duplicated objects
    json.dump(result, outfile)
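Putting this answer's steps together, here is a hedged sketch of de-duplicating on all fields. It assumes one JSON object per line, as in the question. The function name `dedupe_all_fields` and the use of `json.dumps(..., sort_keys=True)` as a hashable key are choices made for this example, not something the question prescribes.

```python
import json


def dedupe_all_fields(lines):
    """Parse an iterable of JSON lines and drop exact duplicate objects."""
    seen = set()
    result = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        obj = json.loads(line)
        # dicts are not hashable, so use a canonical string form as the set key
        key = json.dumps(obj, sort_keys=True)
        if key not in seen:
            seen.add(key)
            result.append(obj)
    return result
```

With the question's sample data this keeps five objects: both "Cat cat" entries survive because their published values differ. The returned list can be passed straight to `json.dump`.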

Remove duplicate JSON objects from a file
