I have a JSON file from which I am trying to remove duplicate JSON objects. A sample of the file and my approach are shown below.
{"published": "Tue, 03 Mar 2015 11:39:11 GMT", "title": "Goat Goat"}
{"published": "Tue, 03 Mar 2015 11:39:11 GMT", "title": "Goat Goat"}
{"published": "Tue, 03 Mar 2015 11:24:15 GMT", "title": "Cat cat"}
{"published": "Tue, 03 Mar 2015 11:19:29 GMT", "title": "Chicken Chicken"}
{"published": "Tue, 03 Mar 2015 11:19:29 GMT", "title": "Chicken Chicken"}
{"published": "Tue, 03 Mar 2015 10:50:15 GMT", "title": "Dog Dog"}
{"published": "Tue, 03 Mar 2015 10:34:45 GMT", "title": "Cat cat"}
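My approach is a script that initializes an empty list, reads each line (object) of the file, checks whether its title has been seen before, and writes the unique objects to a new JSON file.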
from sys import argv
script, input_file, output_file = argv
input_file = open(input_file)
output_file = open(output_file, 'a')
unique = []
while True:
    A = input_file.readline()
    if A['title'] not in unique:
        unique.append(A['title'])
        output_file.write(A)
However, I get the following error message:
Traceback (most recent call last):
  File "test_run.py", line 13, in <module>
    if A['title'] not in unique:
TypeError: string indices must be integers, not str
I am new to Python, so I would appreciate any ideas.
Answer 0 (score: 1)
You can use the title as the key in a dict and rely on the fact that dictionary keys are unique, like the elements of a set:
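#!/usr/bin/env python
import json

with open('your_json.json') as f:
    # parse each line into a dictionary; list() forces the read while f is open
    jsons = list(map(json.loads, f))

# index by title: a later object replaces an earlier one with the same title
uniques = {x['title']: x for x in jsons}

# write to new json file
with open('new_file.json', 'w') as nf:
    # list() is needed on Python 3, where values() is a view json cannot serialize
    json.dump(list(uniques.values()), nf)

print(list(uniques.values()))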
Alternatively, you can do it more directly with json and a set:
#!/usr/bin/env python
import json

with open('your_json.json') as f:
    # parse each line into a dictionary; list() forces the read while f is open
    jsons = list(map(json.loads, f))

result = list()
items_set = set()

for js in jsons:
    # only add unseen items (referring to 'title' as key)
    if js['title'] not in items_set:
        # mark as seen
        items_set.add(js['title'])
        # add to results
        result.append(js)

# write to new json file
with open('new_file.json', 'w') as nf:
    json.dump(result, nf)

print(result)
Output:
[{u'title': u'Goat Goat', u'published': u'Tue, 03 Mar 2015 11:39:11 GMT'}, {u'title': u'Cat cat', u'published': u'Tue, 03 Mar 2015 11:24:15 GMT'}, {u'title': u'Chicken Chicken', u'published': u'Tue, 03 Mar 2015 11:19:29 GMT'}, {u'title': u'Dog Dog', u'published': u'Tue, 03 Mar 2015 10:50:15 GMT'}]
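Note: this serializes the result as a single JSON list rather than one object per line as in the original file. To write one object per line instead, you can use: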
# write to new json file
with open('new_file.json', 'w') as nf:
    for js in uniques.values():
        nf.write(json.dumps(js))
        nf.write('\n')
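One difference between the two approaches in this answer: the dict comprehension keeps the last object seen for each title, while the set-based loop keeps the first. The sample file contains two "Cat cat" entries with different timestamps, so the choice affects the result. A minimal sketch using those two sample lines (the names last_wins and first_wins are only illustrative):

```python
import json

# Two sample lines from the question that share a title but differ in 'published'.
lines = [
    '{"published": "Tue, 03 Mar 2015 11:24:15 GMT", "title": "Cat cat"}',
    '{"published": "Tue, 03 Mar 2015 10:34:45 GMT", "title": "Cat cat"}',
]
records = [json.loads(line) for line in lines]

# Dict comprehension: a later record overwrites an earlier one with the same title.
last_wins = {r['title']: r for r in records}

# Set-based loop: a record is kept only the first time its title is seen.
seen = set()
first_wins = []
for r in records:
    if r['title'] not in seen:
        seen.add(r['title'])
        first_wins.append(r)

print(last_wins['Cat cat']['published'])  # Tue, 03 Mar 2015 10:34:45 GMT
print(first_wins[0]['published'])         # Tue, 03 Mar 2015 11:24:15 GMT
```

If the earliest line in the file should win, the set-based loop is the one to use.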
Answer 1 (score: 1)
You need to use the json library. Rather than simply reading the file, use:
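import json
with open(input_file, 'r') as infile:
    # the file holds one JSON object per line, so parse it line by line;
    # json.load(infile) would reject it because of the extra objects
    A = [json.loads(line) for line in infile]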
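That should solve this problem. However, there are a few other issues with your code.
Why are you using while True? The loop has no exit condition: readline() returns an empty string at the end of the file rather than stopping, so once input_file has no more lines the loop either runs forever or fails when that empty string is processed.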
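Instead of tracking titles in a list, you can guarantee uniqueness by converting the objects to a set. Note that this gives uniqueness over all fields, not just the title, and that dicts are not hashable, so each object must first be converted to a hashable form such as a tuple of its sorted items: unique = set(tuple(sorted(d.items())) for d in A)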
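One way to get all-field uniqueness without putting dicts into a set directly (dicts are not hashable) is to use a canonical string form of each object as the set member. The following is only a sketch under that assumption; unique_objects is a hypothetical helper name, and it keeps the first occurrence of each object so the original order is preserved:

```python
import json

def unique_objects(objects):
    """Drop objects that are identical in every field, keeping first occurrences."""
    seen = set()
    result = []
    for obj in objects:
        # sort_keys=True gives equal dicts the same string regardless of key order
        key = json.dumps(obj, sort_keys=True)
        if key not in seen:
            seen.add(key)
            result.append(obj)
    return result

sample = [
    {"published": "Tue, 03 Mar 2015 11:39:11 GMT", "title": "Goat Goat"},
    {"title": "Goat Goat", "published": "Tue, 03 Mar 2015 11:39:11 GMT"},
    {"published": "Tue, 03 Mar 2015 11:24:15 GMT", "title": "Cat cat"},
    {"published": "Tue, 03 Mar 2015 10:34:45 GMT", "title": "Cat cat"},
]
print(len(unique_objects(sample)))  # 3
```

Both "Cat cat" entries survive here because their published fields differ, which is exactly the difference between all-field and title-only uniqueness.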
Finally, you must use the json library to write the deduplicated objects (called result here) to the output file:
with open(output_file, 'w') as outfile:
    json.dump(result, outfile)
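Putting the points in this answer together, a rough sketch of the whole script might look like the following. It assumes the input has one JSON object per line, as in the question's sample, and deduplicates on the title only; dedupe_file and its key parameter are hypothetical names introduced for illustration:

```python
import json

def dedupe_file(input_path, output_path, key='title'):
    """Copy one-object-per-line JSON, keeping the first object for each key value."""
    seen = set()
    with open(input_path) as infile, open(output_path, 'w') as outfile:
        # Iterating over the file stops at end of file, so no while True is needed.
        for line in infile:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            obj = json.loads(line)
            if obj[key] not in seen:
                seen.add(obj[key])
                # write one object per line, matching the input layout
                outfile.write(json.dumps(obj) + '\n')
```

With the two paths taken from argv as in the question, it would be called as dedupe_file(input_file, output_file). Opening the output with 'w' rather than 'a' avoids appending to the results of an earlier run.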