I have a list of dictionaries in which one of the values, name, contains duplicate data that I want to normalize. The list looks like this:
[
{'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8},
{'name': 'None on file', 'document_id': 40, 'annotation_id': 5},
{'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9},
{'name': 'Western Union', 'document_id': 61, 'annotation_id': 11}
]
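What I want to do is create a new list that contains only unique names, while still keeping track of the document_ids and annotation_ids. Sometimes the document_ids are the same, but I only need to track them as associated with the name. So the list above would become: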
[
{'name': 'Craig McKray', 'document_ids': [50], 'annotation_ids': [8, 9]},
{'name': 'None on file', 'document_ids': [40], 'annotation_ids': [5]},
{'name': 'Western Union', 'document_ids': [61], 'annotation_ids': [11]}
]
This is the code I have tried so far:
result = []
# resolve duplicate names
result_row = defaultdict(list)
for item in data:
    for double in data:
        if item['name'] == double['name']:
            result_row['name'] = item['name']
            result_row['record_ids'].append(item['document_id'])
            result_row['annotation_ids'].append(item['annotation_id'])
    result.append(result_row)
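The main problem with the code is that I compare items and find the duplicates, but when I iterate to the next item it finds the same duplicates again, which creates something like an endless loop. How can I edit the code so that it does not compare duplicates over and over?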
Answer 0 (score: 1)
new = dict()
for x in people:
    if x['name'] in new:
        new[x['name']].append({'document_id': x['document_id'], 'annotation_id': x['annotation_id']})
    else:
        new[x['name']] = [{'document_id': x['document_id'], 'annotation_id': x['annotation_id']}]
This is the output:
{'Craig McKray': [{'annotation_id': 8, 'document_id': 50}, {'annotation_id': 9, 'document_id': 50}], 'Western Union': [{'annotation_id': 11, 'document_id': 61}], 'None on file': [{'annotation_id': 5, 'document_id': 40}]}
Here, I think this might work better for you:
from collections import defaultdict
new = defaultdict(dict)
for x in people:
    if x['name'] in new:
        new[x['name']]['document_ids'].append(x['document_id'])
        new[x['name']]['annotation_ids'].append(x['annotation_id'])
    else:
        new[x['name']]['document_ids'] = [x['document_id']]
        new[x['name']]['annotation_ids'] = [x['annotation_id']]
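The result above is a dict keyed by name, whereas the question asks for a list of dicts with repeated document_ids collapsed. A minimal sketch of that last step follows, assuming the same `people` list as in the question. The helper name `to_rows` is hypothetical and not part of the original answer, and the grouping loop is a slightly shortened variant of the one above.

```python
from collections import defaultdict

people = [
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8},
    {'name': 'None on file', 'document_id': 40, 'annotation_id': 5},
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9},
    {'name': 'Western Union', 'document_id': 61, 'annotation_id': 11},
]

# The factory creates both lists on first access, so no if/else is needed.
new = defaultdict(lambda: {'document_ids': [], 'annotation_ids': []})
for x in people:
    new[x['name']]['document_ids'].append(x['document_id'])
    new[x['name']]['annotation_ids'].append(x['annotation_id'])


def to_rows(grouped):
    """Hypothetical helper: flatten the name-keyed dict into a list of rows."""
    rows = []
    for name, ids in grouped.items():
        rows.append({
            'name': name,
            # dict.fromkeys drops repeats while keeping first-seen order
            'document_ids': list(dict.fromkeys(ids['document_ids'])),
            'annotation_ids': list(dict.fromkeys(ids['annotation_ids'])),
        })
    return rows


rows = to_rows(new)
for row in rows:
    print(row)
```

Rows come out in the order in which each name was first seen, because dicts preserve insertion order in Python 3.7 and later.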
Answer 1 (score: 0)
A more functional approach using itertools.groupby might look like this. It is a bit cryptic, so I will explain.
from itertools import groupby
from operator import itemgetter

inp = [
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8},
    {'name': 'None on file', 'document_id': 40, 'annotation_id': 5},
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9},
    {'name': 'Western Union', 'document_id': 61, 'annotation_id': 11}
]

def groupvals(vals):
    namegetter = itemgetter('name')
    doccanngetter = itemgetter('document_id', 'annotation_id')
    # groupby only merges adjacent items, so sort by name first
    for grouper, grps in groupby(sorted(vals, key=namegetter), key=namegetter):
        # transpose the (document_id, annotation_id) pairs and de-duplicate each column
        docanns = [set(param) for param in zip(*(doccanngetter(g) for g in grps))]
        # plural keys match the output requested in the question; sorted() fixes the order
        yield {'name': grouper, 'document_ids': sorted(docanns[0]), 'annotation_ids': sorted(docanns[1])}

for result in groupvals(inp):
    print(result)
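To use groupby you need a sorted list, so first sort by name and then group by name. Next, extract the document_id and annotation_id values from each record and zip them. This puts all the document_ids in one tuple and all the annotation_ids in another. You can then call set on each to remove duplicates, and use a generator to yield each group as a dict.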
I used a generator because it avoids having to build up a result list, although you can do that instead if you prefer.
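The zip step is the least obvious part of this answer, so here it is in isolation. This is only an illustrative sketch for a single group: the variable names `group`, `pair_getter`, `pairs`, `doc_ids` and `ann_ids` are made up for this example and do not appear in the answer's code.

```python
from operator import itemgetter

# One group as groupby would hand it over: all records sharing a name.
group = [
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8},
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9},
]

pair_getter = itemgetter('document_id', 'annotation_id')

# One (document_id, annotation_id) tuple per record in the group.
pairs = [pair_getter(g) for g in group]

# zip(*pairs) transposes rows into columns:
# first all document_ids, then all annotation_ids.
doc_ids, ann_ids = zip(*pairs)

# set() removes repeats; sorted() turns the set back into an ordered list.
unique_doc_ids = sorted(set(doc_ids))
unique_ann_ids = sorted(set(ann_ids))

print(pairs)
print(doc_ids, ann_ids)
print(unique_doc_ids, unique_ann_ids)
```

Note that converting to a set loses the original order of the ids, which is why the sketch sorts them afterwards.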
Answer 2 (score: 0)
My take on the problem:
result = []
# resolve duplicate names
all_names = []
for i, item in enumerate(data):
    if item['name'] in all_names:
        continue
    result_row = {'name': item['name'], 'record_ids': [item['document_id']],
                  'annotation_ids': [item['annotation_id']]}
    all_names.append(item['name'])
    for j, double in enumerate(data):
        if item['name'] == double['name'] and i != j:
            result_row['record_ids'].append(double['document_id'])
            result_row['annotation_ids'].append(double['annotation_id'])
    result.append(result_row)
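The nested loop above rescans the whole list for every new name. If that becomes a concern for larger inputs, a single pass with a lookup dict should give the same rows. This is a sketch of an alternative, not the answer author's code; `index_by_name` is a name introduced here for illustration, and `data` is assumed to be the list from the question.

```python
data = [
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8},
    {'name': 'None on file', 'document_id': 40, 'annotation_id': 5},
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9},
    {'name': 'Western Union', 'document_id': 61, 'annotation_id': 11},
]

result = []
index_by_name = {}  # name -> position of that name's row in result

for item in data:
    name = item['name']
    if name not in index_by_name:
        # First time this name is seen: create an empty row for it.
        index_by_name[name] = len(result)
        result.append({'name': name, 'record_ids': [], 'annotation_ids': []})
    row = result[index_by_name[name]]
    row['record_ids'].append(item['document_id'])
    row['annotation_ids'].append(item['annotation_id'])

for row in result:
    print(row)
```

Like the nested-loop version, this keeps repeated document ids (for example `[50, 50]`); wrap the lists in `set` afterwards if only unique ids are wanted.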
Answer 3 (score: 0)
Another option:
from collections import defaultdict
catalog = defaultdict(lambda: defaultdict(list))
for d in dicts:
    entry = catalog[d['name']]
    for k in set(d) - {'name'}:
        entry[k].append(d[k])
Pretty-printed (the output below is from Python 2):
>>> for name, e in catalog.items():
...     print("'{0}': {1}".format(name, e))
'Craig McKray': defaultdict(<type 'list'>, {'annotation_id': [8, 9], 'document_id': [50, 50]})
'Western Union': defaultdict(<type 'list'>, {'annotation_id': [11], 'document_id': [61]})
'None on file': defaultdict(<type 'list'>, {'annotation_id': [5], 'document_id': [40]})
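The catalog holds nested defaultdicts and still contains the repeated document id. A possible follow-up step, sketched here under the assumption that `dicts` is the list from the question, converts it to plain dicts and to the list-of-rows shape the question asked for. The names `plain` and `rows` are introduced only for this example.

```python
from collections import defaultdict

dicts = [
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8},
    {'name': 'None on file', 'document_id': 40, 'annotation_id': 5},
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9},
    {'name': 'Western Union', 'document_id': 61, 'annotation_id': 11},
]

# Same grouping as in the answer above.
catalog = defaultdict(lambda: defaultdict(list))
for d in dicts:
    entry = catalog[d['name']]
    for k in set(d) - {'name'}:
        entry[k].append(d[k])

# Plain dicts print without the defaultdict(...) wrapper.
plain = {name: dict(entry) for name, entry in catalog.items()}

# Rows in the shape requested in the question, with repeated ids removed.
rows = [
    {
        'name': name,
        'document_ids': sorted(set(entry['document_id'])),
        'annotation_ids': sorted(set(entry['annotation_id'])),
    }
    for name, entry in catalog.items()
]

for row in rows:
    print(row)
```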