Question

I have a list like this:

[{'score': '92', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Door'},
 {'score': '61', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Sliding Door'}]

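I want to remove duplicate images based on their imageId. In the example above, imageId 6184de26-e11d-4a7e-9c44-a1af8012d8d0 appears twice, and I want to keep only the entry with the highest score.

How can I do this in Python?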

Answer 1

I assume you want to keep the entry with the highest score here. Try this:

my_list = [
    {'score': '92', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Door'},
    {'score': '61', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Sliding Door'}
]

by_id = {}
for element in my_list:
    imageId = element['imageId']
    if imageId in by_id:
        if int(by_id[imageId]['score']) < int(element['score']):
            # Replace because of higher score
            by_id[imageId] = element
    else:
        # Insert new element
        by_id[imageId] = element

print(list(by_id.values()))
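
The same loop can be wrapped in a reusable helper. This is only a minimal sketch: the function name `keep_highest` and its parameters are hypothetical (they are not part of the answer above), and it assumes every score is a string that holds an integer.

```python
def keep_highest(items, id_key="imageId", score_key="score"):
    """Return one item per id_key value, keeping the highest integer score."""
    best = {}
    for item in items:
        current = best.get(item[id_key])
        # Scores are stored as strings, so compare them as integers.
        if current is None or int(item[score_key]) > int(current[score_key]):
            best[item[id_key]] = item
    return list(best.values())


my_list = [
    {'score': '92', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Door'},
    {'score': '61', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Sliding Door'},
]
result = keep_highest(my_list)
print(result)
```

When two entries tie on score, this version keeps the one that appears first in the input, which matches the behaviour of the loop above.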

Answer 2

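Using groupby (the list has to be sorted by imageId first, and the scores have to be compared as integers, not as strings):

from itertools import groupby
from operator import itemgetter
get_id = itemgetter('imageId')
new_list = [max(group, key=lambda x: int(x['score'])) for _, group in groupby(sorted(lst, key=get_id), key=get_id)]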

Running it:

In [41]: lst = [{'score': '92', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Door'}, {'score': '61', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Sliding Door'}]

In [42]: print([max(l, key=lambda x: int(x['score'])) for _, l in groupby(lst, lambda x: x['imageId'])])
[{'score': '92', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Door'}]
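
Two details decide whether the groupby approach gives the right answer: groupby only merges adjacent items, and scores stored as strings compare lexicographically. The sketch below illustrates both points with made-up data; the short ids 'a' and 'b' and the helper name `image_id` are placeholders introduced for this example only.

```python
from itertools import groupby

# Hypothetical data: the two entries for id 'a' are not adjacent.
lst = [
    {'score': '9', 'imageId': 'a', 'label': 'Door'},
    {'score': '61', 'imageId': 'b', 'label': 'misc'},
    {'score': '100', 'imageId': 'a', 'label': 'Sliding Door'},
]


def image_id(d):
    return d['imageId']


# As strings, '9' sorts after '100'; as integers, 100 is the larger value.
string_max = max(['9', '100'])
int_max = max(['9', '100'], key=int)

# Without sorting, groupby starts a new group every time the key changes,
# so id 'a' shows up as two separate groups.
unsorted_keys = [key for key, _ in groupby(lst, key=image_id)]

# Sorting by id first and comparing scores as integers gives one entry per id.
new_list = [
    max(group, key=lambda d: int(d['score']))
    for _, group in groupby(sorted(lst, key=image_id), key=image_id)
]
print(string_max, int_max, unsorted_keys)
print(new_list)
```

Sorting makes this O(n log n), and the result comes out ordered by imageId rather than in the original input order.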

Answer 3

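I would suggest a few improvements to your example, so that:

- it tests the numeric comparison
- it has non-consecutive "duplicate" elements

I would create a dict with the ID as key and the sub-dict as value, then loop over the input and overwrite the dict entry whenever the new score is higher (don't forget to convert the score to an integer first):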

my_list = [
    {'score': '192', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Door'},
    {'score': '61', 'imageId': 'fffffe26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'misc'},
    {'score': '761', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Sliding Door'},
    {'score': '45', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Door'},
]

# best entry seen so far for each imageId
d = {}

for subdict in my_list:
    score = int(subdict['score'])
    image_id = subdict['imageId']
    if image_id not in d or int(d[image_id]['score']) < score:
        d[image_id] = subdict

new_list = list(d.values())

Result (the order may differ from the input, since dicts did not preserve insertion order before Python 3.7):

[{'imageId': 'fffffe26-e11d-4a7e-9c44-a1af8012d8d0',
  'label': 'misc',
  'score': '61'},
 {'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0',
  'label': 'Sliding Door',
  'score': '761'}]
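
A more compact variant of the same idea, offered as a sketch rather than as the method of the answer above: sort the entries by integer score in ascending order and build a dict keyed by imageId, so that entries with a higher score overwrite entries with a lower one. The variable name `best` is an assumption made for this example.

```python
my_list = [
    {'score': '192', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Door'},
    {'score': '61', 'imageId': 'fffffe26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'misc'},
    {'score': '761', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Sliding Door'},
    {'score': '45', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Door'},
]

# Ascending sort by integer score: for each imageId the last entry written
# into the dict is the one with the highest score.
best = {d['imageId']: d for d in sorted(my_list, key=lambda d: int(d['score']))}
new_list = list(best.values())
print(new_list)
```

This trades the single pass of the explicit loop for an O(n log n) sort. It also differs on ties: because the sort is stable, the entry that appears later in the input wins when two scores are equal, whereas the loop keeps the earlier one.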

Answer 4

If you have a lot of data, just process it with a pandas.DataFrame (it is cleaner to read and maintain).

import pandas as pd

my_list = [
    {'score': '192', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Door'},
    {'score': '61', 'imageId': 'fffffe26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'misc'},
    {'score': '761', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Sliding Door'},
    {'score': '45', 'imageId': '6184de26-e11d-4a7e-9c44-a1af8012d8d0', 'label': 'Door'},
]

# create dataframe
df = pd.DataFrame(my_list)

# your score is string! convert it to int
df['score'] = df['score'].astype('int')

# sort values
df = df.sort_values(by=['imageId', 'score'], ascending=False)

# drop duplicates
df = df.drop_duplicates('imageId', keep='first')


    imageId                                 label           score
1   fffffe26-e11d-4a7e-9c44-a1af8012d8d0    misc            61
2   6184de26-e11d-4a7e-9c44-a1af8012d8d0    Sliding Door    761
