Question

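I have been looking for a set()-like way to deduplicate a list, except that the items in the original list are not hashable (they are dicts).

I spent a while looking for something adequate, and I ended up writing this little function: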

def deduplicate_list(lst, key):
    output = []
    keys = []
    for i in lst:
        if i[key] not in keys:
            output.append(i)
            keys.append(i[key])

    return output

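Provided key is given correctly and is a string, this function does its job nicely. Needless to say, if I learn about a built-in or a standard library module that offers the same functionality, I will happily drop my little routine in favor of a more standard and robust choice.

Do you know of such an implementation?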

- Note

The following one-liner, found in this answer,

[dict(t) for t in set([tuple(d.items()) for d in l])]

while clever, won't work because the items I have to work with are nested dicts.
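A minimal sketch of the failure, using a made-up record (the data here is hypothetical): tuple(d.items()) still contains the inner dict, which is unhashable, so building the set raises a TypeError.

```python
# Hypothetical record with a nested dict (illustration only).
l = [{"id": "1234", "attributes": {"handle": "jsmith"}}]

try:
    deduplicated = [dict(t) for t in set([tuple(d.items()) for d in l])]
    error = None
except TypeError as exc:
    # The tuple holds ("attributes", {...}); the inner dict cannot be hashed,
    # so the tuple cannot be hashed either and set() fails.
    error = exc

print(type(error).__name__)  # TypeError
```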

- Example

For clarity, here is an example of using such a routine:

with_duplicates = [
    {
        "type": "users",
        "attributes": {
            "first-name": "John",
            "email": "john.smith@gmail.com",
            "last-name": "Smith",
            "handle": "jsmith"
        },
        "id": "1234"
    },
    {
        "type": "users",
        "attributes": {
            "first-name": "John",
            "email": "john.smith@gmail.com",
            "last-name": "Smith",
            "handle": "jsmith"
        },
        "id": "1234"
    }
]

without_duplicates = deduplicate_list(with_duplicates, key='id')

Answer 1

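You are picking only the first dict in your list for each distinct value of key. itertools.groupby is the built-in tool that can do this for you - sort and group by key, and take only the first item from each group. Note that the result comes back ordered by key rather than in the original list order: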

from itertools import groupby

def deduplicate(lst, key):
    fnc = lambda d: d.get(key)  # more robust than d[key]
    return [next(g) for k, g in groupby(sorted(lst, key=fnc), key=fnc)]
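A quick check of this approach on a small made-up list (the sample data is hypothetical). Because sorted is stable, the first item of each group is also the first occurrence in the original list:

```python
from itertools import groupby

def deduplicate(lst, key):
    fnc = lambda d: d.get(key)
    # Sort by the key, group equal keys together, keep the first of each group
    return [next(g) for _, g in groupby(sorted(lst, key=fnc), key=fnc)]

# Hypothetical sample data
sample = [
    {"id": "2", "name": "b"},
    {"id": "1", "name": "a"},
    {"id": "2", "name": "b-again"},
]
result = deduplicate(sample, key="id")
print(result)
# [{'id': '1', 'name': 'a'}, {'id': '2', 'name': 'b'}]
```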

Answer 2

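This answer addresses a more general problem - finding unique elements not by a single attribute (id in your case), but treating elements as different if any nested attribute differs.

The following code returns a list of indices of the unique elements: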

import copy

def make_hash(o):
  """
  Makes a hash from a dictionary, list, tuple or set to any level, that contains
  only other hashable types (including any lists, tuples, sets, and
  dictionaries).
  """
  if isinstance(o, (set, tuple, list)):
    return tuple([make_hash(e) for e in o])
  elif not isinstance(o, dict):
    return hash(o)

  # Hash every value recursively so nested containers become hashable
  new_o = copy.deepcopy(o)
  for k, v in new_o.items():
    new_o[k] = make_hash(v)

  # frozenset makes the result independent of key order
  return hash(frozenset(new_o.items()))

l = [
    {
        "type": "users",
        "attributes": {
            "first-name": "John",
            "email": "john.smith@gmail.com",
            "last-name": "Smith",
            "handle": "jsmith"
        },
        "id": "1234"
    },
    {
        "type": "users",
        "attributes": {
            "first-name": "AAA",
            "email": "aaa.aaah@gmail.com",
            "last-name": "XXX",
            "handle": "jsmith"
        },
        "id": "1234"
    },
    {
        "type": "users",
        "attributes": {
            "first-name": "John",
            "email": "john.smith@gmail.com",
            "last-name": "Smith",
            "handle": "jsmith"
        },
        "id": "1234"
    },
]

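# get indices of unique elements (last occurrence of each; the result order
# follows dict ordering, so it can differ between Python versions)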
In [254]: list({make_hash(x):i for i,x in enumerate(l)}.values())
Out[254]: [1, 2]
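If every value in the dicts is JSON-serializable (an assumption, not something the question guarantees), a different technique is to use json.dumps with sort_keys=True as a fingerprint instead of a custom hash. This is a rough sketch of that alternative, not the method above; deduplicate_nested is a hypothetical name:

```python
import json

def deduplicate_nested(lst):
    """Drop dicts that are deep-equal to an earlier one, keeping the first."""
    seen = set()
    output = []
    for item in lst:
        # sort_keys=True makes the string independent of key insertion order,
        # at every nesting level
        fingerprint = json.dumps(item, sort_keys=True)
        if fingerprint not in seen:
            seen.add(fingerprint)
            output.append(item)
    return output

# Hypothetical sample data: the third record equals the first,
# only the key order differs.
records = [
    {"id": "1234", "attributes": {"handle": "jsmith", "first-name": "John"}},
    {"id": "1234", "attributes": {"handle": "jsmith", "first-name": "AAA"}},
    {"attributes": {"first-name": "John", "handle": "jsmith"}, "id": "1234"},
]
unique = deduplicate_nested(records)
print(len(unique))  # 2
```

Unlike the index-based approach, this returns the elements themselves and keeps the first occurrence of each.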

Answer 3

You can try a short version, based on the answer you linked in your question:

key = "id"
deduplicated = [val for ind, val in enumerate(l)
                if val[key] not in [tmp[key] for tmp in l[ind + 1:]]]
print(deduplicated)

Note that this keeps the last element among duplicates.

Answer 4

In your example, the value returned by the key is hashable. If that is always the case, then use:

def deduplicate(lst, key):
    return list({item[key]: item for item in lst}.values())

If there are duplicates, only the last matching one is kept.
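If the first match should be kept instead, one possible variant (a sketch under the same hashable-key assumption; deduplicate_keep_first is a hypothetical name) uses dict.setdefault:

```python
def deduplicate_keep_first(lst, key):
    """Like the dict-comprehension version, but keeps the first match."""
    unique = {}
    for item in lst:
        # setdefault stores the item only if this key value is not present yet
        unique.setdefault(item[key], item)
    return list(unique.values())

# Hypothetical sample data
items = [
    {"id": "1", "v": "first"},
    {"id": "2", "v": "only"},
    {"id": "1", "v": "last"},
]
kept = deduplicate_keep_first(items, key="id")
print(kept)
```

On Python 3.7+, where dicts preserve insertion order, the result also follows the order of first occurrences.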

Am I reinventing the wheel with this deduplication function?
