Description:

The input contains two lists. Each list contains a series of dicts in the following format:

{
    'a': 'foo',
    'b': 'bar',
    'switch': True
}

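First, I have to check whether each (a, b) pair found in the second list exists in the first list; if it does not, I add the new pair to a list named added. Likewise, I have to check whether each (a, b) pair found in the first list exists in the second list; if it does not, I add the removed pair to a list named delisted.

Then, for the pairs that exist in both lists, I have to check whether the switch key is the same. If it is not, I have to add the pair to the switched list.

Example:

To sum up, here is an example: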

# First list in input
first = [
    {'a': 'foo', 'b': 'bar', 'switch': False},
    {'a': 'I_will', 'b': 'be_delisted', 'switch': True},
    {'a': 'I_will', 'b': 'be_switched', 'switch': True}
]

# Second list to compare
second = [
    {'a': 'foo', 'b': 'bar', 'switch': False},
    {'a': 'I_am', 'b': 'new', 'switch': True},
    {'a': 'I_will', 'b': 'be_switched', 'switch': False}  # switched
]

diff = my_diff(first, second)

Expected output:

{
    'added': [{'a': 'I_am', 'b': 'new', 'switch': True}],
    'delisted': [{'a': 'I_will', 'b': 'be_delisted', 'switch': True}],
    'switched': [{'a': 'I_will', 'b': 'be_switched', 'switch': False}]
}

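So there are two different comparisons:

A comparison of the elements between the lists

A comparison of the content of the elements that exist in both lists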
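The two comparisons can be sketched with set operations on the (a, b) pairs. This is only an illustration on the sample data from the example; `key`, `first_keys`, `second_keys` and the three result names are hypothetical and not part of the original code.

```python
# Sample data from the example above
first = [
    {'a': 'foo', 'b': 'bar', 'switch': False},
    {'a': 'I_will', 'b': 'be_delisted', 'switch': True},
    {'a': 'I_will', 'b': 'be_switched', 'switch': True},
]
second = [
    {'a': 'foo', 'b': 'bar', 'switch': False},
    {'a': 'I_am', 'b': 'new', 'switch': True},
    {'a': 'I_will', 'b': 'be_switched', 'switch': False},
]

def key(e):
    # Hypothetical helper: the (a, b) pair that identifies an element
    return (e['a'], e['b'])

first_keys = {key(e) for e in first}
second_keys = {key(e) for e in second}

# Comparison 1: which pairs appear in only one of the lists
added_keys = second_keys - first_keys
delisted_keys = first_keys - second_keys

# Comparison 2: pairs present in both lists, whose 'switch' still has to be compared
common_keys = first_keys & second_keys
```

Only the pairs in `common_keys` need the second, content-level comparison.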

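Existing code:

To do the first comparison between the lists, I use the hash function to build a hash of each (a, b) pair. I then store each hash, together with the index of its element, in the first_hash list for the first list and in the second_hash list for the second.

Like this: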

first_hash = [(hash((first[i]['a'], first[i]['b'])), i) for i in xrange(0, len(first))]
second_hash = [(hash((second[i]['a'], second[i]['b'])), i) for i in xrange(0, len(second))]

I get the added and delisted lists:

added = [second[e[1]] for e in second_hash if e[0] not in (fh[0] for fh in first_hash)]
delisted = [first[e[1]] for e in first_hash if e[0] not in (sh[0] for sh in second_hash)]
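A note on cost: each `not in (fh[0] for fh in first_hash)` test rebuilds a generator and scans it linearly, so these two comprehensions do work proportional to len(first) * len(second). A minimal sketch of the same step with the hashes collected once into sets, assuming the (hash, index) layout of first_hash and second_hash; `first_hashes` and `second_hashes` are names introduced here, and the two short lists are a subset of the sample data.

```python
first = [{'a': 'foo', 'b': 'bar', 'switch': False},
         {'a': 'I_will', 'b': 'be_delisted', 'switch': True}]
second = [{'a': 'foo', 'b': 'bar', 'switch': False},
          {'a': 'I_am', 'b': 'new', 'switch': True}]

# Same (hash, index) tuples as in the question
first_hash = [(hash((e['a'], e['b'])), i) for i, e in enumerate(first)]
second_hash = [(hash((e['a'], e['b'])), i) for i, e in enumerate(second)]

# Build the sets of hashes once; set membership is O(1) on average
first_hashes = {h for h, _ in first_hash}
second_hashes = {h for h, _ in second_hash}

added = [second[i] for h, i in second_hash if h not in first_hashes]
delisted = [first[i] for h, i in first_hash if h not in second_hashes]
```

With the sets built up front, each list is walked once instead of once per element of the other list.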

I get the elements common to both lists and push them into dicts keyed by the hash value, so they are easy to compare:

# Getting the first list's common elements
sames_first = [(e[0], first[e[1]]) for e in first_hash if e[0] in (sh[0] for sh in second_hash)]
# Getting the second list's common elements
sames_second = [(e[0], second[e[1]]) for e in second_hash if e[0] in (fh[0] for fh in first_hash)]

# Index the common elements by hash
sfirst = {}
ssecond = {}
for sf in sames_first:
    sfirst[sf[0]] = sf[1]
for ss in sames_second:
    ssecond[ss[0]] = ss[1]

Then I compare them and get the switched list:

switched = [ssecond[e] for e in ssecond.keys() if ssecond[e]['switch'] != sfirst[e]['switch']]

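I push ssecond[e] (the element from the second list) so that the result carries the new value.

Full code:

To test locally, use the tester on Pastebin: Pastebin

To test directly online: Online testing

Currently I get: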

1.92713737488 ms for 100 element
162.150144577 ms for 1000 element
15205.0578594 ms for 10000 element
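These timings grow close to quadratically: ten times more elements costs roughly 84 to 94 times more time. A quick check of that arithmetic, using only the three measurements reported here (the assumption being that time behaves like n**k for some exponent k):

```python
import math

# (number of elements, measured time in ms), copied from the timings above
timings = [(100, 1.92713737488), (1000, 162.150144577), (10000, 15205.0578594)]

exponents = []
for (n1, t1), (n2, t2) in zip(timings, timings[1:]):
    # If time ~ n**k, then k = log(t2 / t1) / log(n2 / n1)
    k = math.log(t2 / t1) / math.log(n2 / n1)
    exponents.append(k)
    print("%d -> %d elements: time x%.1f" % (n1, n2, t2 / t1))
```

Both estimated exponents come out a little under 2, which is what nested linear scans over the two lists would produce.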

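My question is: is there a more efficient way to perform this task on large data sets? (For example, mapping each object by its index or by one of its attributes and comparing them directly?)

Thanks to anyone who takes a little time to read this and tries to answer my request :)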

Answer 1

You can build the output format in a dict. Using list comprehensions, you can get the desired output with a more reasonable time complexity.

    [res['switched'].append(i) if switchDict(i) in first else res['added'].append(i) if i not in first  else None for i in second ]

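The above fills the switched key (if the element, with its switch flipped, is found in first) and the added key (if the element is not present in first) of your res dict.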

res['delisted']=[i for i in first if i not in second and switchDict(i) not in res['switched']]

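Similarly, fill the delisted key of the res dict by iterating over the first list and keeping the elements that are not present in the second list and are not in the switched state.

Edit: the snippet above checks switchDict(i) not in res['switched'] instead of switchDict(i) not in second, which cuts the execution time for 10000 elements by about 500 ms!

Thus,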

def switchDict(d):
    return {'a':d['a'],'b':d['b'],'switch':not d['switch']}

def my_diff(first, second):
    # One list per key; dict.fromkeys(keys, []) would bind all three keys to the same list
    res = {'added': [], 'switched': [], 'delisted': []}
    # Keep in `second` only the elements found unchanged in first: append() returns None,
    # so switched and added elements are filtered out. This reduces the execution time
    # by another 1000 ms (approx.)
    second = list(filter(None, [res['switched'].append(i) if switchDict(i) in first else res['added'].append(i) if i not in first else i for i in second]))
    res['delisted'] = [i for i in first if i not in second and switchDict(i) not in res['switched']]
    return res

will give you the appropriate results in:

0.0457763671875 ms for 10 element
1.32894515991 ms for 100 element
64.845085144 ms for 1000 element
6941.58291817 ms for 10000 element

(The timings here depend on the random input generated by the Python file you shared!)

Hope it helps!
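A detail to watch when initialising a result dict like res: `dict.fromkeys(keys, [])` binds every key to the same list object, so an append through one key shows up under all of them. A short check, with a dict comprehension as the alternative (`shared` and `separate` are names used only for this illustration):

```python
keys = ['added', 'switched', 'delisted']

# fromkeys reuses the single list passed as the default value
shared = dict.fromkeys(keys, [])
shared['added'].append('x')
assert shared['switched'] == ['x']           # the append is visible under another key
assert shared['added'] is shared['delisted']  # all keys point at one list object

# A comprehension (or a dict literal) creates one list per key
separate = {k: [] for k in keys}
separate['added'].append('x')
assert separate['switched'] == []
```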

Answer 2

I found another solution using hashed elements:

def hash_elem(e):
    return hash( ( e['a'], e['b'] ) )

def my_diff(first, second):
    res = {'added':[],'switched':[],'delisted':[]}

    hf = {}
    hs = {}

    for ef in first:
        hf[hash_elem(ef)] = ef
    for es in second:
        hs[hash_elem(es)] = es

    sames = [s for s in hs.keys() if s in hf.keys()]
    [res['switched'].append(hs[s]) for s in sames if hs[s]['switch'] != hf[s]['switch']]

    [res['added'].append(hs[a]) for a in hs.keys() if a not in hf.keys()]
    [res['delisted'].append(hf[a]) for a in hf.keys() if a not in hs.keys()]

    return res

I get:

0.0219345092773 ms for 10 element
0.480175018311 ms for 100 element
38.6848449707 ms for 1000 element
6074.10311699 ms for 10000 element
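Much of the remaining time at 10000 elements likely comes from the `in hf.keys()` and `in hs.keys()` tests: under Python 2 (which the xrange calls in the question suggest), keys() builds a new list on every call, so each membership test is a linear scan. Testing membership on the dict itself uses its hash table instead. Below is a sketch of that variant. It also keys the dicts on the (a, b) tuple rather than on hash(...), so two different pairs whose hashes collide cannot overwrite each other. `my_diff_by_key` and `key` are names made up here, and the variant has not been timed against the figures above.

```python
def key(e):
    # Identify an element by its (a, b) pair
    return (e['a'], e['b'])

def my_diff_by_key(first, second):
    # Index both lists by their (a, b) pair
    hf = {key(e): e for e in first}
    hs = {key(e): e for e in second}

    res = {'added': [], 'switched': [], 'delisted': []}
    for k, es in hs.items():
        if k not in hf:                        # dict membership, O(1) on average
            res['added'].append(es)
        elif es['switch'] != hf[k]['switch']:  # same pair, different switch value
            res['switched'].append(es)
    res['delisted'] = [ef for k, ef in hf.items() if k not in hs]
    return res
```

As in the hashed version, if a list contains the same (a, b) pair more than once, only the last occurrence is kept.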

I am trying to combine this with your solution, Keerthana Prabhakaran.

Optimization: finding the best way to compare two lists of dicts (Python)
