How do I find the dict keys with matching values in two dicts?

Time: 2016-09-14 20:47:56

Tags: python dictionary optimization

I have two dictionaries that map IDs to values. For simplicity, let's say these are the dictionaries:

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

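As the names suggest, the dictionaries are not symmetrical. I would like to get a dictionary of keys from d_source and d_target whose values match. The resulting dictionary would have the d_source keys as its own keys, and the matching d_target keys as the value for each key (in a list, tuple or set).

The expected return value for the example above would be the following dictionary: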

{'a': ('1', 'A'),
 'b': ('B',),
 'c': ('C',),
 '3': ('C',)}

There are two similar questions, but their solutions can't easily be applied to my problem.

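Some characteristics of the data:

  1. The source is usually smaller than the target: roughly a few thousand sources (at most) and more targets.
  2. Duplicate values within the same dictionary (d_source or d_target) are not very likely.
  3. Matches are expected to be found for (a rough estimate) no more than 50% of the d_source items.
  4. All keys are integers.

What is the best (performance-wise) solution to this problem? Modeling the data into other data types to improve performance is totally fine, even if that means using third-party libraries (I'm thinking of numpy).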

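5 answers:

Answer 0: (score: 2)

All the other answers have O(n^2) efficiency, which isn't very good, so I thought I'd answer myself.

I use 2(source_len) + 2(dict_count)(dict_len) memory and have O(2n) efficiency, which I believe is the best you can get here.

Here you go: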

from collections import defaultdict

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

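def merge_dicts(source_dict, *rest):
    # Invert the remaining dicts: value -> list of keys that map to it.
    flipped_rest = defaultdict(list)
    for d in rest:
        # popitem() empties d while it is read, so the inputs are consumed.
        while d:
            k, v = d.popitem()
            flipped_rest[v].append(k)
    # Look up each source value in the inverted index.
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}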

new_dict = merge_dicts(d_source, d_target)

By the way, I'm using tuples so that the resulting lists are not linked together.
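To illustrate the tuple remark: source keys that share a value ('c' and '3' here) look up the same list in the inverted index, so returning those lists directly would hand out one shared object. A small sketch of the difference, built without popitem() so the inputs stay intact; the names flipped, shared and separate are only for illustration:

```python
from collections import defaultdict

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

# Build the value -> target keys index without consuming d_target.
flipped = defaultdict(list)
for k, v in d_target.items():
    flipped[v].append(k)

# Tuples are copies of the index entries and cannot be modified later.
separate = {k: tuple(flipped.get(v, ())) for k, v in d_source.items()}

# Returning the lists as-is: 'c' and '3' refer to the very same list object.
shared = {k: flipped.get(v, []) for k, v in d_source.items()}
print(shared['c'] is shared['3'])

# A change made through one key shows up under the other key as well.
shared['c'].append('extra')
print(shared['3'])
print(separate['3'])
```

With the list version, appending through 'c' also changes what '3' sees, while the tuple version is unaffected.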

Since you have added specifications for the data, here is a solution that matches them more closely:

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

def second_merge_dicts(source_dict, *rest):
    """Optimized for ~50% source match due to if statement addition.

    Also uses less memory.
    """
    unique_values = set(source_dict.values())
    flipped_rest = defaultdict(list)
    for d in rest:
        while d:
            k, v = d.popitem()
            if v in unique_values:
                flipped_rest[v].append(k)
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}

new_dict = second_merge_dicts(d_source, d_target)
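Note that both functions above empty every dictionary passed in rest, because popitem() removes each pair as it is read. If the target dictionaries are still needed afterwards, a non-destructive variant along the same lines might look like the sketch below. The name merge_dicts_keep is hypothetical and not part of the answer, and the variant gives up the memory saving that popitem() provides:

```python
from collections import defaultdict

def merge_dicts_keep(source_dict, *rest):
    """Like second_merge_dicts, but leaves the dicts in rest untouched."""
    wanted = set(source_dict.values())
    flipped_rest = defaultdict(list)
    for d in rest:
        for k, v in d.items():      # iterate instead of calling popitem()
            if v in wanted:         # skip values no source key can match
                flipped_rest[v].append(k)
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

result = merge_dicts_keep(d_source, d_target)
print(result)
print(len(d_target))   # d_target still holds all of its items
```

The order of the keys inside each tuple may differ from the popitem() version, which reads each dictionary from the end.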

Answer 1: (score: 1)

from collections import defaultdict
from pprint import pprint

d_source  = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

d_result = defaultdict(list)
# Compare every source key with every target key and collect the matches.
for a in d_source:
    for b in d_target:
        if d_source[a] == d_target[b]:
            d_result[a].append(b)

pprint(d_result)

Output:

{'3': ['C'],
 'a': ['A', '1'],
 'b': ['B'],
 'c': ['C']}

Timing results:

from collections import defaultdict
from copy import deepcopy
from random import randint
from timeit import timeit


def Craig_match(source, target):
    result = defaultdict(list)
    {result[a].append(b) for a in source for b in target if source[a] == target[b]}
    return result

def Bharel_match(source_dict, *rest):
    flipped_rest = defaultdict(list)
    for d in rest:
        while d:
            k, v = d.popitem()
            flipped_rest[v].append(k)
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}

def modified_Bharel_match(source_dict, *rest):
    """Optimized for ~50% source match due to if statement addition.

    Also uses less memory.
    """
    unique_values = set(source_dict.values())
    flipped_rest = defaultdict(list)
    for d in rest:
        while d:
            k, v = d.popitem()
            if v in unique_values:
                flipped_rest[v].append(k)
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}

# generate source, target such that:
# a) ~10% duplicate values in source and target
# b) 2000 unique source keys, 20000 unique target keys
# c) a little less than 50% matches source value to target value
# d) numeric keys and values
source = {}
for k in range(2000):
    source[k] = randint(0, 1800)
target = {}
for k in range(20000):
    if k < 1000:
        target[k] = randint(0, 2000)
    else:
        target[k] = randint(2000, 19000)

best_time = {}
approaches = ('Craig', 'Bharel', 'modified_Bharel')
for a in approaches:
    best_time[a] = None

for _ in range(3):
    for approach in approaches:
        test_source = deepcopy(source)
        test_target = deepcopy(target)

        statement = 'd=' + approach + '_match(test_source,test_target)'
        setup = 'from __main__ import test_source, test_target, ' + approach + '_match'
        t = timeit(stmt=statement, setup=setup, number=1)
        if not best_time[approach] or (t < best_time[approach]):
            best_time[approach] = t

for approach in approaches:
    print(approach, ':', '%0.5f' % best_time[approach])

Output:

Craig : 7.29259
Bharel : 0.01587
modified_Bharel : 0.00682
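The size of the gap is plausible given how much work each approach does on this test data. A rough count, assuming one comparison per key pair for the nested loops and one dictionary operation per item for the inverted-index approach (an estimate of operation counts, not a measurement):

```python
source_len = 2000     # keys in source, as generated in the timing script
target_len = 20000    # keys in target, as generated in the timing script

# Craig_match compares every source key with every target key.
nested_loop_comparisons = source_len * target_len

# Bharel_match touches each target item once to build the index,
# then each source item once to look its value up.
index_operations = target_len + source_len

print(nested_loop_comparisons)
print(index_operations)
print(nested_loop_comparisons // index_operations)
```

The measured ratio differs from this count because the cost of a single operation is not the same in both approaches.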

Answer 2: (score: 1)

Here is another solution. There are many ways to do this.

for key1 in d1:
    for key2 in d2:
        if d1[key1] == d2[key2]:
            pass  # do stuff with the matching keys here

Note that you can use any names you like for key1 and key2.
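Filled in for the dictionaries from the question, the nested loop could collect the matches as in the sketch below. Gathering the target keys with setdefault is an assumption about what the loop body should do, and every pair of keys is compared, so the run time grows with len(d1) * len(d2):

```python
d1 = {'a': 1, 'b': 2, 'c': 3, '3': 3}   # d_source from the question
d2 = {'A': 1, 'B': 2, 'C': 3, '1': 1}   # d_target from the question

matches = {}
for key1 in d1:
    for key2 in d2:
        if d1[key1] == d2[key2]:
            # Group every matching d2 key under the d1 key.
            matches.setdefault(key1, []).append(key2)

# Convert the lists to tuples to get the format asked for in the question.
matches = {k: tuple(v) for k, v in matches.items()}
print(matches)
```

Source keys without any match simply do not appear in matches.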

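Answer 3: (score: 1)

This may be "cheating" in some respects, but if you are looking for matching values of keys regardless of case, then you might be able to do this: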

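aa = {'a': 1, 'b': 2, 'c': 3}
bb = {'A': 1, 'B': 2, 'd': 3}

# Lowercase the keys of bb so they can be compared with the keys of aa.
bbl = {k.lower(): v for k, v in bb.items()}

# Python 3: item views support set operations, so & keeps the
# (key, value) pairs that occur in both dictionaries.
result = {k: k.upper() for k, v in aa.items() & bbl.items()}
print(result)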

Output:

{'a': 'A', 'b': 'B'}

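The bbl declaration converts the keys of bb to lowercase (this could be done for either aa or bb).

* I only tested this on my phone, so I'm just throwing the idea out there. Also, you have changed your question radically since I began composing my answer, so you get what you get.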

Answer 4: (score: 0)

It is up to you to decide what the best solution is. Here is a solution:

def dicts_to_tuples(*dicts):
    result = {}
    for d in dicts:
        for k,v in d.items():
            result.setdefault(v, []).append(k)
    return [tuple(v) for v in result.values() if len(v) > 1]

d1 = {'a': 1, 'b': 2, 'c':3}
d2 = {'A': 1, 'B': 2}
print(dicts_to_tuples(d1, d2))
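For reference, applying the same function to the dictionaries from the question groups all keys that share a value, without distinguishing source keys from target keys. A quick sketch in Python 3 syntax; the function body is repeated so the snippet runs on its own, and the name groups is only for illustration:

```python
def dicts_to_tuples(*dicts):
    # Group the keys of all dicts by the value they map to.
    result = {}
    for d in dicts:
        for k, v in d.items():
            result.setdefault(v, []).append(k)
    # Keep only the values that are shared by more than one key.
    return [tuple(v) for v in result.values() if len(v) > 1]

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

groups = dicts_to_tuples(d_source, d_target)
print(groups)
```

Each tuple mixes source and target keys, so an extra step would be needed to get the per-source-key mapping asked for in the question.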