Question

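I have a reference dictionary, "dictA", and I need to compare it (that is, compute the similarity between keys and values) with n dictionaries that are generated on the spot. Every dictionary has the same length. For the sake of discussion, say there are 3 dictionaries to compare against: dictB, dictC and dictD.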

Here is what dictA looks like:

dictA={'1':"U", '2':"D", '3':"D", '4':"U", '5':"U",'6':"U"}

Here is what dictB, dictC and dictD look like:

dictB={'1':"U", '2':"U", '3':"D", '4':"D", '5':"U",'6':"D"}
dictC={'1':"U", '2':"U", '3':"U", '4':"D", '5':"U",'6':"D"}
dictD={'1':"D", '2':"U", '3':"U", '4':"U", '5':"D",'6':"D"}

I have a solution, but it only handles two dictionaries:

sharedValue = set(dictA.items()) & set(dictD.items())
dictLength = len(dictA)
scoreOfSimilarity = len(sharedValue)
similarity = scoreOfSimilarity/dictLength

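My question is: how can I iterate over n dictionaries, with dictA as the primary dictionary that the others are compared against? The goal is to get a "similarity" value for each dictionary that I iterate against the primary dictionary.

Thanks for your help.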

Answer 1

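Here is a general structure, assuming you can generate the dictionaries one at a time and use each one before generating the next. That sounds like what you may want. calculate_similarity would be a function containing your "I have a solution" code above.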

reference = {'1':"U", '2':"D", '3':"D", '4':"U", '5':"U",'6':"U"}
while True:
    on_the_spot = generate_dictionary()
    if on_the_spot is None:
        break
    calculate_similarity(reference, on_the_spot)
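A minimal sketch of the two helpers that the loop above relies on. calculate_similarity wraps the asker's two-dictionary code; generate_dictionary is a hypothetical stand-in (not part of the original answer) that hands out pre-built dictionaries and returns None once none are left.

```python
reference = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}

# Hypothetical supply of dictionaries; a real program would build them on the spot.
_pending = [
    {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"},
    {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"},
]


def generate_dictionary():
    """Return the next dictionary, or None when the supply is exhausted."""
    return _pending.pop(0) if _pending else None


def calculate_similarity(reference, other):
    """Fraction of (key, value) pairs of `reference` that also occur in `other`."""
    shared = set(reference.items()) & set(other.items())
    return len(shared) / len(reference)


scores = []
while True:
    on_the_spot = generate_dictionary()
    if on_the_spot is None:
        break
    # Keep each score instead of discarding it.
    scores.append(calculate_similarity(reference, on_the_spot))

print(scores)
```

Collecting the scores in a list is one possible choice; the score could just as well be used immediately and dropped.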

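If you need to iterate over dictionaries that have already been generated, you have to put them in an iterable Python structure. As you generate them, build a list of dictionaries: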

victim_list = [
    {'1':"U", '2':"U", '3':"D", '4':"D", '5':"U",'6':"D"},
    {'1':"U", '2':"U", '3':"U", '4':"D", '5':"U",'6':"D"},
    {'1':"D", '2':"U", '3':"U", '4':"U", '5':"D",'6':"D"}
]
for on_the_spot in victim_list:
    # Proceed as above
    calculate_similarity(reference, on_the_spot)

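Are you familiar with the Python construct called a generator? It is like a function, except that it hands back its values with yield instead of return. If so, use one in place of the list above.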
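A minimal sketch of the generator variant, assuming the dictionaries can be produced one by one. generate_dictionaries is a hypothetical name, and the three hard-coded dictionaries only stand in for whatever builds them on the spot.

```python
def calculate_similarity(reference, other):
    """Fraction of (key, value) pairs of `reference` that also occur in `other`."""
    shared = set(reference.items()) & set(other.items())
    return len(shared) / len(reference)


def generate_dictionaries():
    """Hypothetical generator: yields one dictionary at a time."""
    yield {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"}
    yield {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"}
    yield {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}


reference = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}

# The for clause pulls dictionaries from the generator lazily, one per iteration.
scores = [calculate_similarity(reference, d) for d in generate_dictionaries()]
print(scores)
```

Unlike the list, the generator never holds all the dictionaries in memory at once.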

Answer 2

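Given how your question is set up, there seems to be no alternative to looping over the input list of dictionaries. However, a multiprocessing trick can be applied here.

Here are your inputs: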

dict_a = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
dict_b = {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"}
dict_c = {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"}
dict_d = {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}
other_dicts = [dict_b, dict_c, dict_d]

I have included @gary_fixler's map technique as similarity1, in addition to the similarity2 function that I will use for the loop technique.

def similarity1(a):
    def _(b):
        shared_value = set(a.items()) & set(b.items())
        dict_length = len(a)
        score_of_similarity = len(shared_value)
        return score_of_similarity / dict_length
    return _

def similarity2(c):
    a, b = c
    shared_value = set(a.items()) & set(b.items())
    dict_length = len(a)
    score_of_similarity = len(shared_value)
    return score_of_similarity / dict_length

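We are evaluating 3 techniques:
(1) @gary_fixler's map
(2) a simple loop over the list of dicts
(3) multiprocessing the list of dicts

Here are the execution statements: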

import itertools
import multiprocessing

print(list(map(similarity1(dict_a), other_dicts)))
print([similarity2((dict_a, dict_v)) for dict_v in other_dicts])

# With the "spawn" start method (Windows, macOS), run this under if __name__ == "__main__":
max_processes = int(multiprocessing.cpu_count()/2-1)
pool = multiprocessing.Pool(processes=max_processes)
print([x for x in pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts))])

You will find that all 3 techniques produce the same results:

[0.5, 0.3333333333333333, 0.16666666666666666]
[0.5, 0.3333333333333333, 0.16666666666666666]
[0.5, 0.3333333333333333, 0.16666666666666666]

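Note that multiprocessing.cpu_count() counts logical CPUs, so if every core has hyper-threading you have multiprocessing.cpu_count()/2 physical cores. Assuming nothing else is running on your system and your program has no I/O or synchronization needs (as is the case for our problem), you will often get the best performance with multiprocessing.cpu_count()/2-1 processes, the -1 being reserved for the parent process.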
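On a machine with two or fewer logical CPUs, that formula evaluates to zero, and multiprocessing.Pool rejects a process count below 1. A minimal sketch of a guarded version follows; pool_size is a hypothetical helper name, not something from the original answer.

```python
import multiprocessing


def pool_size(logical_cpus):
    """Half the logical CPUs, minus one for the parent process, but never below 1."""
    return max(1, logical_cpus // 2 - 1)


# Example: derive the worker count for the current machine.
print(pool_size(multiprocessing.cpu_count()))
```

The result can be passed as the processes argument when the pool is created.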

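Now, timing the 3 techniques:

import timeit

print(timeit.timeit("list(map(similarity1(dict_a), other_dicts))",
                    setup="from __main__ import similarity1, dict_a, other_dicts",
                    number=10000))

print(timeit.timeit("[similarity2((dict_a, dict_v)) for dict_v in other_dicts]",
                    setup="from __main__ import similarity2, dict_a, other_dicts",
                    number=10000))

# The timed statement uses itertools, so the setup has to import it too.
print(timeit.timeit("[x for x in pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts))]",
                    setup="import itertools; from __main__ import similarity2, dict_a, other_dicts, pool",
                    number=10000))

This produces the following results on my laptop:

0.07092539698351175
0.06757041101809591
1.6528456939850003

You can see that the basic loop technique performs best. Multiprocessing is far worse than the other two techniques because of the overhead of creating processes and passing data back and forth. This does not mean multiprocessing is useless here. Quite the opposite. Consider a much larger number of input dictionaries:

for _ in range(7):
    other_dicts.extend(other_dicts)

This doubles the list seven times, extending it to 384 items (3 × 2^7). For any larger set of input dictionaries, the multiprocessing technique becomes the fastest.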

Answer 3

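If you put your solution in a function, you can call it by name for any two dicts. Also, if you curry the function by splitting its arguments across nested functions, you can partially apply the first dict to get back a function that only wants the second (or you could use functools.partial), which makes it easy to map: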

def similarity(a):
    def _(b):
        sharedValue = set(a.items()) & set(b.items())
        dictLength = len(a)
        scoreOfSimilarity = len(sharedValue)
        return scoreOfSimilarity/dictLength
    return _

As an aside: the above can also be written as a single expression using nested lambdas:

similarity = lambda a: lambda b: len(set(a.items()) & set(b.items())) / len(a)

Now you can get the similarity between dictA and the rest with map:

otherDicts = [dictB, dictC, dictD]
scores = list(map(similarity(dictA), otherDicts))
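The functools.partial alternative mentioned earlier can be sketched as follows. Here similarity is written as an ordinary two-argument function (an assumption for this sketch, differing from the curried version), and partial fixes the first argument.

```python
from functools import partial


def similarity(a, b):
    """Fraction of (key, value) pairs of `a` that also occur in `b`."""
    return len(set(a.items()) & set(b.items())) / len(a)


dictA = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
dictB = {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"}
dictC = {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"}
dictD = {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}

otherDicts = [dictB, dictC, dictD]

# partial(similarity, dictA) is a one-argument function that map can call per dict.
scores = list(map(partial(similarity, dictA), otherDicts))
print(scores)
```

This keeps similarity callable with two plain arguments elsewhere, which the curried form does not allow.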

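Now you can use max() (or min(), or whatever fits) to pick the best from the list of scores. A higher score means more shared items, so max() gives the closest match:

winner = max(scores)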

Caveat: I haven't tested any of the above.
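Taking the maximum of the bare scores tells you how similar the best match is, but not which dictionary it was. A minimal sketch that keeps track of both, assuming the candidates are stored under hypothetical string labels:

```python
def similarity(a, b):
    """Fraction of (key, value) pairs of `a` that also occur in `b`."""
    return len(set(a.items()) & set(b.items())) / len(a)


dictA = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}

# Labels are illustrative; any hashable identifier would do.
candidates = {
    "dictB": {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"},
    "dictC": {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"},
    "dictD": {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"},
}

scores = {name: similarity(dictA, d) for name, d in candidates.items()}

# max over the keys, ranked by their score, returns the label of the closest match.
best_name = max(scores, key=scores.get)
print(best_name, scores[best_name])
```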

Answer 4

Thanks to everyone who contributed an answer. Here is the result that does what I need:

def compareTwoDictionaries(self, absolute, reference, listOfDictionaries):
    #look only for absolute fit, yes or no
    if (absolute == True):
        similarity = reference == listOfDictionaries
    else:
        #return items that are the same between two dictionaries
        shared_items = set(reference.items()) & set(listOfDictionaries.items())
        #return the length of the dictionary for further calculation of %
        dictLength = len(reference)
        #return the length of shared_items for further calculation of %
        scoreOfSimilarity = len(shared_items)
        #return final score: similarity
        similarity = scoreOfSimilarity/dictLength
    return similarity

Here is the call to the function:

for candidate in victim_list:
    output = oandaConnectorCalls.compareTwoDictionaries(False, reference, candidate)

The "reference" dict and the "victim_list" list of dicts are used as described above.
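Note that the loop above overwrites output on every pass, so only the last score survives. A minimal sketch that keeps every score follows; compare_two_dictionaries is a hypothetical free-function version of the method (no self, no oandaConnectorCalls object), written only for illustration.

```python
def compare_two_dictionaries(absolute, reference, other):
    """Return True/False for an exact match, or the shared fraction otherwise."""
    if absolute:
        # Look only for an absolute fit, yes or no.
        return reference == other
    # Items that are the same in both dictionaries.
    shared_items = set(reference.items()) & set(other.items())
    return len(shared_items) / len(reference)


reference = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
victim_list = [
    {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"},
    {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"},
    {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"},
]

# One score per dictionary, in the same order as victim_list.
outputs = [compare_two_dictionaries(False, reference, d) for d in victim_list]
exact = [compare_two_dictionaries(True, reference, d) for d in victim_list]
print(outputs, exact)
```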

Calculating a similarity "score" between multiple dictionaries
