Maximum similarity of two character dictionaries?

Time: 2013-02-27 23:49:51

Tags: python

Say I have a Python dictionary:

d = {"a":1, "b":2}

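This represents the number of times each character occurs in a string, so the dictionary above can generate the strings "abb", "bab", or "bba".

The maximum similarity between two dictionaries is a ratio >= 0 and <= 1 that describes how similar the two most similar generated strings are, comparing them position by position.

For example,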

d1 = {"a":1, "b":2}
d2 = {"c": 3}
d3 = {"a":1, "d":2}

max_sim(d1, d2) # equals 0.0 because no position in any arrangement
# of ccc matches the same position in any arrangement of abb
max_sim(d1, d3) # equals 0.333 because an arrangement of add matches
# one out of three characters of an arrangement of abb
# note that if we compared dda and abb, the similarity ratio would be 0.0
# but we always take into account the most similar generated strings

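How can I compute the maximum similarity of any two dictionaries (of the same total length) just by looking at the number of occurrences of each character? That is, by analyzing the dictionaries alone, rather than actually generating the strings and checking the similarity ratio of every pair.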
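For contrast, the brute-force approach the question wants to avoid might look like the sketch below. The names `expand` and `brute_force_max_sim` are hypothetical helpers, not part of the original post, and the sketch assumes both dictionaries are non-empty and describe strings of equal length. It fixes one arrangement of the first string and tries every distinct arrangement of the second, which is enough because only the relative alignment matters.

```python
from itertools import permutations


def expand(d):
    """Build one string that contains each character d[char] times."""
    return "".join(ch * n for ch, n in sorted(d.items()))


def brute_force_max_sim(d1, d2):
    """Reference implementation: best position-wise match over all arrangements."""
    s1 = expand(d1)
    s2 = expand(d2)
    if not s1 or len(s1) != len(s2):
        raise ValueError("dictionaries must be non-empty and of equal total length")
    # Keep s1 fixed and try every distinct arrangement of s2.
    best = max(
        sum(a == b for a, b in zip(s1, candidate))
        for candidate in set(permutations(s2))
    )
    return best / float(len(s1))
```

The number of arrangements grows factorially with string length, so this is only practical as a correctness check on very short inputs.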

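Note: I am using max_sim on dictionaries rather than strings because I have already looped over the two strings to collect their dictionary data (among other things). If I used max_sim on the two strings (either the original strings or the dictionaries converted back to strings), I think I would just be doing redundant computation. I would appreciate an answer that takes the two dictionaries as input.

1 answer:

Answer 0: (score: 1)

How about this: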

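def max_sim(d1, d2):
    # assume the total character count is the same for both dicts
    length = sum(d1.values())
    matches = 0
    # set(d1) | set(d2) works on Python 2 and 3; d1.keys() + d2.keys() fails on Python 3
    for letter in set(d1) | set(d2):
        # a letter can line up at most min(count in d1, count in d2) times
        matches += min(d1.get(letter, 0), d2.get(letter, 0))
    return matches / float(length)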

Result:

d1 = {"a":1, "b":2}
d2 = {"c": 3} 
d3 = {"a":1, "d":2}
d4 = {"a": 1, "b": 1, "c": 1 }

max_sim(d1, d2) # 0.0
max_sim(d1, d3) # 0.333
max_sim(d1, d4) # 0.666
max_sim(d1, d1) # 1.0
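
The same sum-of-minimums idea can also be written with `collections.Counter`, whose `&` operator takes the per-key minimum of two multisets. This is a sketch rather than the answerer's code: the name `max_sim_counter` is hypothetical, and the explicit length check is an added assumption that turns the "same length" precondition into an error instead of a silent wrong ratio.

```python
from collections import Counter


def max_sim_counter(d1, d2):
    """Maximum similarity as the multiset intersection size over the string length."""
    c1, c2 = Counter(d1), Counter(d2)
    length = sum(c1.values())
    if length == 0 or length != sum(c2.values()):
        raise ValueError("dictionaries must be non-empty and of equal total length")
    # Counter & Counter keeps, for each key, the smaller of the two counts.
    overlap = c1 & c2
    return sum(overlap.values()) / float(length)


d1 = {"a": 1, "b": 2}
d4 = {"a": 1, "b": 1, "c": 1}
ratio = max_sim_counter(d1, d4)  # two of three positions can be made to match
```

Both versions run in time proportional to the number of distinct characters, so no strings ever need to be generated.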