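Finding the most similar items using weighted data

Time: 2012-11-28 10:43:49

Tags: python algorithm python-3.x

I have a number of dictionaries holding weighted tags for each artist in my music library. Given a dictionary of weighted tags, I want to find the most similar artists (and perhaps a similarity score as well?).

For example: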

tags = {
    'grails': {
        'post-rock': 100,
        'instrumental': 53,
        'experimental': 38,
        'ambient': 30,
        'post rock': 14,
        'psychedelic': 11,
        'Psychedelic Rock': 6,
        'Progressive rock': 6,
        'rock': 4,
        'instrumental rock': 3,
        'atmospheric': 3,
        'american': 3,
        'space rock': 1
    },
    'camel': {
        'Progressive rock': 100,
        'classic rock': 28,
        'art rock': 24,
        'Progressive': 18,
        'rock': 17,
        'symphonic prog': 7,
        'british': 6,
        'Symphonic Rock': 4,
        'Canterbury Scene': 3,
        'prog rock': 3,
        'prog': 3,
        'Psychedelic Rock': 2,
        'space rock': 1
    },
    'mozart': {
        'Classical': 100,
        'mozart': 30,
        'instrumental': 21,
        'composers': 16,
        'opera': 13,
        'piano': 11,
        'Wolfgang Amadeus Mozart': 9,
        'symphonic': 9,
        'orchestral': 8,
        'austrian': 5
    }
    # etc.
}


best_matches({
            'Progressive rock': 100,
            'experimental': 33,
            'classic rock': 26,
            'Progressive': 23,
            'rock': 23,
            'art rock': 12,
            'psychedelic': 5,
            'prog rock': 5,
            'british': 5,
            'prog': 4,
            'Experimental Rock': 3,
            'Avant-Garde': 3,
            'Psychedelic Rock': 3,
            'Jazz Rock': 2
        }, tags)

# should output camel, then grails, then mozart

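I have heard of recommendation algorithms such as Slope One, but I would like to know whether there is a simpler way to do this kind of calculation in Python, and what the fastest algorithm is for "comparing" all these dictionaries.

2 answers:

Answer 0 (score: 1)

If you treat each music genre as a dimension in a vector space, you can try cosine similarity or Euclidean distance. Cosine similarity is particularly easy: it is just the dot product of the two vectors divided by the product of their L2 norms: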

from math import sqrt

def intersect(a, b):
    """Keys present in both a and b."""
    return (k for k in a if k in b)

def dot(a, b):
    """Dot product of values in a and b (missing keys count as 0)."""
    return sum((a[k] * b[k]) for k in intersect(a, b))

def l2norm(a):
    """L2 norm, aka Euclidean length, of a regarded as a vector."""
    # dict.values() replaces the Python 2-only itervalues()
    return sqrt(sum(v ** 2 for v in a.values()))

def similarity(a, b):
    """Cosine similarity of a and b."""
    return dot(a, b) / (l2norm(a) * l2norm(b))

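If all your weights/scores are non-negative, this returns a number between 0 and 1, where 1 means a perfect match (the two tag vectors have the same relative weights) and 0 means no tags in common. You can read more about cosine similarity in any textbook on information retrieval, e.g. Manning, Raghavan and Schütze.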
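To connect this back to the question, here is a minimal sketch of what a `best_matches` function could look like when built on cosine similarity. The helper name `cosine`, the zero-norm guard, and the tiny `library` dictionary are assumptions made for illustration; they are not part of the original answer.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two sparse tag-weight dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    # Assumption: an empty dict is treated as similarity 0 rather than an error.
    return dot / norm if norm else 0.0

def best_matches(query, tags):
    """Return (artist, score) pairs, most similar first."""
    scores = ((artist, cosine(query, weights)) for artist, weights in tags.items())
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up data, only to show the call shape.
library = {
    'a': {'rock': 10, 'prog': 5},
    'b': {'classical': 10},
    'c': {'rock': 3, 'ambient': 9},
}
ranking = best_matches({'rock': 8, 'prog': 2}, library)
```

Here `ranking` lists `'a'` first, then `'c'`, then `'b'` with a score of 0.0, since `'b'` shares no tags with the query. Sorting the whole library is fine for a personal music collection; for only the top few results, `heapq.nlargest` would avoid the full sort.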

Answer 1 (score: 1)

You should treat each tag as a dimension in a vector space and apply cosine similarity.

For example:

import numpy as np

def cosine_similarity(dict1, dict2):
    # Only tags present in both dicts contribute to the dot product.
    sim = float(sum([dict1[k] * dict2[k] for k in intersect(dict1, dict2)]))
    return sim / (norm_values(dict1) * norm_values(dict2))

def norm_values(d):
    # list(...) is needed on Python 3, where values() returns a view
    v = np.array(list(d.values()))
    return np.sqrt(np.sum(np.square(v)))

def intersect(dict1, dict2):
    return list(set(dict1.keys()) & set(dict2.keys()))

tags = {
    'grails': {
        'post-rock': 100,
        'instrumental': 53,
        'experimental': 38,
        'ambient': 30,
        'post rock': 14,
        'psychedelic': 11,
        'Psychedelic Rock': 6,
        'Progressive rock': 6,
        'rock': 4,
        'instrumental rock': 3,
        'atmospheric': 3,
        'american': 3,
        'space rock': 1
    },
    'camel': {
        'Progressive rock': 100,
        'classic rock': 28,
        'art rock': 24,
        'Progressive': 18,
        'rock': 17,
        'symphonic prog': 7,
        'british': 6,
        'Symphonic Rock': 4,
        'Canterbury Scene': 3,
        'prog rock': 3,
        'prog': 3,
        'Psychedelic Rock': 2,
        'space rock': 1
    },
    'mozart': {
        'Classical': 100,
        'mozart': 30,
        'instrumental': 21,
        'composers': 16,
        'opera': 13,
        'piano': 11,
        'Wolfgang Amadeus Mozart': 9,
        'symphonic': 9,
        'orchestral': 8,
        'austrian': 5
    }
}
query = {
    'Progressive rock': 100,
    'experimental': 33,
    'classic rock': 26,
    'Progressive': 23,
    'rock': 23,
    'art rock': 12,
    'psychedelic': 5,
    'prog rock': 5,
    'british': 5,
    'prog': 4,
    'Experimental Rock': 3,
    'Avant-Garde': 3,
    'Psychedelic Rock': 3,
    'Jazz Rock': 2
}

for t in tags:
    print("{}: {}".format(t, cosine_similarity(tags[t], query)))

This produces (the order of the lines and the number of digits printed depend on the Python version):

mozart: 0.0
grails: 0.141356488829
camel: 0.944080602442
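On the question of speed: when many queries are compared against the same library, each artist's norm never changes, so it can be computed once up front. The following is a sketch of that idea in plain Python; the class name `TagIndex` and its `rank` method are hypothetical names chosen for illustration, not something either answer defines.

```python
from math import sqrt

class TagIndex:
    """Hypothetical helper that caches each artist's L2 norm."""

    def __init__(self, tags):
        self.tags = tags
        # Norms depend only on the library, so compute them once.
        self.norms = {
            artist: sqrt(sum(w * w for w in weights.values()))
            for artist, weights in tags.items()
        }

    def rank(self, query):
        """Return (artist, cosine similarity) pairs, best match first."""
        qnorm = sqrt(sum(w * w for w in query.values()))
        results = []
        for artist, weights in self.tags.items():
            denom = qnorm * self.norms[artist]
            # Iterate over the query's tags; absent tags contribute 0.
            dot = sum(w * weights[k] for k, w in query.items() if k in weights)
            results.append((artist, dot / denom if denom else 0.0))
        results.sort(key=lambda pair: pair[1], reverse=True)
        return results

# Made-up two-artist library, only to show usage.
index = TagIndex({'x': {'rock': 1}, 'y': {'jazz': 1}})
result = index.rank({'rock': 2})
```

With this toy data `result` is `[('x', 1.0), ('y', 0.0)]`. Each query still visits every artist, so the cost stays linear in the library size; the saving is only the repeated norm computation.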