Question

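I have a dataset of items but no user ratings.

Each item has features (~400 features).

I want to measure the similarity between items based on their features (row similarity).

I converted the item features into a binary matrix like the one below:

itemID | feature1 | feature2 | feature3 | feature4 ....
1      | 0        | 1        | 1        | 0
2      | 1        | 0        | 0        | 1
3      | 1        | 1        | 1        | 0
4      | 0        | 0        | 1        | 1

I don't know what to use (or how to use it) to measure row similarity.

For a given item X, I want to get the top k most similar items.

Sample code would be much appreciated.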

Answer 1

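What you are looking for is a similarity measure. A quick Google / SO search will turn up various ways to compute the similarity between two vectors. Here is some sample Python code for cosine similarity: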

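from math import sqrt

def square_rooted(x):
    # Euclidean (L2) norm of x; left unrounded so the final result stays accurate
    return sqrt(sum(a * a for a in x))

def cosine_similarity(x, y):
    # dot(x, y) / (||x|| * ||y||), rounded to 3 decimals for display
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    if denominator == 0:
        return 0.0  # an all-zero vector has no direction; treat as no similarity
    return round(numerator / denominator, 3)

print(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))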

Taken from: http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
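Cosine similarity is only one of the available measures. Because the matrix in the question is binary, Jaccard similarity (shared features divided by features present in either item) is another common choice. This is a swapped-in alternative, not something taken from the linked article's code; the function name `jaccard_similarity` is just an illustrative choice, and treating two all-zero rows as similarity 0 is an assumption.

```python
def jaccard_similarity(x, y):
    """Jaccard similarity of two equal-length 0/1 feature vectors."""
    # Features switched on in both items
    both = sum(1 for a, b in zip(x, y) if a and b)
    # Features switched on in at least one of the items
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Assumption: two all-zero rows are treated as having similarity 0
    return both / either if either else 0.0

# Rows for items 1 and 3 from the matrix in the question
item1 = [0, 1, 1, 0]
item3 = [1, 1, 1, 0]
print(jaccard_similarity(item1, item3))
```

Unlike cosine similarity, Jaccard ignores features that are absent from both items, which can matter when most of the ~400 features are 0 for any given item.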

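I noticed that you want the top k similar items for each item. The best way to do that is with a k-nearest-neighbor (kNN) implementation. You can build a kNN graph and return the top k similar items from the graph for a given query.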
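For a small dataset, the kNN graph can be built by brute force before reaching for a dedicated library. The following is a minimal sketch using the four example rows from the question; the names `items`, `cosine`, `top_k_similar`, and `knn_graph` are hypothetical, and the approach compares every pair of items, so it only suits modest item counts.

```python
from math import sqrt

# Binary feature matrix from the question: itemID -> feature vector
items = {
    1: [0, 1, 1, 0],
    2: [1, 0, 0, 1],
    3: [1, 1, 1, 0],
    4: [0, 0, 1, 1],
}

def cosine(x, y):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def top_k_similar(item_id, items, k):
    """Return the k (other_id, similarity) pairs most similar to item_id."""
    query = items[item_id]
    scores = [(other, cosine(query, vec))
              for other, vec in items.items() if other != item_id]
    # Highest similarity first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Brute-force kNN graph: every item mapped to its k most similar items
knn_graph = {item_id: top_k_similar(item_id, items, k=2) for item_id in items}

# Top 2 items most similar to item 1
print(knn_graph[1])
```

With many thousands of items the pairwise loop becomes slow, which is where an approximate nearest-neighbor index is the better fit.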

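A great library for this is nmslib. Here is sample code from the library for a kNN query using the HNSW method with cosine similarity (you can use any of the several available methods; HNSW is particularly efficient for high-dimensional data like yours):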

import nmslib
import numpy

# create a random matrix to index
data = numpy.random.randn(10000, 100).astype(numpy.float32)

# initialize a new index, using a HNSW index on Cosine Similarity
index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(data)
index.createIndex({'post': 2}, print_progress=True)

# query for the nearest neighbours of the first datapoint
ids, distances = index.knnQuery(data[0], k=10)

# get the nearest neighbours for all the datapoints
# using a pool of 4 threads to compute
neighbours = index.knnQueryBatch(data, k=10, num_threads=4)

At the end of the code, the top k neighbors of each data point are stored in the neighbours variable. You can use it for your purposes.
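To turn that result into "top k similar items per item ID", the row positions have to be mapped back to item IDs. The sketch below assumes that `knnQueryBatch` returns one `(ids, distances)` pair per query row, that the ids are row positions in the indexed matrix, and that the `cosinesimil` distance equals 1 minus the cosine similarity; check these against the nmslib documentation for your version. The helper `neighbours_to_items` and the sample `neighbours` list are hypothetical stand-ins, not real library output.

```python
def neighbours_to_items(neighbours, item_ids, k):
    """Map row positions back to item IDs, dropping each item's own entry."""
    result = {}
    for row, (ids, distances) in enumerate(neighbours):
        pairs = [
            # Assumption: similarity = 1 - distance for the cosine space
            (item_ids[int(i)], 1.0 - float(d))
            for i, d in zip(ids, distances)
            if int(i) != row  # skip the query item itself
        ]
        result[item_ids[row]] = pairs[:k]
    return result

# item_ids[row] is the itemID stored at that row of the indexed matrix
item_ids = [1, 2, 3, 4]

# Made-up stand-in shaped like the assumed knnQueryBatch output (k=3)
neighbours = [
    ([0, 2, 3], [0.0, 0.25, 0.5]),
    ([1, 3, 2], [0.0, 0.5, 0.75]),
    ([2, 0, 3], [0.0, 0.25, 0.5]),
    ([3, 0, 1], [0.0, 0.5, 0.5]),
]

similar = neighbours_to_items(neighbours, item_ids, k=2)
print(similar[1])
```

Because each item is usually returned as its own nearest neighbor, querying with k + 1 and then dropping the self-match leaves k genuine neighbors.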

Item similarity based on their features
