How to correctly use sklearn cosine_similarity with a dictionary

Time: 2016-03-04 15:33:05

Tags: python numpy dictionary scipy scikit-learn

I have a csv file that contains an id followed by 4000 additional columns of floats. So a row looks like:

12323,3.8,3.1,4.2,.....

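I'm trying to compare a single row against all the other rows, using the cosine distance metric to find which rows are most similar. At the moment I compare each row, held as a numpy array, against the single item one at a time. I would like to compare all of them at once instead of one by one. The complication is that I use a dictionary, because I refer to each object by its key. As you can see in the code below, the dict stores the id as the key and the array of floats as the value.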

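import csv
import operator

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# item is the single query vector; item, file and args are defined elsewhere
vectors = {}
scores = {}
with open(file, newline='') as csvfile:
  reader = csv.reader(csvfile, delimiter=',', quotechar='|')
  for row in reader:
    # first column is the id, the remaining columns are the floats
    vectors[row[0]] = np.array(row[1:], dtype=float)

for k, v in vectors.items():
  # cosine_similarity expects 2-D inputs, so reshape each vector to a single row
  score = cosine_similarity(np.reshape(item, (1, -1)), v.reshape(1, -1))[0, 0]
  scores[k] = score
sorted_scores = sorted(scores.items(), key=operator.itemgetter(1))
print(sorted_scores[-args.k:])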

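How can I get the same result without a for loop that scores each row individually? From reading about the distance metrics, I should be able to pass the whole array of rows to cosine_similarity.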

1 Answer:

Answer 0: (score: 0)

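You can use sklearn.metrics.pairwise.pairwise_distances, which returns a distance matrix, so you don't have to loop. Note that with metric='cosine' it returns cosine distance (1 minus cosine similarity), so the most similar rows are the ones with the smallest distance.

You can build a dictionary that maps each row position to its id, compute all the distances at once, and then look up the id for any row in the result.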
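To make the relationship between the two functions concrete, here is a minimal sketch. The names rows and query are made up for illustration, and both inputs are 2-D arrays because recent scikit-learn versions reject 1-D input.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# toy data (illustrative only): three feature rows and one query row, both 2-D
rows = np.array([[0.8, 0.9], [0.9, 0.1], [0.7, 0.8]])
query = np.array([[0.8, 0.8]])

sim = cosine_similarity(rows, query)                      # shape (3, 1)
dist = pairwise_distances(rows, query, metric='cosine')   # shape (3, 1)

# cosine distance is 1 - cosine similarity, so the two rankings are mirror images
print(np.allclose(dist, 1.0 - sim))
print(np.argmax(sim) == np.argmin(dist))
```

So if you sort by similarity you take the largest values, and if you sort by distance you take the smallest.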
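If the data already lives in a dict like the one in the question, one possible approach is to freeze the key order in a list, stack the values into a matrix, and score everything in one call. This is only a sketch: the vectors dict and item below are small hypothetical stand-ins for the real 4000-column data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical stand-in for the dict built from the csv file
vectors = {
    '12345': np.array([0.8, 0.9]),
    '11111': np.array([0.9, 0.1]),
    '22222': np.array([0.7, 0.8]),
}
item = np.array([0.8, 0.8])  # the single query vector

ids = list(vectors.keys())                     # row i of matrix belongs to ids[i]
matrix = np.vstack([vectors[i] for i in ids])  # shape (n_rows, n_features)

# one call scores every row against the query;
# ravel() flattens the (n_rows, 1) result to shape (n_rows,)
sims = cosine_similarity(matrix, item.reshape(1, -1)).ravel()

# rebuild the id -> score mapping that the loop in the question produced
scores = dict(zip(ids, sims))
print(scores)
```

Keeping ids and matrix in the same order is the important part: the position in sims is the only link back to the original keys.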

Working example:

import numpy as np
from sklearn.metrics import pairwise

# query vector; pairwise_distances expects 2-D input, hence the nested list
main_object = np.array([[0.8, 0.8]])

# first column is the id, the remaining columns are the features
X = np.array([
    [12345, 0.8, 0.9],
    [11111, 0.9, 0.1],
    [22222, 0.7, 0.8]])

# build a dictionary of <key=position (row in samples), value=id>
dict_ids = {idx: int(row_id) for idx, row_id in enumerate(X[:, 0])}
print(dict_ids)

# compute all distances in one shot; X[:, 1:] omits the id column
dist = pairwise.pairwise_distances(X[:, 1:], main_object, metric='cosine')
print(dist)

# dist now holds every row's distance to main_object; for example, the closest row:
closest = int(np.argmin(dist))
print('id:', dict_ids[closest], 'dist:', dist[closest, 0])
# output: id: 12345, dist: approximately 0.0017
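The question asks for the k best matches rather than only the single closest one. A possible extension, sketched here with the same toy numbers and a hypothetical k, is to sort the row positions by distance with np.argsort and map them back to ids.

```python
import numpy as np
from sklearn.metrics import pairwise

# same toy data as the example above, with ids kept in a separate array
ids = np.array([12345, 11111, 22222])
features = np.array([[0.8, 0.9], [0.9, 0.1], [0.7, 0.8]])
main_object = np.array([[0.8, 0.8]])
k = 2  # hypothetical number of neighbours to keep

# ravel() turns the (n_rows, 1) distance matrix into a flat vector
dist = pairwise.pairwise_distances(features, main_object, metric='cosine').ravel()

# argsort orders row positions from smallest to largest distance
order = np.argsort(dist)[:k]
top_k = [(int(ids[i]), float(dist[i])) for i in order]
print(top_k)
```

For very large inputs, np.argpartition can avoid a full sort, at the cost of returning the k smallest entries in no particular order.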