I have a CSV file containing an id followed by 4000 additional float columns, so a row looks like:
12323,3.8,3.1,4.2,.....
I am trying to compare a single row against all the other rows, using the cosine distance metric to find which rows are most similar. Currently I compare each row of the numpy array against the single item one at a time. I would like to compare all items at once instead of one at a time. The problem is that I use a dictionary, because I use the keys to reference each object. As you can see in the code below, the dictionary stores the id as the key, and the value is the array of floats.
import csv
import operator
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# item is the single row being compared against the rest
vectors = {}  # renamed from `dict`, which shadows the builtin
scores = {}
with open(file, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in reader:
        # row[0] is the id; the remaining fields are the float columns
        vectors[row[0]] = np.array(row[1:], dtype=float)
for k, v in vectors.items():
    # cosine_similarity expects 2-D inputs, hence the reshape
    score = cosine_similarity(item.reshape(1, -1), v.reshape(1, -1))
    scores[k] = score
sorted_scores = sorted(scores.items(), key=operator.itemgetter(1))
print(sorted_scores[-args.k:])
How can I get the same result without using a for loop to score each row individually? From reading about the distance metrics, I should be able to pass an entire array of rows to cosine_similarity at once.
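The per-row loop can in principle be replaced by stacking every dictionary value into a single matrix and making one cosine_similarity call; the ids are kept in a parallel list so each score can still be looked up by key. A minimal sketch with made-up ids and 3-element vectors standing in for the 4000-column rows:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the CSV data: id -> feature vector
vectors = {'12323': np.array([3.8, 3.1, 4.2]),
           '45678': np.array([1.0, 0.5, 0.2])}
item = np.array([3.7, 3.0, 4.1])

# Stack every row into one matrix; keep the ids in a parallel list
ids = list(vectors.keys())
matrix = np.vstack([vectors[i] for i in ids])

# One call scores item against every row at once (result shape: 1 x n_rows)
sims = cosine_similarity(item.reshape(1, -1), matrix)[0]
scores = dict(zip(ids, sims))
```

After this, `scores` has the same key-to-score mapping the original loop built, and can be sorted the same way.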
Answer 0 (score: 0)
You can use sklearn.metrics.pairwise.pairwise_distances, which returns a distance matrix, so you don't have to loop. You can build a dictionary of id correspondences so you can compute the distances and then easily access them.
Working example:
import numpy as np
from sklearn.metrics import pairwise

main_object = [[0.8, 0.8]]  # must be 2-D for pairwise_distances
X = np.array([
    [12345, 0.8, 0.9],
    [11111, 0.9, 0.1],
    [22222, 0.7, 0.8]])
# build a dictionary of <key=position (row in samples), value=id>
dict_ids = {idx: int(id_) for idx, id_ in enumerate(X[:, 0])}
print(dict_ids)
# calculate all distances in one shot; X[:, 1:] omits the id column
dist = pairwise.pairwise_distances(X[:, 1:], main_object, metric='cosine')
print(dist)
# dist now holds every distance to main_object. Now you can play with it,
# for example if you want the minimum distance:
print('id:', dict_ids[np.argmin(dist)], 'dist:', dist.min())
# output -> id: 12345 dist: 0.00172563
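Since the question sorts the scores and prints the top `args.k` entries, the same can be done on the distance matrix with `np.argsort` instead of building and sorting a scores dictionary. A sketch continuing the example above (`k = 2` is a hypothetical stand-in for `args.k`):

```python
import numpy as np
from sklearn.metrics import pairwise

X = np.array([
    [12345, 0.8, 0.9],
    [11111, 0.9, 0.1],
    [22222, 0.7, 0.8]])
main_object = [[0.8, 0.8]]

# distances of every row to main_object, flattened to 1-D
dist = pairwise.pairwise_distances(X[:, 1:], main_object, metric='cosine').ravel()

k = 2  # hypothetical value of args.k from the question
nearest = np.argsort(dist)[:k]          # indices of the k smallest distances
top_ids = [int(X[i, 0]) for i in nearest]
print(top_ids)  # ids of the k most similar rows
```

Note that `argsort` sorts ascending, so taking the first k elements gives the k *smallest* distances, which matches the question's `sorted_scores[-args.k:]` on similarities.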