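I want to compute the pairwise cosine similarity between one row of a sparse matrix and all of the remaining rows. (Why? Each row is a vectorized product_title, and I want to retrieve similar products by their id values.)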
Previously, I built df_cleaned as a <504x41732 sparse matrix> (each row is one product title, and the columns come from the tokens).
I defined:
def pairw_cos(prod_idx):
    prod = df_cleaned[prod_idx]
    foll_idx = prod_idx + 1  # a trick to select the rest of the rows on the following line
    candidates_matrix = scipy.sparse.vstack([df_cleaned[:prod_idx, :], df_cleaned[foll_idx:, :]])
    simil_cosine = {}
    for candidates_idx, single_candidate in candidates_matrix.iterrows():
        single_simil = cosine_similarity(prod, single_candidate)
        simil_cosine[candidates_idx] = single_simil
    return pd.Series(simil_cosine)
But this does not work (sparse matrices have no iterrows method). Then I tried:
for row in candidates_matrix:
    for candidates_idx, single_candidate in row:
        single_simil = cosine_similarity(prod, single_candidate)
        simil_cosine[candidates_idx] = single_simil
Then, when calling the function, I got:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-4c45754152cc> in <module>()
----> 1 pairw_cos2(2)

<ipython-input-52-12d55d3c35e5> in pairw_cos2(prod_idx)
      7
      8     for row in candidates_matrix:
----> 9         for candidates_idx, single_candidate in row:
     10             single_simil = cosine_similarity(prod,single_candidate)
     11             simil_cosine[candidates_idx] = single_simil

ValueError: not enough values to unpack (expected 2, got 1)
Answer 0 (score: 0)
In case anyone runs into the same question, I finally solved it:
import pandas as pd
import scipy.sparse
from sklearn.metrics.pairwise import cosine_similarity

def pairwise_cosine(prod_idx):
    prod = df_cleaned[prod_idx]
    foll_idx = prod_idx + 1
    # Stack every row except prod_idx into a single candidate matrix.
    candidates_matrix = scipy.sparse.vstack([df_cleaned[:prod_idx, :], df_cleaned[foll_idx:, :]])
    simil_cosine = {}
    # Iterating over the sparse matrix yields one 1 x n_features row at a time.
    for index, row in enumerate(candidates_matrix):
        simil_cosine[index] = cosine_similarity(row, prod)
    # Keys are positions in candidates_matrix, not in df_cleaned.
    return pd.Series(simil_cosine)
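The row-by-row loop can probably be avoided. This is a minimal sketch, assuming cosine_similarity is the one from sklearn.metrics.pairwise (the post does not say where it comes from), which accepts sparse input and compares one row against a whole matrix in a single call. The function name pairwise_cosine_vectorized and the toy matrix are made up for illustration and only stand in for df_cleaned.

```python
import numpy as np
import pandas as pd
import scipy.sparse
from sklearn.metrics.pairwise import cosine_similarity

def pairwise_cosine_vectorized(matrix, prod_idx):
    """Cosine similarity of row prod_idx against every other row of matrix."""
    # One call compares the 1 x n_features row with all rows at once;
    # the result is a dense array of shape (1, n_rows), flattened here.
    sims = cosine_similarity(matrix[prod_idx], matrix).ravel()
    # The default integer index matches the original row positions;
    # drop the comparison of the product with itself.
    return pd.Series(sims).drop(prod_idx)

# Hypothetical stand-in for df_cleaned: 4 "titles" x 3 "tokens".
toy = scipy.sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
    [2.0, 0.0, 0.0],
]))
result = pairwise_cosine_vectorized(toy, 0)
print(result.sort_values(ascending=False))
```

Unlike the loop above, whose keys are positions in candidates_matrix, this sketch keeps the original row positions as the index, which may make it easier to map the scores back to product ids.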