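I have a similarity matrix (a pandas DataFrame). I want to iterate over every product, get its top 5 most similar products, and store them in a final DataFrame called itemAffinity. The similarity matrix covers 31878 items (products), i.e. 31878 columns and 31878 rows. The function below never finishes (it takes far too long).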
import pandas as pd

def get_items_similarity_score(similarity_matrix):
    products_list = similarity_matrix.columns.values.tolist()
    # Create an empty data frame to store item affinity scores for items.
    itemAffinity = pd.DataFrame(columns=('item1', 'item2', 'score'))
    rowCount = 0
    for item in products_list:
        # get top 5 similar products which are not item
        if isinstance(item, int):
            # take 6 because the item itself is expected to be among them
            series_sim = similarity_matrix.loc[item].nlargest(6)
            # print series_sim
            df = pd.DataFrame({'product': series_sim.index, 'score': series_sim.values})
            df = df[df['product'] != item]
            for r in range(len(df)):
                itemAffinity.loc[rowCount] = [item, df.iloc[r]['product'], df.iloc[r]['score']]
                rowCount += 1
    itemAffinity.sort_values("score", ascending=False, inplace=True)
    return itemAffinity
The function I use to generate the similarity matrix:
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(data_items):
    """Calculate the column-wise cosine similarity for a sparse
    matrix. Return a new dataframe matrix with similarities.
    """
    data_sparse = sparse.csr_matrix(data_items)
    # pairwise similarities between all samples in data_sparse.transpose()
    similarities = cosine_similarity(data_sparse.transpose())
    sim = pd.DataFrame(data=similarities, index=data_items.columns, columns=data_items.columns)
    return sim
Is there a more efficient way to get the same result?
Answer 0 (score: 1)
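Let df be your similarity matrix (I assume the main diagonal has been zeroed out to avoid high self-similarity). Find the largest element of each column and its row index separately, then combine the two parts into a new dataframe: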
import pandas as pd

# Toy matrix
df = pd.DataFrame({'a': [0, 0.1, 0.2],
                   'b': [0.5, 0., 0.7],
                   'c': [0.5, 0.75, 0]}, index=('a', 'b', 'c'))
# Per column: row label of the maximum (idxmax) and the maximum itself (max)
best = pd.concat([df.idxmax(), df.max()], axis=1).reset_index()
best.columns = "prod1", "prod2", "sim"
#   prod1 prod2   sim
# 0     a     c  0.20
# 1     b     c  0.70
# 2     c     b  0.75
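The snippet above returns only the single best match per product, while the question asks for the top 5. One possible way to extend the idea is sketched below. It is not part of the original answer: the helper name top_k_similar and the toy data are made up for illustration, and it assumes a square matrix whose index and columns carry the same labels in the same order. It works on the underlying NumPy array and builds the result in one step, which avoids growing itemAffinity one row at a time with .loc (row-by-row enlargement is usually the slow part of such loops). Unlike the function in the question, it does not skip non-integer labels.

```python
import numpy as np
import pandas as pd


def top_k_similar(sim: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Return the k most similar items for every item, in long format.

    Assumption: `sim` is square, with identical labels (same order)
    on the index and the columns.
    """
    labels = sim.index.to_numpy()
    n = len(labels)
    k = min(k, n - 1)

    # Work on a copy so the caller's matrix is left untouched.
    values = sim.to_numpy(dtype=float, copy=True)
    # Exclude self-matches by pushing the diagonal below any real score.
    np.fill_diagonal(values, -np.inf)

    # argpartition places the k largest entries of each row in the last
    # k positions without fully sorting the row.
    top = np.argpartition(values, n - k, axis=1)[:, n - k:]
    scores = np.take_along_axis(values, top, axis=1)

    # Sort only those k candidates per row by descending score.
    order = np.argsort(-scores, axis=1)
    top = np.take_along_axis(top, order, axis=1)
    scores = np.take_along_axis(scores, order, axis=1)

    result = pd.DataFrame({
        "item1": np.repeat(labels, k),
        "item2": labels[top.ravel()],
        "score": scores.ravel(),
    })
    return result.sort_values("score", ascending=False, ignore_index=True)


# Hypothetical symmetric toy matrix, for illustration only
toy = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.4],
     [0.9, 1.0, 0.3, 0.1],
     [0.2, 0.3, 1.0, 0.8],
     [0.4, 0.1, 0.8, 1.0]],
    index=list("abcd"), columns=list("abcd"))
item_affinity = top_k_similar(toy, k=2)
print(item_affinity)
```

For a dense 31878 x 31878 matrix the intermediate arrays are large (several gigabytes each), so if memory is tight the same steps could be applied to blocks of rows and the partial results concatenated.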