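Converting a dictionary to a sparse matrix

Asked: 2016-06-16 14:29:44

Tags: python dictionary matrix

I have a dictionary whose keys are user_ids and whose values are lists of the movie_ids each user liked, with #unique_users = 573000 and #unique_movies = 16000.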

  

{1: [51, 379, 552, 2333, 2335, 4089, 4484],
 2: [51, 379, 552, 1674, 1688, 2333, 3650, 4089, 4296, 4484],
 5: [783, 909, 1052, 1138, 1147, 2676],
 7: [171, 321, 959],
 9: [3193],
 10: [959],
 11: [131, 567, 897, 923], ..........}

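Now I want to convert it into a matrix whose rows are user_ids and whose columns are movie_ids, with a value of 1 wherever the user liked the movie. The matrix would be 573000 x 16000.

Ultimately I have to multiply this matrix by its transpose to get a co-occurrence matrix of dimension (#unique_movies, #unique_movies).

Also, what is the time complexity of the X' * X operation when X has shape (500000, 12000)?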

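3 answers:

Answer 0 (score: 4)

I think you can construct an empty dok_matrix and fill in the values. Then transpose it and convert it to a csr_matrix for efficient matrix multiplication.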

import numpy as np
import scipy.sparse as sp

d = {1: [51, 379, 552, 2333, 2335, 4089, 4484], 2: [51, 379, 552, 1674, 1688, 2333, 3650, 4089, 4296, 4484], 5: [783, 909, 1052, 1138, 1147, 2676], 7: [171, 321, 959], 9: [3193], 10: [959], 11: [131, 567, 897, 923]}

# Empty users x movies matrix; dok_matrix is cheap to fill incrementally.
mat = sp.dok_matrix((573000, 16000), dtype=np.int8)

# Mark every (user, movie) pair with a 1.
for user_id, movie_ids in d.items():
    mat[user_id, movie_ids] = 1

# Transpose to movies x users and convert to CSR for fast multiplication.
mat = mat.transpose().tocsr()
print(mat.shape)
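As a follow-up to the snippet above, here is a minimal sketch of the multiplication step on toy data. The small dictionary and the shape are assumptions made for illustration, not the asker's data. Two points it tries to show: a sparse product keeps the operands' dtype, so int8 counts would wrap once a count passes 127 and it seems safer to widen first; and the work for the product is roughly the sum of k*k over users, where k is the number of movies a user liked, which gives a rough answer to the complexity question.

```python
import numpy as np
import scipy.sparse as sp

# Toy stand-in data (assumed); the shape is chosen to fit these ids.
d = {0: [1, 2], 1: [1, 2, 3], 2: [3]}
n_users, n_movies = 3, 4

mat = sp.dok_matrix((n_users, n_movies), dtype=np.int8)
for user_id, movie_ids in d.items():
    for movie_id in movie_ids:
        mat[user_id, movie_id] = 1

mat = mat.transpose().tocsr()      # movies x users, as in the answer

# Widen the dtype before multiplying so large counts cannot wrap around.
m32 = mat.astype(np.int32)
cooc = m32 @ m32.T                 # movies x movies co-occurrence counts

# Rough cost estimate: each user with k liked movies adds about k*k multiply-adds.
approx_ops = sum(len(v) ** 2 for v in d.values())

print(cooc.toarray())
print(approx_ops)
```

The diagonal of cooc holds the number of users who liked each movie, and the off-diagonal entries hold the number of users who liked both movies.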

Answer 1 (score: 2)

import pandas as pd

df = {1: [51, 379, 552, 2333, 2335, 4089, 4484], 2: [51, 379, 552, 1674, 1688, 2333, 3650, 4089, 4296, 4484], 5: [783, 909, 1052, 1138, 1147, 2676], 7: [171, 321, 959], 9: [3193], 10: [959], 11: [131, 567, 897, 923],
      # ..........
      }
df2 = pd.DataFrame.from_dict(df, orient='index')
df2 = df2.stack().reset_index()  # columns: level_0 (user), level_1 (list position), 0 (movie)
df2['level_1'] = 1
df2.pivot(index='level_0', columns=0, values='level_1').fillna(0)

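This converts the dict to a DataFrame, then stacks it to get the userID and movieID in separate columns, and then sets every value of the otherwise unused column level_1 to 1. The last statement creates a pivot table, filling the combinations that do not exist with zeros.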
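One caveat with the pivot: the result is a dense table, and at 573000 x 16000 with float64 values that is on the order of 73 GB. A possible alternative, sketched here under assumptions, is to keep the (user, movie) pairs in long form and hand them to a SciPy sparse constructor. The toy dictionary, the column names, and the use of pd.factorize to map ids to compact row and column positions are illustrative choices, not part of the answer above.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Toy stand-in data (assumed).
d = {1: [51, 379], 2: [51, 552], 5: [783]}

# One row per (user, movie) pair.
pairs = pd.DataFrame(
    [(u, m) for u, movies in d.items() for m in movies],
    columns=["user_id", "movie_id"],
)

# Map raw ids to 0-based positions; the uniques give the way back to the ids.
user_codes, user_ids = pd.factorize(pairs["user_id"], sort=True)
movie_codes, movie_ids = pd.factorize(pairs["movie_id"], sort=True)

X = sp.coo_matrix(
    (np.ones(len(pairs), dtype=np.int32), (user_codes, movie_codes)),
    shape=(len(user_ids), len(movie_ids)),
).tocsr()

print(X.shape)
print(X.toarray())
```

Row i of X corresponds to user_ids[i] and column j to movie_ids[j], so the matrix has no empty rows or columns for ids that never occur.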

Answer 2 (score: 0)

You can create the csr_matrix in one go, for example with this form: csr_matrix((data, (row_ind, col_ind))). Here is a snippet showing how to do it.

import scipy.sparse as sp
d = {0: [0,1], 1: [1,2,3], 
     2: [3,4,5], 3: [4,5,6], 
     4: [5,6,7], 5: [7], 
     6: [7,8,9]}
row_ind = [k for k, v in d.items() for _ in range(len(v))]
col_ind = [i for ids in d.values() for i in ids]
X = sp.csr_matrix(([1]*len(row_ind), (row_ind, col_ind))) # sparse csr matrix

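You can later use the matrix X to find the co-occurrence matrix, i.e. X.T * X (credit: github @daniel-acuna). I suspect there is a faster way to convert the dictionary of lists into row_ind and col_ind.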
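One candidate for that faster conversion, offered as an unbenchmarked sketch rather than a known best approach, is to build the index arrays with NumPy: repeat each key by the length of its list, and concatenate the lists. The variable names keys, lengths, and data are introduced here for illustration.

```python
import numpy as np
import scipy.sparse as sp

d = {0: [0, 1], 1: [1, 2, 3],
     2: [3, 4, 5], 3: [4, 5, 6],
     4: [5, 6, 7], 5: [7],
     6: [7, 8, 9]}

# One entry per user: the key and how many movies that user liked.
keys = np.fromiter(d.keys(), dtype=np.int64, count=len(d))
lengths = np.fromiter((len(v) for v in d.values()), dtype=np.int64, count=len(d))

# Repeat each user id once per liked movie; flatten the movie lists in the same order.
row_ind = np.repeat(keys, lengths)
col_ind = np.concatenate([np.asarray(v, dtype=np.int64) for v in d.values()])
data = np.ones(len(row_ind), dtype=np.int32)

X = sp.csr_matrix((data, (row_ind, col_ind)))  # users x movies
cooc = X.T @ X                                 # movies x movies co-occurrence

print(X.shape)
print(cooc.toarray())
```

This relies on d.keys() and d.values() iterating in the same order, which holds for a dictionary that is not modified in between.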