Question

我正在使用Movielens数据集。 ratings.dat / csv格式是

用户id，MovieId，评级，时间戳
1,1,5.0,52234234

初始数据集：

 user    movie   rating
    1       43      3
    1       57      2
    2       219     4

需要转向：

user        1   2
movie   43  3   0
        57  2   0
        219 0   4

为了提出建议，矩阵需要在行（用户）列（movieId）中以检查相似性。如本教程所示：http://maheshakya.github.io/gsoc/2014/05/18/preparing-a-bench-marking-data-set-using-singula-value-decomposition-on-movielens-data.html

我得到的输出是这样的：

8003 636e 756d 7079 2e63 6f72 652e 6d75
6c74 6961 7272 6179 0a5f 7265 636f 6e73 
7472 7563 740a 7100 636e 756d 7079 0a6e 
6461 7272 6179 0a71 014b 0085 7102 4301
6271 0387 7104 5271 0528 4b01 4dce 024d  
...
...

据我所知，为了检查相似性然后提出建议我需要一个矩阵，其中第一行（userId =＆＃34; 1＆＃34;）每部电影都有0-5（等级）值。

python脚本（我使用.dat和.csv文件）：

import pandas as pd
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds
import pickle
data_file = pd.read_table(r'rat.csv', sep = ',', header=None,engine='python')
users = np.unique(data_file[0])
movies = np.unique(data_file[1])

number_of_rows = len(users)
number_of_columns = len(movies)

movie_indices, user_indices = {}, {}

for i in range(len(movies)):
    movie_indices[movies[i]] = i

for i in range(len(users)):
    user_indices[users[i]] = i
    #scipy sparse matrix to store the 1M matrix
V = sp.lil_matrix((number_of_rows, number_of_columns))

#adds data into the sparse matrix
for line in data_file.values:
    u, i , r , gona = map(int,line)
    V[user_indices[u], movie_indices[i]] = r
    #as these operations consume a lot of time, it's better to save processed data 
with open('movielens_1M.pickle', 'wb') as handle:
    pickle.dump(V, handle)
    #as these operations consume a lot of time, it's better to save processed data 
#gets SVD components from 10M matrix
u,s, vt = svds(V, k = 10)

with open('movielens_1M_svd_u.pickle', 'wb') as handle:
    pickle.dump(u, handle)
with open('movielens_1M_svd_s.pickle', 'wb') as handle:
    pickle.dump(s, handle)
with open('movielens_1M_svd_vt.pickle', 'wb') as handle:
    pickle.dump(vt, handle)
    s_diag_matrix = np.zeros((s.shape[0], s.shape[0]))

for i in range(s.shape[0]):
    s_diag_matrix[i,i] = s[i]
    X_lr = np.dot(np.dot(u, s_diag_matrix), vt)

with open('movielens.pickle', 'wb') as handle:
    pickle.dump(X_lr, handle)

Python SVD脚本，用于机器学习的格式Matrix

0 个答案: