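I am implementing a recommender system for a portal with 1 million unique users (per month) and 13,000 items, and I want it to cope well with data of this size.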
#---Item-based recommendations using sparse matrices---#
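import pandas as pd
from scipy.sparse import csc_matrix, csr_matrix
from sklearn import preprocessing as pp  # assumed: `pp` is sklearn.preprocessing

data = pd.read_csv('.../.csv').astype(float)

def cosine_similarities(mat):
    # L2-normalize each item column, then take all pairwise dot products
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat

data_germany = data.drop('user', axis=1)
data = csc_matrix(data)
data_germany = csr_matrix(data_germany)
csc = cosine_similarities(data_germany)
csc = csc.tocoo(copy=False)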
csc.data
Out[74]:
array([ 0.02988072, 0.01698824, 0.0174342 , ..., 0.03207501,
0.09016696, 0.06804138])
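With the cosine similarities stored in a sparse matrix, I can take any item and read its recommendations straight from the row/column data. That part is easy.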
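A minimal sketch of that lookup, assuming the similarity matrix is converted to CSR so that one item's row can be sliced cheaply. The function name `top_similar_items` and the toy matrix are hypothetical and only serve as illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

def top_similar_items(sim, item, n=3):
    """Return (indices, scores) of the n items most similar to `item`."""
    row = csr_matrix(sim).getrow(item)   # 1 x n_items sparse row
    idx, vals = row.indices, row.data    # stored (non-zero) entries only
    keep = idx != item                   # drop the item's similarity to itself
    idx, vals = idx[keep], vals[keep]
    order = np.argsort(-vals)[:n]        # largest similarities first
    return idx[order], vals[order]

# Toy 4 x 4 item-item similarity matrix (hypothetical values)
sim = csr_matrix(np.array([
    [1.0, 0.2, 0.0, 0.7],
    [0.2, 1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.1],
    [0.7, 0.0, 0.1, 1.0],
]))
neighbours, scores = top_similar_items(sim, 0, n=2)
print(neighbours, scores)
```

Only the stored entries of one row are touched, so the cost depends on how many neighbours an item has, not on the total number of items.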
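The problem is how to implement user-based CF with sparse matrices. SciPy sparse matrices offer only a small set of methods, and those do not let me write the full user-based CF code against sparse data. I want the same result as my CF code below, but computed on sparse matrices.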
#---User-based recommendations (current pandas version, not sparse)---#
# Note: `data` and `data_germany` here are the pandas DataFrames (before the
# sparse conversion); `data_neighbours` and `data_ibs` are defined elsewhere.

# Helper function to get similarity scores
def getScore(history, similarities):
    return sum(history*similarities)/sum(similarities)

# Create a placeholder matrix for similarities, and fill in the user name column
data_sims = pd.DataFrame(index=data.index, columns=data.columns)
data_sims.iloc[:, :1] = data.iloc[:, :1]

# Loop through all rows, skip the user column, and fill with similarity scores
for i in range(len(data_sims.index)):
    for j in range(1, len(data_sims.columns)):
        user = data_sims.index[i]
        product = data_sims.columns[j]
        if data.iloc[i, j] == 1:
            data_sims.iloc[i, j] = 0
        else:
            product_top_names = data_neighbours.loc[product].iloc[1:10]
            product_top_sims = data_ibs.loc[product].sort_values(ascending=False).iloc[1:10]
            user_purchases = data_germany.loc[user, product_top_names]
            data_sims.iloc[i, j] = getScore(user_purchases, product_top_sims)

# Get the top songs
data_recommend = pd.DataFrame(index=data_sims.index, columns=['user','1','2','3','4','5','6'])
data_recommend.iloc[:, 0] = data_sims.iloc[:, 0]

# Instead of top song scores, we want to see names
for i in range(len(data_sims.index)):
    # Sort item scores only (column 0 holds the user) and keep the six best names
    row_scores = data_sims.iloc[i, 1:].astype(float)
    data_recommend.iloc[i, 1:] = row_scores.sort_values(ascending=False).index[:6].tolist()

# Print a sample
print(data_recommend.iloc[:10, :4])
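To make the target concrete, here is a hedged sketch of how the double loop above might be expressed as sparse matrix products. It assumes a user x item CSR matrix `R` of 0/1 purchases and an item-item similarity matrix `S` such as the `csc` computed earlier. `keep_top_k`, `score_batch`, and the toy data are hypothetical names introduced for illustration, not an established API, and the result has not been checked against the pandas version on real data.

```python
import numpy as np
from scipy.sparse import csr_matrix

def keep_top_k(S, k):
    """Keep the k largest off-diagonal entries in each row of S."""
    S = csr_matrix(S)
    rows, cols, vals = [], [], []
    for i in range(S.shape[0]):
        lo, hi = S.indptr[i], S.indptr[i + 1]
        idx, dat = S.indices[lo:hi], S.data[lo:hi]
        off = idx != i                      # an item is not its own neighbour
        idx, dat = idx[off], dat[off]
        top = np.argsort(-dat)[:k]          # positions of the k largest values
        rows.extend([i] * len(top))
        cols.extend(idx[top])
        vals.extend(dat[top])
    return csr_matrix((vals, (rows, cols)), shape=S.shape)

def score_batch(R_batch, S_k):
    """Weighted-average score per (user, item), like getScore in the loop."""
    weighted = (R_batch @ S_k.T).toarray()        # sum of history * similarity
    norm = np.asarray(S_k.sum(axis=1)).ravel()    # sum of neighbour similarities
    norm[norm == 0] = 1.0                         # avoid division by zero
    scores = weighted / norm
    scores[R_batch.nonzero()] = 0.0               # already purchased: score 0
    return scores

# Toy data (hypothetical): 4 items, 2 users
S = csr_matrix(np.array([
    [1.0, 0.6, 0.2, 0.0],
    [0.6, 1.0, 0.4, 0.1],
    [0.2, 0.4, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]))
R = csr_matrix(np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]))

S_k = keep_top_k(S, k=2)
scores = score_batch(R, S_k)
top = np.argsort(-scores, axis=1)[:, :2]   # best two item indices per user
print(scores)
```

With `k=9` this would mirror the `[1:10]` neighbour slice in the loop. The dense `scores` array is the expensive part, so with a million users it would presumably have to be computed on row slices such as `R[start:stop]`, keeping only the top indices from each batch.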