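My dataset contains about 75000 users and 10000 items, along with whether each user liked a given item (1 means liked, 0 means disliked). For each user, an item can appear multiple times, and the user's response to an item is independent of earlier instances of that item. Here is my code for computing the users' preferences for the different items: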
data=csvread(datafile);
M = data; % Saving the original data in this variable to use later
% Arranging users in descending order of 'activity' (i.e., in order of the number of items liked or disliked by the users)
users = unique(data(:,10)); % The 10th column of the dataset contains the users
users(:,2) = histc(data(:,10),unique(data(:,10)));
users = flipdim(sortrows(users,2),1);
for i=1:size(users,1)
    % Finding the number of times the current user liked or disliked the items
    A=M(M(:,10)==users(i,1),2); % The second column of the dataset contains the items
    catA(:,1)=unique(A);
    catA(:,2)=histc(A,unique(A));
    totalA = sum(catA(:,2));
    catA(:,3)=catA(:,2)/totalA; % Calculating the fraction of items users liked
    allCatA(:,1)=unique(M(:,2));
    allCatA(:,2)=zeros(size(allCatA,1),1);
    % Calculating the current user's item preferences
    for k=1:size(catA,1)
        for l=1:size(allCatA,1)
            if catA(k,1)==allCatA(l,1)
                allCatA(l,2)=catA(k,3);
            end
        end
    end
    Y(:,i)=allCatA(:,2); % Saving the current user's item preferences in a vector
    clear allCatA; clear catA;
end
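The m x n matrix Y contains the preferences of n users for m items, where m=10000 and n=75000 in my case. Sample output showing the first 5 users' preferences for the first 5 items looks like this: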
0.0092308 0 0 0.0098361 0.0068729
0 0 0 0.0065574 0
0 0 0 0 0
0 0 0.016393 0.0065574 0
0.0030769 0 0.016393 0.04918 0
But this has two major problems:
75000x10000 matrix. How can I solve this?
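One direction I am considering, as a rough sketch only: since most entries of Y are zero, it could be built as a sparse matrix in a single pass instead of filling a dense matrix user by user. The toy vectors `userCol` and `itemCol` below are made-up stand-ins for `data(:,10)` and `data(:,2)`, and the variable names are my own. The sketch relies on `sparse` summing the values of repeated (row, column) pairs.

```matlab
% Toy stand-in for the real data (made-up values).
% In the real dataset these would be data(:,10) and data(:,2).
userCol = [7; 7; 7; 3; 3; 9];
itemCol = [20; 20; 50; 50; 60; 20];

% Map raw ids to consecutive indices 1..n (users) and 1..m (items).
[userIds, ~, u]  = unique(userCol);  % u(r)  = column index of row r's user
[itemIds, ~, it] = unique(itemCol);  % it(r) = row index of row r's item
n = numel(userIds);
m = numel(itemIds);

% sparse() accumulates repeated (row, column) pairs, so counts(i,j) is the
% number of times user j appears together with item i.
counts = sparse(it, u, 1, m, n);

% Scale each column by that user's total number of interactions, so every
% column of Y sums to 1. Multiplying by a sparse diagonal keeps Y sparse.
perUserTotal = full(sum(counts, 1));  % 1-by-n
Y = counts * spdiags(1 ./ perUserTotal(:), 0, n, n);
```

Here the columns of Y follow the sorted user ids rather than descending activity. If the original ordering matters, something like `[~, order] = sort(perUserTotal, 'descend'); Y = Y(:, order);` should reorder them, although ties may come out in a different order than with `sortrows` followed by `flipdim`.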