Question

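My dataset contains about 75,000 users and 10,000 items, along with whether each user liked certain items (1 for like, 0 for dislike). For a given user, an item can appear multiple times, and the user's response to an item is independent of their responses to earlier occurrences of that item. This is my code for computing user preferences for the different items: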

data=csvread(datafile);
M = data; % Saving the original data in this variable to use later

% Arranging users in descending order of 'activity' (i.e., in order of the number of items liked or disliked by the users)
users = unique(data(:,10)); % The 10th column of the dataset contains the users
users(:,2) = histc(data(:,10),unique(data(:,10)));
users = flipdim(sortrows(users,2),1);

for i=1:size(users,1)

    % Finding the number of times the current user liked or disliked the items
    A=M(M(:,10)==users(i,1),2); % The second column of the dataset contains the items
    catA(:,1)=unique(A);
    catA(:,2)=histc(A,unique(A));


    totalA = sum(catA(:,2));
    catA(:,3)=catA(:,2)/totalA; % Calculating the fraction of items users liked

    allCatA(:,1)=unique(M(:,2)); 
    allCatA(:,2)=zeros(size(allCatA,1),1);

    % Calculating the current user's item preferences
    for k=1:size(catA,1)
        for l=1:size(allCatA,1)
            if catA(k,1)==allCatA(l,1)
                allCatA(l,2)=catA(k,3);
            end
        end
    end

    Y(:,i)=allCatA(:,2); % Saving the current user's item preferences in a vector

    clear allCatA; clear catA;
end

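The m x n matrix Y contains the preferences of n users for m items, where m = 10000 and n = 75000 in my case. A sample output showing the preferences of the first 5 users for the first 5 items looks like this: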

0.0092308   0   0           0.0098361   0.0068729
0           0   0           0.0065574   0
0           0   0           0           0
0           0   0.016393    0.0065574   0
0.0030769   0   0.016393    0.04918     0
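
For reference, what the loop stores in Y can be written compactly. The symbol c_{ij} is notation introduced here, not something from the code: it stands for the number of data rows in which item i appears for user j.

```latex
Y_{ij} = \frac{c_{ij}}{\sum_{k=1}^{m} c_{kj}},
\qquad
c_{ij} = \#\{\text{rows of the data with item } i \text{ and user } j\}
```

Each column of Y therefore sums to 1. Note that, as written, the code counts every occurrence of an item for a user and never reads the like/dislike flag.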

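But this has two major problems:

The computation takes a very long time (it took almost a day).
There is not enough memory to store the 75000x10000 matrix.

How can I solve this?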
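
One possible direction, offered as a sketch rather than a tested solution for this dataset: a dense double matrix of 10000 x 75000 entries needs 7.5e8 x 8 bytes, about 6 GB, while a sparse matrix stores only the nonzero entries, and there can be at most one nonzero per row of the data. The example below builds the same per-user fractions with `sparse`, which sums the values of repeated (row, column) pairs, so no loop over users is needed. The toy vectors `items` and `users` are made-up stand-ins for columns 2 and 10 of the CSV, and all variable names are illustrative.

```matlab
% Toy stand-ins for the two relevant columns of the CSV (hypothetical values).
% With the real file these would be data(:,2) (items) and data(:,10) (users).
items = [3; 3; 7; 7; 7; 9; 3];
users = [101; 101; 101; 205; 205; 205; 205];

% Map raw ids to consecutive row/column indices.
[itemIds, ~, itemIdx] = unique(items);
[userIds, ~, userIdx] = unique(users);
nItems = numel(itemIds);
nUsers = numel(userIds);

% sparse() sums the values of repeated (row, column) pairs, so this builds
% the item-by-user count matrix in one call and stores only nonzero entries.
counts = sparse(itemIdx(:), userIdx(:), ones(numel(items), 1), nItems, nUsers);

% Divide each column by that user's total number of rows.
userTotals = full(sum(counts, 1));
Y = counts * spdiags(1 ./ userTotals(:), 0, nUsers, nUsers);

% Optional: order the columns by descending activity, as in the loop version.
% Ties may come out in a different order than with flipdim(sortrows(...)).
[~, order] = sort(userTotals, 'descend');
Y = Y(:, order);
sortedUserIds = userIds(order);
```

Here `Y(i, j)` is the fraction of rows of user `sortedUserIds(j)` that involve item `itemIds(i)`, which matches what the loop computes. Like the loop, this sketch ignores the like/dislike flag. Whether the sparse result fits in memory depends on how many distinct user-item pairs the data actually contains.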

How can I efficiently compute user-item preferences for a large number of users and items?
