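This is a MATLAB coding question (a slightly different version of it, using intersect instead of setdiff, also exists).
I have a rating matrix A with 3 columns: the 1st column holds user IDs, which may be duplicated; the 2nd column holds item IDs, which may be duplicated; the 3rd column holds the rating given by the user to the item, ranging from 1 to 5.
Now I have a subset of user IDs, smallUserIDList, and a subset of item IDs, smallItemIDList. For each user in smallUserIDList, I want to find the rows of A rated by that user, collect the items the user rated, and do some computation on them, such as taking the setdiff with smallItemIDList and counting the result, as the following code shows: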
userStat = zeros(length(smallUserIDList), 1);
for i = 1:length(smallUserIDList)
A2= A(A(:,1) == smallUserIDList(i), :);
itemIDList_each = unique(A2(:,2));
setDiff = setdiff(itemIDList_each , smallItemIDList);
userStat(i) = length(setDiff);
end
userStat
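Finally, the Profile Viewer shows that the loop above is inefficient. The question is: how can this code be improved with vectorization, without the help of a for loop?
For example:
Input: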
A = [
1 11 1
2 22 2
2 66 4
4 44 5
6 66 5
7 11 5
7 77 5
8 11 2
8 22 3
8 44 3
8 66 4
8 77 5
]
smallUserIDList = [1 2 7 8]
smallItemIDList = [11 22 33 55 77]
Output:
userStat =
0
1
0
2
Answer 0 (score: 3)
As far as I understand, your code is equivalent to:
%// Create matrix such that: user_item_rating(user,item)==rating
user_item_rating = sparse(A(:,1),A(:,2),A(:,3));
%// Keep all BUT the items in smallItemIDList
user_item_rating(:,smallItemIDList) = [];
%// Keep only those users in `smallUserIDList` and use order of this list
user_item_rating = user_item_rating(smallUserIDList,:);
%// Count the number of ratings
userStat = sum(user_item_rating~=0, 2);
This works if there is at most one rating per (user,item) combination, and it should also be quite efficient.
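The same count can also be sketched without indexing a matrix by raw IDs. That may matter because sparse needs positive integer IDs, and the column deletion above can fail when smallItemIDList contains an ID larger than the largest item ID in A. The following is only a sketch on the sample data from the question; the variable names pairs, tf and loc are illustrative, and it assumes smallUserIDList contains no repeated IDs.

```matlab
% Sample data from the question
A = [1 11 1; 2 22 2; 2 66 4; 4 44 5; 6 66 5; 7 11 5; ...
     7 77 5; 8 11 2; 8 22 3; 8 44 3; 8 66 4; 8 77 5];
smallUserIDList = [1 2 7 8];
smallItemIDList = [11 22 33 55 77];

% Distinct (user,item) pairs, mirroring unique(A2(:,2)) in the loop
pairs = unique(A(:,1:2), 'rows');
% Drop pairs whose item is in smallItemIDList (the setdiff step)
pairs = pairs(~ismember(pairs(:,2), smallItemIDList), :);
% Position of each remaining pair's user in smallUserIDList (0 if absent).
% ASSUMPTION: smallUserIDList has no repeated IDs.
[tf, loc] = ismember(pairs(:,1), smallUserIDList);
% Count the remaining pairs per listed user
userStat = accumarray(loc(tf), 1, [numel(smallUserIDList) 1]);
```

On the sample input this gives the column [0; 1; 0; 2], matching the expected output in the question.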
Also have a look at grpstats from the Statistics Toolbox!
An implementation might look similar to this:
%// Create ratings table
ratings = array2table(A, 'VariableNames', {'user','item','rating'});
%// Remove items we don't care about (smallItemIDList)
ratings = ratings(~ismember(ratings.item, smallItemIDList),:);
%// Keep only users we care about (smallUserIDList)
ratings = ratings(ismember(ratings.user, smallUserIDList),:);
%// Compute the statistics grouped by 'user'.
userStat = grpstats(ratings, 'user');
Answer 1 (score: 2)
This could be one vectorized approach:
%// Take care of equality between first column of A and smallUserIDList to
%// find the matching row and column indices.
%// NOTE: This corresponds to "A(:,1) == smallUserIDList(i)" from OP.
[R,C] = find(bsxfun(@eq,A(:,1),smallUserIDList(:).')); %//' (:) accepts a row or column list
%// Take care of non-equality between second column of A and smallItemIDList.
%// NOTE: This corresponds to SETDIFF in the original loopy code from OP.
mask1 = ~ismember(A(R,2),smallItemIDList);
AR2 = A(R,2); %// Elements from 2nd col of A that has matches from first step
%// Get only those elements from C and AR2 that has ONES in mask1
C1 = C(mask1);
AR2 = AR2(mask1);
%// Initialized output array
userStat = zeros(numel(smallUserIDList),1);
if ~isempty(C1)%//There is at least one element in C, so do further processing
%// Find the count of duplicate elements for each ID in C1 indexed into AR2.
%// NOTE: This corresponds to "unique(A2(:,2))" from OP.
dup_counts = accumarray(C1,AR2,[],@(x) numel(x)-numel(unique(x)));
%// Get the count of matches for each ID in C in the mask1.
%// NOTE: This corresponds to:
%// "length(setdiff(itemIDList_each , smallItemIDList))" from OP.
accums = accumarray(C,mask1);
%// Store the counts in output array and also subtract the dup counts
userStat(1:numel(accums)) = accums;
userStat(1:numel(dup_counts)) = userStat(1:numel(dup_counts)) - dup_counts;
end
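For readers less familiar with accumarray, here is a toy illustration of the two ways it is used above: plain counting per group, and a custom function applied per group. The inputs subs and vals are made up for illustration and are not taken from the answer's data.

```matlab
% Toy inputs (illustrative only)
subs = [1; 2; 2; 4];   % group index of each element
vals = [66; 44; 44; 55];

% Number of elements per group; group 3 has no entries and gets 0.
% The result has max(subs) rows, which is why the code above writes
% into userStat(1:numel(accums)) instead of assigning directly.
counts = accumarray(subs, 1);

% Per-group count of repeated values, as in dup_counts above
dups = accumarray(subs, vals, [], @(x) numel(x) - numel(unique(x)));
```

Here counts is [1; 2; 0; 1] and dups is [0; 1; 0; 0], since only group 2 contains a repeated value.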
The code listed below compares the runtime of the proposed approach against the original loopy code:
%// Size parameters and random inputs with them
A_nrows = 5000;
IDlist_len = 5000;
max_userID = 1000;
max_itemID = 1000;
A = [randi(max_userID,A_nrows,1) randi(max_itemID,A_nrows,1) randi(5,A_nrows,1)];
smallUserIDList = randi(max_userID,IDlist_len,1);
smallItemIDList = randi(max_itemID,IDlist_len,1);
disp('---------------------------- With Original Approach')
tic
%// Original posted code
toc
disp('---------------------------- With Proposed Approach')
tic
%// Proposed approach code
toc
The runtimes obtained with three sets of data are:
Case #1:
A_nrows = 500;
IDlist_len = 500;
max_userID = 100;
max_itemID = 100;
---------------------------- With Original Approach
Elapsed time is 0.136630 seconds.
---------------------------- With Proposed Approach
Elapsed time is 0.004163 seconds.
Case #2:
A_nrows = 5000;
IDlist_len = 5000;
max_userID = 100;
max_itemID = 100;
---------------------------- With Original Approach
Elapsed time is 1.579468 seconds.
---------------------------- With Proposed Approach
Elapsed time is 0.050498 seconds.
Case #3:
A_nrows = 5000;
IDlist_len = 5000;
max_userID = 1000;
max_itemID = 1000;
---------------------------- With Original Approach
Elapsed time is 1.252294 seconds.
---------------------------- With Proposed Approach
Elapsed time is 0.044198 seconds.
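Conclusion: across these three cases the proposed approach ran roughly 30 times faster than the original loopy code, so the speedup appears to be substantial.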
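Answer 2 (score: 1)
I think you are trying to remove a fixed set of ratings for a subset of users and count the number of remaining ratings.
Does the following work: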
Asub = A(ismember(A(:,1), smallUserIDList),1:2);
Bremove = allcomb(smallUserIDList, smallItemIDList);
Akeep = setdiff(Asub, Bremove, 'rows');
T = varfun(@sum, array2table(Akeep), 'InputVariables', 'Akeep2', 'GroupingVariables', 'Akeep1');
% userStat = T.GroupCount;
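You need the allcomb function from the MATLAB Central File Exchange, which gives the Cartesian product of two vectors and is easy to implement yourself anyway.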
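As a rough idea of what such an implementation could look like for the two-vector case, here is a minimal sketch built on ndgrid. It is a stand-in, not the File Exchange allcomb itself, and the rows may come out in a different order than allcomb produces; that does not matter for setdiff with 'rows'.

```matlab
% Lists from the question
smallUserIDList = [1 2 7 8];
smallItemIDList = [11 22 33 55 77];

% Every (user,item) combination, one per row
[U, V] = ndgrid(smallUserIDList, smallItemIDList);
Bremove = [U(:) V(:)];
```

Bremove then has numel(smallUserIDList) * numel(smallItemIDList) rows, here 20, and can be passed to setdiff(Asub, Bremove, 'rows') as above.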