Question

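I am currently implementing an algorithm in MATLAB that searches a database of customers who bought certain articles. The database looks like this: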

[ 0   1   2   3   4   5 NaN NaN;
  4   6   7   8 NaN NaN NaN NaN;
...]

Its size is size(data) = [90810 30]. Now I want to find the frequent itemsets in this database (without relying too much on toolboxes). Here is a toy example:

toyset = [
  0,  1,  2,  3,  4,  5,  6,  7,  8,  9;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  1,  6,  7,  9, 10, 11,NaN,NaN,NaN,NaN;
  2,  4,  8, 11, 12,NaN,NaN,NaN,NaN,NaN];

Applying a minimum support of 0.5 [support = (occurrences_of_set)/(all_sets)] yields the following itemsets:

frequent_itemsets = [
  7,NaN,NaN;
  6,NaN,NaN;
  5,NaN,NaN;
  6,  7,NaN;
  5,  7,NaN;
  5,  6,NaN;
  5,  6,  7];
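
To make the support definition concrete, here is a minimal sketch that checks one itemset against the toy data. The variable names (itemset, containsAll) are only illustrative, and it assumes no article appears twice in the same row.

```matlab
toyset = [
  0,  1,  2,  3,  4,  5,  6,  7,  8,  9;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  1,  6,  7,  9, 10, 11,NaN,NaN,NaN,NaN;
  2,  4,  8, 11, 12,NaN,NaN,NaN,NaN,NaN];

itemset = [5 6 7];                        % illustrative itemset to test

% A row supports the itemset when every item occurs somewhere in that row.
% Comparisons with NaN are always false, so the padding is ignored.
containsAll = true(size(toyset, 1), 1);
for j = 1:numel(itemset)
    containsAll = containsAll & any(toyset == itemset(j), 2);
end

support = sum(containsAll) / size(toyset, 1);   % rows 1-3 match: 3/5 = 0.6
isFrequent = support >= 0.5;
```

With a minimum support of 0.5 and five rows, an itemset needs to appear in at least three rows, which is why {5,6,7} qualifies here.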

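My problem now is counting how often each itemset occurs in the data set. Currently I use the following algorithm (which works fine, by the way):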

function list = preprocess(subjectArray, combinations, progressBar)
% =========================================================================
% 
% Creates a list which indicates how often an article-combination given by
% combinations is present in the array of Customers
% 
% =========================================================================
% 
%   preprocesses the array; Finds the frequency of articles
%   subjectArray    - Array that contains customer data
%   combinations    - The article combinations to be found
%   progressBar     - The progress bar to indicate the progress of the 
%                     algorithm 
% 
% =========================================================================

    [countCustomers,maxSizeCustomers] = size(subjectArray);
    [countCombinations,sizeCombinations] = size(combinations);
    list=zeros(1,countCombinations);

    for i = 1:countCustomers
        waitbar(i/countCustomers,progressBar,sprintf('Preprocess: %.0f/%.0f\nSet size:%.0f',i,countCustomers,sizeCombinations));
        for k = 1 : countCombinations
            helpArray = zeros(1,maxSizeCustomers);
            help2Array = zeros(1,sizeCombinations);
            for j = 1:sizeCombinations
                helpArray = helpArray + (subjectArray(i,:) == combinations(k,j));
                help2Array(j) = any(helpArray);
            end
            list(k) = list(k) + all(help2Array);
        end
    end
end

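My only problem is that it takes AGES!!! Literally! Is there an easy way to make this faster (apart from sets of length 1, which I know can be handled faster by simple counting)?

I think that: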

helpArray = helpArray + (subjectArray(i,:) == combinations(k,j));

is the bottleneck? But I am not sure, because I do not know how fast MATLAB is at certain operations.

Thanks for looking into this.

What I ended up doing:

function list = preprocess(subjectArray, combinations)
% =========================================================================
% 
% Creates a list which indicates how often an article-combination given by
% combinations is present in the array of Customers
% 
% =========================================================================
% 
%   preprocesses the array; Finds the frequency of articles
%   subjectArray    - Array that contains customer data
%   combinations    - The article combinations to be found
% 
% =========================================================================

    [countCustomers,maxSizeCustomers] = size(subjectArray);
    [countCombinations,sizeCombinations] = size(combinations);
    list=zeros(1,countCombinations);


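    if sizeCombinations == 1
        % Single items: count directly. Assumes the items are the consecutive
        % integers 0,1,2,... so that item x is counted in list(x+1).
        for i = 1 : countCustomers
            for j = 1 : maxSizeCustomers
                x = subjectArray(i,j);
                if isnan(x), break; end   % rows are NaN-padded at the end
                list(x+1) = list(x+1) + 1;
            end
        end
    else
        for i = 1:countCombinations
            % hits(r,c) is 1 where row r holds any item of combination i
            hits = zeros(size(subjectArray));
            for j = 1:sizeCombinations
                hits = hits + (subjectArray == combinations(i,j));
            end
            % a row contains the whole combination when it has one hit per item
            list(i) = sum(sum(hits,2) == sizeCombinations);
        end
    end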
end

Thanks for all the support!

Answer 1

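Sorry, I can't post a comment (my reputation is too low, I guess). Frequent itemset mining is computationally very expensive. If you have a huge data set and choose a low threshold for an item(set) to count as frequent, then with your approach (Apriori?) you have to be prepared to wait a long time :) In general, you will also run into poor performance whenever you use nested for loops in MATLAB. What threshold did you choose? How big is your data set?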

Answer 2

I see three suggestions right away:

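First, your waitbar is adding roughly three and a half minutes to your search. According to this thread: http://www.mathworks.com/matlabcentral/newsreader/view_thread/261380, including a waitbar costs an extra 550 seconds over 240,000 iterations; scaled down to 90,000 iterations, that is still about three and a half minutes of overhead.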
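
If you would rather keep some progress feedback, one option (not something from the thread above, just a common workaround) is to update the bar only every so often. The sketch below counts how many updates would happen; updateEvery is an arbitrary value I picked, and the actual waitbar call is commented out so the snippet runs without a GUI.

```matlab
countCustomers = 90810;
updateEvery    = 1000;     % assumed step size, tune to taste
nUpdates       = 0;

for i = 1:countCustomers
    % ... per-customer work would go here ...

    % Touch the GUI only every updateEvery iterations and on the last one.
    if mod(i, updateEvery) == 0 || i == countCustomers
        % waitbar(i/countCustomers, progressBar);   % progressBar as in the question
        nUpdates = nUpdates + 1;
    end
end
```

This reduces the number of waitbar calls from 90810 to 91, so whatever each call costs, the total overhead shrinks by about three orders of magnitude.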

To count the initially frequent items, use the sum of a logical index, e.g., to see how often 7 occurs in the data set:

logical7=subjectArray==7;
numOf7s=sum(sum(logical7));

Do this for each value; my feeling is that even with the extra code, it will speed up the initial processing.
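
As a possible shortcut for "do this for each value", the single-item counts can also be collected in one pass with accumarray. This is my own sketch rather than part of the suggestion above; it assumes the items are non-negative integers and that no item appears twice in a row, so occurrences equal the number of rows containing the item.

```matlab
toyset = [
  0,  1,  2,  3,  4,  5,  6,  7,  8,  9;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  1,  6,  7,  9, 10, 11,NaN,NaN,NaN,NaN;
  2,  4,  8, 11, 12,NaN,NaN,NaN,NaN,NaN];

items  = toyset(~isnan(toyset));        % column vector of all purchased items
counts = accumarray(items + 1, 1);      % counts(v+1) = occurrences of item v

support       = counts / size(toyset, 1);
frequentItems = find(support >= 0.5) - 1;   % shift back to 0-based item ids
```

On the toy data this leaves items 5, 6 and 7, matching the single-item rows of frequent_itemsets in the question.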

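To make the code neater, you could do something like the following.

Preallocate a logical matrix in which each 3-D slice represents one item number (the 6th slice marks where item 5 occurs, the 7th slice where item 6 occurs):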

logMat = false([size(subjectArray), maxPossibleVal+1]);  % the largest possible value in the toy example above is 12

Then fill each slice with the corresponding logical matrix:

for i=0:maxPossibleVal
  logMat(:,:,i+1) = subjectArray==i;
end

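Again, you can take the sum of each logical slice, and any slice whose sum falls below the threshold can be removed from logMat (I would use logical indexing here as well to drop the slices that do not meet the threshold).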
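
Put together, the slice-building and pruning steps might look roughly like this on the toy data. Treat it as a sketch: the names rowHasItem, itemCounts and keep are mine, and using a logical array instead of zeros is a memory-saving assumption rather than a requirement.

```matlab
toyset = [
  0,  1,  2,  3,  4,  5,  6,  7,  8,  9;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  1,  6,  7,  9, 10, 11,NaN,NaN,NaN,NaN;
  2,  4,  8, 11, 12,NaN,NaN,NaN,NaN,NaN];

maxPossibleVal = max(toyset(:));                 % max skips NaN by default
logMat = false([size(toyset), maxPossibleVal+1]);
for v = 0:maxPossibleVal
    logMat(:,:,v+1) = (toyset == v);             % slice v+1 marks item v
end

% Number of rows in which each item occurs at least once.
rowHasItem = squeeze(any(logMat, 2));            % nRows x nItems
itemCounts = sum(rowHasItem, 1);

% Keep only the slices that reach the minimum support of 0.5.
keep          = itemCounts >= 0.5 * size(toyset, 1);
frequentItems = find(keep) - 1;                  % 0-based item ids
logMat        = logMat(:,:,keep);
```

Keep frequentItems around after pruning, because slice k of the reduced logMat no longer corresponds to item k-1.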

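The advantage of doing everything with logical indexing is that you can combine slices by addition or multiplication to get the frequencies of different combinations. You can even transpose the result of these operations, apply sum, and then use logical indexing to find how often two or three numbers occur together.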

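logM7 = logMat(:,:,8);
logM8 = logMat(:,:,9);

combo7and8 = logical(double(logM7) + double(logM8));

% You could replace the addition with | to make this simpler/faster.

freq7and8 = sum(sum(combo7and8') == 2)

% By default, sum adds down the columns, so we transpose our rows into
% columns, then find which rows sum to 2, add up all the logical 1s, and
% you have the number of rows of the data set that contain both 7 and 8.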
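
The same idea extends to itemsets of any length if you first collapse the data into a "row contains item" table. The following is a sketch under my own assumptions (items are integers from 0 upward, and the names rowHas and counts are illustrative), not a drop-in replacement for preprocess.

```matlab
toyset = [
  0,  1,  2,  3,  4,  5,  6,  7,  8,  9;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  1,  6,  7,  9, 10, 11,NaN,NaN,NaN,NaN;
  2,  4,  8, 11, 12,NaN,NaN,NaN,NaN,NaN];

frequent_itemsets = [
  7,NaN,NaN;
  6,NaN,NaN;
  5,NaN,NaN;
  6,  7,NaN;
  5,  7,NaN;
  5,  6,NaN;
  5,  6,  7];

% rowHas(r, v+1) is true when row r contains item v. Built once, reused often.
maxItem = max(toyset(:));
rowHas  = false(size(toyset, 1), maxItem + 1);
for v = 0:maxItem
    rowHas(:, v+1) = any(toyset == v, 2);
end

% Each itemset count is then a cheap column lookup plus all/sum.
counts = zeros(size(frequent_itemsets, 1), 1);
for k = 1:size(frequent_itemsets, 1)
    items     = frequent_itemsets(k, ~isnan(frequent_itemsets(k, :)));
    counts(k) = sum(all(rowHas(:, items + 1), 2));
end
```

The expensive comparison against the full data array happens once per item instead of once per itemset, which should matter when there are many candidate combinations.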

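The whole post boils down to two things:

Get rid of the waitbar

Know that logical indexing can be used almost everywhere in your code, and that it is much faster than loops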
