How can I make this lookup routine more efficient?

Asked: 2016-04-25 16:21:25

Tags: performance matlab

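I have two large datasets: search is 340,000 x 1 and field is 348,000 x 2. My goal is to take each element of search, find its position in field(:,1), and then use the corresponding value in field(:,2) to build a new cell array called result.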
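To make the goal concrete, here is a toy-sized sketch of the lookup. The sample keys and values are made up for illustration and are not taken from the real data; the loop is only meant to show what each entry of result should contain, not how to compute it efficiently.

```matlab
% Toy data, made up for illustration only.
search = {'b'; 'a'; 'z'};
field  = {'a', 'apple'; 'b', 'banana'; 'c', 'cherry'};

result = cell(size(search));
for k = 1:numel(search)
    % Logical mask of the rows of field whose first column equals search{k}.
    hit = strcmp(search{k}, field(:,1));
    % Keep the matching second-column entries (an empty cell if no match).
    result{k} = field(hit, 2);
end
```

Each result{k} is itself a cell array, so a search entry with no match in field(:,1) yields an empty cell rather than an error.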

Calling cellfun on the full data directly ran out of memory, so I had to split the dataset into subsets and then compile the results.

I wrote the following routine, but it takes quite a long time: 2 hours and 40 minutes!

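My question is: how can I perform this task more efficiently? Do I need to modify the existing code, or do I need a completely different approach to the problem?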

function result = bigdatacmp(search,field)

%BIGDATACMP(SEARCH,FIELD) takes strcmp jobs that require excessive amounts
%   of memory and splits them up into manageable subsets. The results of
%   the subsets are then compiled to represent the original set.


tic

subsets = floor(size(search,1)/1000);       %Divides search into subsets
difference = size(search,1) - 1000*subsets; %# of elements in last subset

result = cell(0);                           %Establish empty variables

%Loops through all subsets. Finds location of matches in the first column
%of field. Compiles subset locations. Compiles results from second column
%of field.
for i = 1:subsets

    searchvalues = search(1000*i-999:1000*i);

    Zlogic = cellfun(@(x)(strcmp(x,field(:,1))),...
        searchvalues,'UniformOutput',false);

    result(1000*i-999:1000*i) = cellfun(@(x)(field(x,2)),...
        Zlogic,'UniformOutput',false);
end

%Performs same calculations as in loop, but for the final subset.
Zlogic = cellfun(@(x)(strcmp(x,field(:,1))),search(size(search,1)-...
    difference+1:size(search,1)),'UniformOutput',false);

result(end+1:end+difference) = cellfun(@(x)(field(x,2)),Zlogic,...
    'UniformOutput',false);

result = result';

toc
end

1 Answer:

Answer 0 (score: 1)

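348k isn't that big. Consider building a containers.Map object that maps the contents of the first column of field to the corresponding entries in the second column. Then you won't have to do an exhaustive search of field for every entry in search.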
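A minimal sketch of that idea, assuming the keys in field(:,1) are char vectors and each key should map to a single value. The toy data and the variable names lookup and found are made up for illustration:

```matlab
% Toy data, made up for illustration; the real field would be 348,000 x 2.
field  = {'a', 'apple'; 'b', 'banana'; 'c', 'cherry'};
search = {'b'; 'a'; 'z'};

% Build the map once: first column as keys, second column as values.
% Assumption: keys are char vectors. A Map keeps only one value per key.
lookup = containers.Map(field(:,1), field(:,2), 'UniformValues', false);

% Check which search entries exist as keys, then fetch their values in one call.
found  = isKey(lookup, search);
result = cell(size(search));
result(found) = values(lookup, search(found));
```

Unlike the strcmp version, result{k} here holds the mapped value itself rather than a nested cell of all matches, and entries with no match are left as empty. If field(:,1) can contain repeated keys and every match is needed, this sketch would have to be adapted.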

[Edited to add:] If 348k is the total number of entries, I don't think any further splitting is needed.