Question

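I have two large data sets: search is 340,000 x 1 and field is 348,000 x 2. My goal is to take each element of search, find its position in field(:,1), and then use the corresponding value in field(:,2) to build a new cell array called result.

Using cellfun directly ran out of memory, so I had to split the data set into subsets and then compile the results.

I wrote the following program, but it takes quite a long time: 2 hours and 40 minutes!

My question is: how can I perform this task more efficiently? Should I modify the existing code, or do I need a completely different approach to the problem?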

function result = bigdatacmp(search,field)

%BIGDATACMP(SEARCH,FIELD) takes strcmp jobs that require excessive amounts
%   of memory and splits them up into manageable subsets. The results of
%   the subsets are then compiled to represent the original set.


tic

subsets = floor(size(search,1)/1000);       %Divides search into subsets
difference = size(search,1) - 1000*subsets; %# of elements in last subset

result = cell(0);                           %Establish empty variables

%Loops through all subsets. Finds location of matches in the first column
%of field. Compiles subset locations. Compiles results from second column
%of field.
for i = 1:subsets

    searchvalues = search(1000*i-999:1000*i);

    Zlogic = cellfun(@(x)(strcmp(x,field(:,1))),...
        search(1000*i-999:1000*i),'UniformOutput',false);

    result(1000*i-999:1000*i) = cellfun(@(x)(field(x,2)),...
        Zlogic,'UniformOutput',false);
end

%Performs same calculations as in loop, but for the final subset.
Zlogic = cellfun(@(x)(strcmp(x,field(:,1))),search(size(search,1)-...
    difference+1:size(search,1)),'UniformOutput',false);

result(end+1:end+difference) = cellfun(@(x)(field(x,2)),Zlogic,...
    'UniformOutput',false);

result = result';

toc
end

Answer 1

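348k isn't that big. Consider building a containers.Map object that maps the entries in the first column of field to the corresponding entries in the second column. Then you don't need to perform an exhaustive search of field for every entry in search.

[Edited to add:] If 348k is the total number of entries, I don't think any further splitting is needed.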
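
A minimal sketch of the containers.Map idea, using small made-up data in place of the real arrays (the fruit names and colors are purely illustrative). It assumes the keys in field(:,1) are unique character vectors; a Map stores only one value per key, so unlike the strcmp version it cannot return several matches for one search term. Unmatched search terms are left as empty cells here, and each result is the value itself rather than a nested cell.

```matlab
% Hypothetical toy data standing in for the real 348,000 x 2 cell array.
field  = {'apple',  'red';
          'banana', 'yellow';
          'cherry', 'dark red'};
search = {'banana'; 'apple'; 'durian'};

% Build the lookup table once: keys from column 1, values from column 2.
% Assumption: keys are unique (a Map keeps a single value per key).
lookup = containers.Map(field(:,1), field(:,2));

% Find which search terms exist as keys, then fetch their values in one call.
found  = isKey(lookup, search);
result = cell(size(search));               % unmatched entries stay empty
result(found) = values(lookup, search(found));
```

Each lookup is a hash-table access instead of a strcmp pass over all rows of field, so the work grows roughly with numel(search) rather than with numel(search) times the number of rows in field.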
