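I have two large data sets: search is 340,000 x 1 and field is 348,000 x 2. My goal is to take each element of search, find its position in field(:,1), and then use the corresponding value from field(:,2) to build a new cell array called result.
Calling cellfun on the full data directly ran out of memory, so I had to split the data set into subsets and then compile the results.
I wrote the following program, but it takes quite a long time: 2 hours and 40 minutes!
My question is, how can I perform this task more efficiently? Do I need to modify the existing code, or do I need a completely different approach to the problem?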
    function result = bigdatacmp(search,field)
    %BIGDATACMP(SEARCH,FIELD) takes strcmp jobs that require excessive amounts
    % of memory and splits them up into manageable subsets. The results of the
    % subsets are then compiled to represent the original set.
    tic
    subsets = floor(size(search,1)/1000);       %Divides search into subsets
    difference = size(search,1) - 1000*subsets; %# of elements in last subset
    result = cell(0);                           %Establish empty variables
    %Loops through all subsets. Finds location of matches in the first column
    %of field. Compiles subset locations. Compiles results from second column
    %of field.
    for i = 1:subsets
        searchvalues = search(1000*i-999:1000*i);
        Zlogic = cellfun(@(x)(strcmp(x,field(:,1))),...
            searchvalues,'UniformOutput',false);
        result(1000*i-999:1000*i) = cellfun(@(x)(field(x,2)),...
            Zlogic,'UniformOutput',false);
    end
    %Performs same calculations as in loop, but for the final subset.
    Zlogic = cellfun(@(x)(strcmp(x,field(:,1))),search(size(search,1)-...
        difference+1:size(search,1)),'UniformOutput',false);
    result(end+1:end+difference) = cellfun(@(x)(field(x,2)),Zlogic,...
        'UniformOutput',false);
    result = result';
    toc
    end
Answer 0 (score: 1)
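Build a containers.Map object that maps the contents of the first column of field to the corresponding entries in the second column. Then you do not need to run an exhaustive search of field for every entry in search.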
[Edited to add:] If 348k is the total number of entries, I don't think any further splitting is needed.
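A minimal sketch of the map-based lookup described above, using small stand-in data in place of the real arrays. It assumes the cells hold char vectors and that the keys in field(:,1) are unique, since a map keeps only one value per key. Leaving unmatched search elements as empty entries is also an assumption, not something the question specifies.

```matlab
% Stand-in data; the real inputs would be the 340,000 x 1 and 348,000 x 2
% cell arrays from the question (assumed here to contain char vectors).
search = {'b'; 'd'; 'a'};
field  = {'a', 'alpha'; 'b', 'beta'; 'c', 'gamma'};

% Build the lookup table once: column 1 supplies the keys, column 2 the
% values. 'UniformValues', false lets the values be of any type.
lookup = containers.Map(field(:,1), field(:,2), 'UniformValues', false);

% One keyed lookup per search element instead of a strcmp over all of field.
% Entries of result that find no match stay as the empty [] from cell()
% (assumption: that is an acceptable "not found" marker).
result = cell(size(search));
for k = 1:numel(search)
    if isKey(lookup, search{k})
        result{k} = lookup(search{k});
    end
end
```

Unlike the original function, where each element of result is itself a cell produced by field(x,2), this sketch stores the matched value directly. If field(:,1) can contain repeated keys and all matches are needed, the map values would have to hold lists of rows instead.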