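I have a large, alphabetically sorted array of strings (~495 thousand entries) with many duplicates (which sit next to each other, because the array is sorted).
For a given lookup string, I need to find all the strings in the list that match the string I pass in.
I have been using strcmp(lookUpString, list) to do this, but it is very slow. I think it compares against every value in the list, because it does not know the list is sorted alphabetically.
I could write a while loop that walks the list and uses strcmp to compare each string until I find the block of strings I want (and then stops), but I was wondering whether there is a "MATLAB" way of doing this (i.e. performing logical comparison operations on a sorted array).
Thanks for your help!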
Answer 0 (score: 4)
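Update: I wasn't happy with my earlier "Method 3", so I reworked it for better performance. It now runs almost 10 times faster than naive strcmp.
strcmp wins on my machine (2011b on Linux Mint 12). In particular, it works much better than ismember. However, you can gain some extra speed by doing a bit of manual pre-sorting yourself. Consider the following speed test: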
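NumIter = 100;
N = 495000;
K = N / 20;
%# Build a sorted list: 20 blocks of K copies each of 'a' .. 't'
List = cell(N, 1);
for i = 1:20
    List(i*K - K + 1:i*K) = cellstr(char(i+96));
end
%# Pick random single-letter lookup strings from 'a' .. 't'
%# (randi avoids the out-of-list char(96) that round(rand * 20) could produce)
StrToFind = cell(NumIter, 1);
for j = 1:NumIter
    StrToFind{j} = char(randi(20) + 96);
end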
%# METHOD 1 (ismember)
tic
for j = 1:NumIter
    Index1 = ismember(List, StrToFind{j});
    Soln1 = List(Index1);
end
toc
%# METHOD 2 (strcmp)
tic
for j = 1:NumIter
    Index2 = strcmp(StrToFind{j}, List);
    Soln2 = List(Index2);
end
toc
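%# METHOD 3 (strcmp WITH MANUAL PRE-SORTING)
tic
for j = 1:NumIter
    CurStrToFind = StrToFind{j};
    K = 100;    %# number of coarse sample points
    %# I1 flags samples below the target, I2 flags samples above it;
    %# column 2 holds each sample's position in List
    I1 = zeros(K, 2); I1(1, :) = ones(1, 2);            %# sentinel: start of List
    I2 = zeros(K, 2); I2(end, 1) = 1; I2(end, 2) = N;   %# sentinel: end of List
    KDiv = floor(N/K);
    for k = 2:K-1
        CurSearchNum = k * KDiv;
        CurListItem = List{CurSearchNum};
        %# < and > compare char codes element-wise, so this relies on the
        %# single-character strings used in this test
        if CurListItem < CurStrToFind; I1(k, 1) = 1; end;
        if CurListItem > CurStrToFind; I2(k, 1) = 1; end;
        I1(k, 2) = CurSearchNum; I2(k, 2) = CurSearchNum;
    end
    a = find(I1(:, 1), 1, 'last');     %# last sample known to be below the target
    b = find(I2(:, 1), 1, 'first');    %# first sample known to be above the target
    ShortList = List(I1(a, 2):I2(b, 2));
    Index3 = strcmp(CurStrToFind, ShortList);   %# exact match on the narrowed range only
    Soln3 = ShortList(Index3);
end
toc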
The output is:
Elapsed time is 6.411537 seconds.
Elapsed time is 1.396239 seconds.
Elapsed time is 0.150143 seconds.
Answer 1 (score: 1)
ismember is your friend. Instead of a linear search, it does a binary search.
Answer 2 (score: 0)
Try binary search.
It is almost 13(!) times faster:
Elapsed time is 7.828260 seconds. % ismember
Elapsed time is 0.775260 seconds. % strcmp
Elapsed time is 0.113533 seconds. % strcmp WITH MANUAL PRE-SORTING
Elapsed time is 0.008243 seconds. % binsearch
Here is the binary search code I am using:
function ind = binSearch(key, cellstr)
% BINSEARCH Binary search for key in a sorted cell array of strings.
%
% * Synopsis: ind = binSearch(key, cellstr)
% * Input   : key = what to search for
%           : cellstr = sorted cell-array of string (others might work, check strlexcmp())
% * Output  : ind = index of an element equal to key, if there is one (with
%             duplicates, not necessarily the first of them); otherwise -(i),
%             where i is the position at which key would have to be inserted
% * Depends : strlexcmp() from Peter John Acklam's string-utilities,
%             at: http://home.online.no/~pjacklam/matlab/software/util/strutil/
%
% Transcoded from a Java version at: http://googleresearch.blogspot.it/2006/06/extra-extra-read-all-about-it-nearly.html
% ankostis, Aug 2013
    low = 1;
    high = numel(cellstr);
    while (low <= high)
        ind = fix((low + high) / 2);
        val = cellstr{ind};
        d = strlexcmp(val, key);
        if (d < 0)
            low = ind + 1;      %% val sorts before key: search the upper half
        elseif (d > 0)
            high = ind - 1;     %% val sorts after key: search the lower half
        else
            return;             %% Key found.
        end
    end
    ind = -(low);               %% Not found!
end
You can get strlexcmp() from Peter John Acklam's string utilities, at:
http://home.online.no/~pjacklam/matlab/software/util/strutil/
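If the external dependency is not at hand, a small comparison function with the same contract (negative, zero, or positive result) can stand in for it. The sketch below is only an illustration: lexCompare is a hypothetical name, it is not Acklam's implementation, and it assumes row character arrays compared purely by character code.

```matlab
function d = lexCompare(a, b)
% LEXCOMPARE  Hypothetical stand-in for strlexcmp (not Acklam's code).
%   Returns -1 if a sorts before b, 0 if they are equal, and 1 if a sorts
%   after b, comparing character codes position by position.
%   Assumption: a and b are row character arrays.
n = min(numel(a), numel(b));
% First position at which the common-length prefixes differ.
k = find(a(1:n) ~= b(1:n), 1, 'first');
if isempty(k)
    % One string is a prefix of the other: the shorter one sorts first.
    d = sign(numel(a) - numel(b));
else
    d = sign(double(a(k)) - double(b(k)));
end
end
```

To try binSearch without the download, the call strlexcmp(val, key) could be swapped for lexCompare(val, key). This presumes the list is ordered by character code, so it is worth checking against how the list was actually sorted.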
Finally, here is the test script I used:
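%# METHOD 4 (WITH BIN-SEARCH)
tic
for j = 1:NumIter
    Index4 = binSearch(StrToFind{j}, List);
    if Index4 > 0    %# binSearch returns a negative value when the key is absent
        Soln4 = List(Index4);
    else
        Soln4 = {};
    end
end
toc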
Note that the results may differ for longer strings.
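One caveat: binSearch returns the index of a single matching element, whereas the question asks for the whole block of duplicates. Because the list is sorted, the edges of that block can themselves be found by bisection using nothing but strcmp. The following is a minimal sketch under that assumption; expandBlock is a hypothetical helper name, not part of the code above.

```matlab
function [first, last] = expandBlock(list, ind)
% EXPANDBLOCK  Hypothetical helper: given a sorted cell array of strings
%   and the index ind of one element, return the first and last index of
%   the run of elements equal to list{ind}.
key = list{ind};

% Left edge: smallest index in [1, ind] whose element equals key.
lo = 1; hi = ind;
while lo < hi
    mid = floor((lo + hi) / 2);
    if strcmp(list{mid}, key)
        hi = mid;        % mid is inside the run: the edge is at mid or to its left
    else
        lo = mid + 1;    % mid lies before the run
    end
end
first = lo;

% Right edge: largest index in [ind, numel(list)] whose element equals key.
lo = ind; hi = numel(list);
while lo < hi
    mid = ceil((lo + hi) / 2);
    if strcmp(list{mid}, key)
        lo = mid;        % mid is inside the run: the edge is at mid or to its right
    else
        hi = mid - 1;    % mid lies past the run
    end
end
last = lo;
end
```

With a positive ind from binSearch, [first, last] = expandBlock(List, ind) followed by List(first:last) would give all matches. Each edge costs on the order of log2(N) string comparisons, so the overall lookup stays logarithmic in the length of the list.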