更快的查找排序向量版本(MATLAB)

时间:2013-11-23 19:29:10

标签: algorithm matlab sorting optimization

我在MATLAB中有以下类型的代码:

indices = find([1 2 2 3 3 3 4 5 6 7 7] == 3)

返回4,5,6 - 数组中元素的索引等于3.现在。我的代码用很长的向量做这种事情。向量始终排序

因此,我想要一个用O(log n)替换find的O(n)复杂度的函数,代价是必须对数组进行排序。

我知道ismember,但据我所知,它不会返回所有项目的索引,只返回最后一项(我需要所有项目)。

出于便携性的原因,我需要解决方案仅限MATLAB(没有编译的mex文件等)。

5 个答案:

答案 0 :(得分:23)

这是使用二进制搜索的快速实现。该文件也可在github

上找到
function [b,c]=findInSorted(x,range)
%findInSorted fast binary search replacement for ismember(A,B) for the
%special case where the first input argument is sorted.
%   
%   [a,b] = findInSorted(x,s) returns the range which is equal to s. 
%   r=a:b and r=find(x == s) produce the same result   
%  
%   [a,b] = findInSorted(x,[from,to]) returns the range which is between from and to
%   r=a:b and r=find(x >= from & x <= to) return the same result
%
%   For any sorted list x you can replace
%   [lia] = ismember(x,from:to)
%   with
%   [a,b] = findInSorted(x,[from,to])
%   lia=a:b
%
%   Examples:
%
%       x  = 1:99
%       s  = 42
%       r1 = find(x == s)
%       [a,b] = myFind(x,s)
%       r2 = a:b
%       %r1 and r2 are equal
%
%   See also FIND, ISMEMBER.
%
% Author Daniel Roeske <danielroeske.de>

A=range(1);
B=range(end);
a=1;
b=numel(x);
c=1;
d=numel(x);
if A<=x(1)
   b=a;
end
if B>=x(end)
    c=d;
end
while (a+1<b)
    lw=(floor((a+b)/2));
    if (x(lw)<A)
        a=lw;
    else
        b=lw;
    end
end
while (c+1<d)
    lw=(floor((c+d)/2));
    if (x(lw)<=B)
        c=lw;
    else
        d=lw;
    end
end
end

答案 1 :(得分:15)

丹尼尔的方法很聪明,他的myFind2函数肯定很快,但是在边界条件附近或者在上限和下限产生传入集合之外的范围的情况下会出现错误/错误。

此外,正如他在回答他的评论时指出的那样,他的实施有一些效率低下的问题可以改进。我实现了他的代码的改进版本,它运行得更快,同时也正确处理边界条件。此外,此代码包含更多注释以解释正在发生的事情。我希望这能帮助别人丹尼尔的代码帮助我的方式!

function [lower_index,upper_index] = myFindDrGar(x,LowerBound,UpperBound)
% fast O(log2(N)) computation of the range of indices of x that satify the
% upper and lower bound values using the fact that the x vector is sorted
% from low to high values. Computation is done via a binary search.
%
% Input:
%
% x-            A vector of sorted values from low to high.       
%
% LowerBound-   Lower boundary on the values of x in the search
%
% UpperBound-   Upper boundary on the values of x in the search
%
% Output:
%
% lower_index-  The smallest index such that
%               LowerBound<=x(index)<=UpperBound
%
% upper_index-  The largest index such that
%               LowerBound<=x(index)<=UpperBound

if LowerBound>x(end) || UpperBound<x(1) || UpperBound<LowerBound
    % no indices satify bounding conditions
    lower_index = [];
    upper_index = [];
    return;
end

lower_index_a=1;
lower_index_b=length(x); % x(lower_index_b) will always satisfy lowerbound
upper_index_a=1;         % x(upper_index_a) will always satisfy upperbound
upper_index_b=length(x);

%
% The following loop increases _a and decreases _b until they differ 
% by at most 1. Because one of these index variables always satisfies the 
% appropriate bound, this means the loop will terminate with either 
% lower_index_a or lower_index_b having the minimum possible index that 
% satifies the lower bound, and either upper_index_a or upper_index_b 
% having the largest possible index that satisfies the upper bound. 
%
while (lower_index_a+1<lower_index_b) || (upper_index_a+1<upper_index_b)

    lw=floor((lower_index_a+lower_index_b)/2); % split the upper index

    if x(lw) >= LowerBound
        lower_index_b=lw; % decrease lower_index_b (whose x value remains \geq to lower bound)   
    else
        lower_index_a=lw; % increase lower_index_a (whose x value remains less than lower bound)
        if (lw>upper_index_a) && (lw<upper_index_b)
            upper_index_a=lw;% increase upper_index_a (whose x value remains less than lower bound and thus upper bound)
        end
    end

    up=ceil((upper_index_a+upper_index_b)/2);% split the lower index
    if x(up) <= UpperBound
        upper_index_a=up; % increase upper_index_a (whose x value remains \leq to upper bound) 
    else
        upper_index_b=up; % decrease upper_index_b
        if (up<lower_index_b) && (up>lower_index_a)
            lower_index_b=up;%decrease lower_index_b (whose x value remains greater than upper bound and thus lower bound)
        end
    end
end

if x(lower_index_a)>=LowerBound
    lower_index = lower_index_b;
else
    lower_index = lower_index_a;
end
if x(upper_index_b)<=UpperBound
    upper_index = upper_index_a;
else
    upper_index = upper_index_b;
end

请注意,Daniels searchFor功能的改进版现在只是:

function [lower_index,upper_index] = mySearchForDrGar(x,value)

[lower_index,upper_index] = myFindDrGar(x,value,value);

答案 2 :(得分:11)

如果查看第一个输出,

ismember将为您提供所有索引:

>> x = [1 2 2 3 3 3 4 5 6 7 7];
>> [tf,loc]=ismember(x,3);
>> inds = find(tf)

inds =

 4     5     6

您只需要使用正确的输入顺序。

请注意ismember使用的辅助函数可以直接调用:

% ISMEMBC  - S must be sorted - Returns logical vector indicating which 
% elements of A occur in S

tf = ismembc(x,3);
inds = find(tf);

使用ismembc可以节省自ismember首先调用issorted以来的计算时间,但这将省略检查。

请注意,较新版本的matlab具有builtin('_ismemberoneoutput',a,b)调用的内置函数,具有相同的功能。


由于ismember等的上述应用程序有些倒退(在第二个参数中搜索x的每个元素而不是相反),因此代码比必要慢得多。正如OP指出的那样,遗憾的是[~,loc]=ismember(3,x)仅提供x中第一次出现3的位置,而不是全部。但是,如果您有最新版本的MATLAB(我认为是R2012b +),您可以使用更多未记录的内置函数来获得第一个最后一个索引!这些是ismembc2builtin('_ismemberfirst',searchfor,x)

firstInd = builtin('_ismemberfirst',searchfor,x);  % find first occurrence
lastInd = ismembc2(searchfor,x);                   % find last occurrence
% lastInd = ismembc2(searchfor,x(firstInd:end))+firstInd-1; % slower
inds = firstInd:lastInd;

仍然比Daniel R.伟大的MATLAB代码慢,但它只是(rntmX添加到randomatlabuser的基准测试中)只是为了好玩:

mean([rntm1 rntm2 rntm3 rntmX])    
ans =
   0.559204323050486   0.263756852283128   0.000017989974213   0.000153682125682

以下是ismember.m中的这些功能的文档:

% ISMEMBC2 - S must be sorted - Returns a vector of the locations of
% the elements of A occurring in S.  If multiple instances occur,
% the last occurrence is returned

% ISMEMBERFIRST(A,B) - B must be sorted - Returns a vector of the
% locations of the elements of A occurring in B.  If multiple
% instances occur, the first occurence is returned.

实际上有一个ISMEMBERLAST内置引用,但它似乎不存在(但是?)。

答案 3 :(得分:6)

这不是答案 - 我只是比较chappjc和Daniel R建议的三种解决方案的运行时间。

N = 5e7;    % length of vector
p = 0.99;    % probability
KK = 100;    % number of instances
rntm1 = zeros(KK, 1);    % runtime with ismember
rntm2 = zeros(KK, 1);    % runtime with ismembc
rntm3 = zeros(KK, 1);    % runtime with Daniel's function
for kk = 1:KK
    x = cumsum(rand(N, 1) > p);
    searchfor = x(ceil(4*N/5));

    tic
    [tf,loc]=ismember(x, searchfor);
    inds1 = find(tf);
    rntm1(kk) = toc;

    tic
    tf = ismembc(x, searchfor);
    inds2 = find(tf);
    rntm2(kk) = toc;

    tic
    a=1;
    b=numel(x);
    c=1;
    d=numel(x);
    while (a+1<b||c+1<d)
        lw=(floor((a+b)/2));
        if (x(lw)<searchfor)
            a=lw;
        else
            b=lw;
        end
        lw=(floor((c+d)/2));
        if (x(lw)<=searchfor)
            c=lw;
        else
            d=lw;
        end
    end
    inds3 = (b:c)';
    rntm3(kk) = toc;

end

Daniel的二进制搜索非常快。

% Mean of running time
mean([rntm1 rntm2 rntm3])
% 0.631132275892504   0.295233981447746   0.000400786666188

% Percentiles of running time
prctile([rntm1 rntm2 rntm3], [0 25 50 75 100])
% 0.410663611685559   0.175298784336465   0.000012828868032
% 0.429120717937665   0.185935198821797   0.000014539383770
% 0.582281366154709   0.268931132925888   0.000019243302048
% 0.775917520641649   0.385297304740352   0.000026940622867
% 1.063753914942895   0.592429428396956   0.037773746662356

答案 4 :(得分:0)

我需要这样的功能。感谢@Daniel的帖子!

我用了一点工作因为我需要在同一个数组中找到几个索引。我想避免arrayfun(或类似)的开销或多次调用该函数。所以你可以在range中传递一堆值,你将得到数组中的索引。

function idx = findInSorted(x,range)
% Author Dídac Rodríguez Arbonès (May 2018)
% Based on Daniel Roeske's solution:
%   Daniel Roeske <danielroeske.de>
%   https://github.com/danielroeske/danielsmatlabtools/blob/master/matlab/data/findinsorted.m

    range = sort(range);
    idx = nan(size(range));
    for i=1:numel(range)
        idx(i) = aux(x, range(i));
    end
end

function b = aux(x, lim)
    a=1;
    b=numel(x);
    if lim<=x(1)
       b=a;
    end
    if lim>=x(end)
       a=b;
    end

    while (a+1<b)
        lw=(floor((a+b)/2));
        if (x(lw)<lim)
            a=lw;
        else
            b=lw;
        end
    end
end

我猜您可以使用parforarrayfun代替。尽管如此,我还没有测试自己的range大小。

另一种可能的改进是使用先前找到的索引(如果排序range)来减少搜索空间。由于O(log n)运行时,我对它节省CPU的潜力持怀疑态度。

最终功能最终运行得稍微快一些。我使用了@randomatlabuser的框架:

N = 5e6;    % length of vector
p = 0.99;    % probability
KK = 100;    % number of instances
rntm1 = zeros(KK, 1);    % runtime with ismember
rntm2 = zeros(KK, 1);    % runtime with ismembc
rntm3 = zeros(KK, 1);    % runtime with Daniel's function
for kk = 1:KK
    x = cumsum(rand(N, 1) > p);
    searchfor = x(ceil(4*N/5));

    tic
    range = sort(searchfor);
    idx = nan(size(range));
    for i=1:numel(range)
        idx(i) = aux(x, range(i));
    end

    rntm1(kk) = toc;

    tic
    a=1;
    b=numel(x);
    c=1;
    d=numel(x);
    while (a+1<b||c+1<d)
        lw=(floor((a+b)/2));
        if (x(lw)<searchfor)
            a=lw;
        else
            b=lw;
        end
        lw=(floor((c+d)/2));
        if (x(lw)<=searchfor)
            c=lw;
        else
            d=lw;
        end
    end
    inds3 = (b:c)';
    rntm2(kk) = toc;

end

%%

function b = aux(x, lim)

a=1;
b=numel(x);
if lim<=x(1)
   b=a;
end
if lim>=x(end)
   a=b;
end

while (a+1<b)
    lw=(floor((a+b)/2));
    if (x(lw)<lim)
        a=lw;
    else
        b=lw;
    end
end

end

这不是一个很大的改进,但它有帮助,因为我需要进行数千次搜索。

% Mean of running time
mean([rntm1 rntm2])
% 9.9624e-05  5.6303e-05

% Percentiles of running time
prctile([rntm1 rntm2], [0 25 50 75 100])
% 3.0435e-05  1.0524e-05
% 3.4133e-05  1.2231e-05
% 3.7262e-05  1.3369e-05
% 3.9111e-05  1.4507e-05
%  0.0027426   0.0020301

我希望这可以帮助别人。

修改

如果确实存在完全匹配的可能性很大,那么在调用函数之前使用非常快速的内置ismember是值得的:

[found, idx] = ismember(range, x);
idx(~found) = arrayfun(@(r) aux(x, r), range(~found));