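Find the longest sequence of non-NaN values but allow for a threshold

Date: 2015-09-08 11:40:42

Tags: matlab

Is it possible to find the non-NaN values of a vector while also allowing n NaNs? For example, if I have the following data: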

X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24]; %// input array
thres = 1; % this is the number of nans to allow

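I would like to keep only the longest sequence of non-NaN values, while allowing up to 'n' NaNs to remain in the data. So, say I am willing to keep 1 NaN, I would get an output of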
X_out = [8 10 11 nan 9 14 6 1 4 23 24]; %// output array

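That is, the two NaNs at the beginning have been removed because together they exceed the value in 'thres', but the third NaN is on its own and can therefore be kept in the data. I would like to develop a method where thres can be set to any value.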
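As a small illustration of the acceptance rule described above (a sketch only; the helper name isAcceptable is hypothetical and not part of the question), a candidate window X(a:b) qualifies when it contains at most thres NaNs:

```matlab
X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];
thres = 1;

% Hypothetical helper: true when X(a:b) holds at most thres NaNs.
isAcceptable = @(a, b) nnz(isnan(X(a:b))) <= thres;

okLong  = isAcceptable(5, 15);   % one NaN (position 8), so acceptable
tooMany = isAcceptable(4, 15);   % two NaNs (positions 4 and 8), so rejected
```

The task is then to find the longest window for which this check holds.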

I can find the non-NaN values with
Y = ~isnan(X); %// convert to zeros and ones

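Any ideas?

4 Answers:

Answer 0 (score: 8)

To find the longest sequence containing at most threshold NaNs, we must find the start and the end of that sequence.

To generate all possible starting points, we can use hankel: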

H = hankel(X)

H =

    18     3   NaN   NaN     8    10    11   NaN     9    14     6     1     4    23    24
     3   NaN   NaN     8    10    11   NaN     9    14     6     1     4    23    24     0
   NaN   NaN     8    10    11   NaN     9    14     6     1     4    23    24     0     0
   NaN     8    10    11   NaN     9    14     6     1     4    23    24     0     0     0
     8    10    11   NaN     9    14     6     1     4    23    24     0     0     0     0
    10    11   NaN     9    14     6     1     4    23    24     0     0     0     0     0
    11   NaN     9    14     6     1     4    23    24     0     0     0     0     0     0
   NaN     9    14     6     1     4    23    24     0     0     0     0     0     0     0
     9    14     6     1     4    23    24     0     0     0     0     0     0     0     0
    14     6     1     4    23    24     0     0     0     0     0     0     0     0     0
     6     1     4    23    24     0     0     0     0     0     0     0     0     0     0
     1     4    23    24     0     0     0     0     0     0     0     0     0     0     0
     4    23    24     0     0     0     0     0     0     0     0     0     0     0     0
    23    24     0     0     0     0     0     0     0     0     0     0     0     0     0
    24     0     0     0     0     0     0     0     0     0     0     0     0     0     0 
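To make the role of hankel easier to see, here is a minimal sketch on a shorter vector (the vector x and the variable names are made up for illustration). Row k of the result holds x(k:end), followed by zeros that hankel adds as padding:

```matlab
x = [1 2 NaN 4];
H = hankel(x);
% Layout of H (zeros fill the lower-right triangle):
%    1     2   NaN     4
%    2   NaN     4     0
%  NaN     4     0     0
%    4     0     0     0

k = 2;
tail = x(k:end);                        % [2 NaN 4]
rowPrefix  = H(k, 1:numel(tail));       % same values as tail
rowPadding = H(k, numel(tail)+1:end);   % zeros added by hankel
```

The padding zeros are not NaN, which is why a later step has to cap each row at its real length.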

Now we need to find the last valid element in each row. To do so, we can use cumsum:

C = cumsum(isnan(H),2)

C =

     0     0     1     2     2     2     2     3     3     3     3     3     3     3     3
     0     1     2     2     2     2     3     3     3     3     3     3     3     3     3
     1     2     2     2     2     3     3     3     3     3     3     3     3     3     3
     1     1     1     1     2     2     2     2     2     2     2     2     2     2     2
     0     0     0     1     1     1     1     1     1     1     1     1     1     1     1
     0     0     1     1     1     1     1     1     1     1     1     1     1     1     1
     0     1     1     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0

The end point of each row is the last position at which the corresponding element of C is at most threshold:

threshold = 1;

T = C<=threshold

T =

 1     1     1     0     0     0     0     0     0     0     0     0     0     0     0
 1     1     0     0     0     0     0     0     0     0     0     0     0     0     0
 1     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 1     1     1     1     0     0     0     0     0     0     0     0     0     0     0
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1

Find the last valid element using:

[~,idx]=sort(T,2);
lastone=idx(:,end)'

lastone =

 3     2     1     4    15    15    15    15    15    15    15    15    15    15    15
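A possible alternative to the sort step, offered as a sketch rather than as the answer's method: because cumsum never decreases along a row, each row of T is a block of ones followed by zeros, so counting the ones gives the index of the last valid element. The names lastoneSort, lastoneSum, T0 and idx0 are introduced here for illustration only.

```matlab
X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];
threshold = 1;
C = cumsum(isnan(hankel(X)), 2);
T = C <= threshold;

[~, idx] = sort(T, 2);
lastoneSort = idx(:, end)';   % approach used in the answer
lastoneSum  = sum(T, 2)';     % number of leading ones in each row

% Edge case: threshold 0 and a row that starts with NaN, so the row of T
% contains no ones at all.
T0 = cumsum(isnan(hankel([NaN 1 2])), 2) <= 0;
[~, idx0] = sort(T0, 2);
% idx0(1,end) is 3 even though row 1 has no valid element,
% whereas sum(T0(1,:)) is 0.
```

For the example data both variants agree. In the edge case, as far as I can tell, the count-based variant reports a length of 0 for the all-zero row, while the sort-based variant reports the last column.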

We must make sure that the actual length of each row is respected:

lengths = length(X):-1:1;
real_length = min(lastone,lengths);
[max_length,max_idx] = max(real_length)


max_length =

     11


max_idx =

     5

If there are several sequences with the same maximum length, we simply take the first one and display it:

selected_max_idx = max_idx(1);
H(selected_max_idx, 1:max_length)


ans =

 8    10    11   NaN     9    14     6     1     4    23    24

Complete script:

X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];

H = hankel(X);
C = cumsum(isnan(H),2);

threshold = 1;

T = C<=threshold;
[~,idx]=sort(T,2);
lastone=idx(:,end)';

lengths = length(X):-1:1;
real_length = min(lastone,lengths);
[max_length,max_idx] = max(real_length);

selected_max_idx = max_idx(1);
H(selected_max_idx, 1:max_length)
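For long vectors the N-by-N Hankel matrix can become large. The following is a sketch of a different technique, a two-pointer sliding window, which is not taken from this answer; it assumes the same X and threshold, and all other variable names are made up here. It keeps a window [left, right] that never contains more than threshold NaNs and remembers the longest one seen.

```matlab
X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];
threshold = 1;

isBad = isnan(X);
bestStart = 1;
bestLen = 0;
left = 1;
nanCount = 0;
for right = 1:numel(X)
    nanCount = nanCount + isBad(right);     % extend the window to the right
    while nanCount > threshold              % shrink from the left if needed
        nanCount = nanCount - isBad(left);
        left = left + 1;
    end
    if right - left + 1 > bestLen           % strict ">" keeps the first maximum
        bestLen = right - left + 1;
        bestStart = left;
    end
end
X_out = X(bestStart:bestStart+bestLen-1);
```

Each element enters and leaves the window at most once, so the loop does work proportional to numel(X) and needs no extra matrix.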

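Answer 1 (score: 5)

Approach 1: convolution

One possible approach is to convolve Y = double(~isnan(X)); with a window of n ones, where n is decreased until an acceptable subsequence is found. "Acceptable" means that the subsequence contains at least n-thres ones, that is, the convolution gives at least n-thres.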

Y = double(~isnan(X));
for n = numel(Y):-1:1 %// try all possible sequence lengths
    w = find(conv(Y,ones(1,n),'valid')>=n-thres); %// is there any acceptable subsequence?
    if ~isempty(w)
        break
    end
end
result = X(w:w+n-1);
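To illustrate what a single iteration of this loop computes, here is a small sketch with a fixed window length (the vector Y and the names counts and starts are chosen for this example only). With the 'valid' option, entry k of the convolution is the number of ones in Y(k:k+n-1):

```matlab
Y = [1 1 0 0 1 1 1 0 1];                  % 1 = value present, 0 = NaN
n = 4;
thres = 1;
counts = conv(Y, ones(1, n), 'valid');    % counts(k) = sum(Y(k:k+n-1))
starts = find(counts >= n - thres);       % windows with at most thres NaNs
% counts is [2 2 2 3 3 3], so starts is [4 5 6]
```

Any entry of starts is a valid starting index for a window of length n.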

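Approach 2: cumulative sum

Convolving Y with a window of n ones (as in approach 1) is equivalent to computing the cumulative sum of Y and then taking differences at a spacing of n. This is more efficient in terms of the number of operations.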

Y = double(~isnan(X));
Z = cumsum(Y);
for n = numel(Y):-1:1
    w = find([Z(n) Z(n+1:end)-Z(1:end-n)]>=n-thres);
    if ~isempty(w)
        break
    end
end
result = X(w:w+n-1);
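The claimed equivalence can be checked numerically. This sketch fixes one window length and compares the two ways of counting; the names viaCumsum and viaConv are introduced here for illustration:

```matlab
X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];
Y = double(~isnan(X));
Z = cumsum(Y);
n = 5;

viaCumsum = [Z(n) Z(n+1:end)-Z(1:end-n)];   % differences of the running sum
viaConv   = conv(Y, ones(1, n), 'valid');   % windowed sum from approach 1
maxDiff   = max(abs(viaCumsum - viaConv));  % expected to be zero
```

The first entry Z(n) covers the window starting at index 1, for which there is no earlier cumulative value to subtract.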

Approach 3: 2D convolution

This essentially computes all iterations of the loop in approach 1 at once.

Y = double(~isnan(X));
z = conv2(Y, tril(ones(numel(Y))));
[nn, ww] = find(bsxfun(@ge, z, (1:numel(Y)).'-thres)); %'
[n, ind] = max(nn);
w = ww(ind)-n+1;
result = X(w:w+n-1);
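A sketch of how the conv2 result can be read, under my own reading of the code above: for n <= j <= N, z(n,j) is the number of non-NaN values in X(j-n+1:j), while columns with j < n or j > N describe windows that extend past either end of X. The mask below is an addition of mine, not part of the answer; it restricts the search to windows lying fully inside X, which seems safer for short inputs such as X = [1 2 3] with thres = 1.

```matlab
X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];
thres = 1;
N = numel(X);
Y = double(~isnan(X));
z = conv2(Y, tril(ones(N)));             % N-by-(2N-1)

% Spot check of the interpretation: window of length 4 ending at index 7.
n0 = 4; j0 = 7;
countDirect = sum(Y(j0-n0+1:j0));        % same value as z(n0,j0)

% J = end position of the window, L = window length.
[J, L] = meshgrid(1:2*N-1, 1:N);
ok = (J >= L) & (J <= N) & (z >= L - thres);
[nn, ww] = find(ok);
[n, ind] = max(nn);                      % longest acceptable length
w = ww(ind) - n + 1;                     % its starting index
result = X(w:w+n-1);
```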

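Answer 2 (score: 1)

Let's try my favorite tool: RLE. Matlab does not have a direct function for it, so use my "seqle", posted on the File Exchange. Seqle's default is to return the run-length encoding. So: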

>> foo = [ nan 1 2 3 nan nan 4 5 6 nan 5 5 5 ];

>> seqle(isnan(foo))
ans = 
    run: [1 3 2 3 1 3]
    val: [1 0 1 0 1 0]

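"run" indicates the length of the current run; "val" indicates the value. In this case, val==1 means the value is nan and val==0 means a numeric value. You can see that it is relatively easy to extract the longest sequence of "run" values that satisfies the condition val==0 | run < 2, i.e. no more than one nan in a row. Then just grab the cumulative indices of that run, and that is the subset of foo you want.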
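Since seqle is a third-party function whose code is not shown here, the following is only a stand-in sketch of run-length encoding with diff and find. It is not seqle's implementation, and the names breaks, runLen and runVal are made up for this example:

```matlab
foo = [nan 1 2 3 nan nan 4 5 6 nan 5 5 5];
v = double(isnan(foo));                     % 1 = NaN, 0 = numeric value

breaks = [find(diff(v) ~= 0), numel(v)];    % last index of each run
runLen = diff([0, breaks]);                 % length of each run
runVal = v(breaks);                         % value held by each run
% runLen is [1 3 2 3 1 3] and runVal is [1 0 1 0 1 0]
```

These two vectors match the run and val fields shown in the output above.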

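EDIT: Sadly, what is easy to extract by eye may not be so easy to extract with code. I suspect there is a much faster way to use the indices identified by longrun to get the desired subsequence.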

>> foo = [ nan 1 2 3 nan nan 4 5 6 nan nan 5 5 nan 5 nan 4 7 4 nan ];
>>  sfoo= seqle(isnan(foo))
sfoo = 
    run: [1 3 2 3 2 2 1 1 1 3 1]
    val: [1 0 1 0 1 0 1 0 1 0 1]
>> longrun = sfoo.run<2 |sfoo.val==0;
% longrun identifies which indices might be part of a run
>> longlong = seqle(longrun)
longlong = 
    run: [2 1 1 1 6]
    val: [1 0 1 0 1]
% longlong identifies the longest sequence of valid run 
>> lfoo = find(sfoo.run<2 |sfoo.val==0);
>> sbar = seqle(lfoo,1);
>> maxind=find(sbar.run==max(sbar.run),1,'first');
>> getlfoo = lfoo( sum(sbar.run(1:(maxind-1)))+1 ); 
% first value in longrun , which is  part of max run
% getbar finds end of run indices
>> getbar = getlfoo:(getlfoo+sbar.run(maxind)-1);
>> getsbar = sfoo.run(getbar);
% retrieve indices of input vector 
>> startit = sum(sfoo.run(1:(getbar(1)-1))) +1;
>> endit = startit+ ((sum(sfoo.run(getbar(1):getbar(end ) ) ) ) )-1;
>> therun = foo( startit:endit )
therun =
     5     5   NaN     5   NaN     4     7     4   NaN

Answer 3 (score: 0)

Well, who doesn't like a challenge? My solution is not as good as m.s.'s, but it is an alternative.

X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24]; %// input array
thresh =1;
X(isnan(X))= 0 ;

for i = 1:thresh
    Y(i,:) = circshift(X',-i); %//circular shift
end 

For some reason, the Matlab transpose operator "'" makes the formatting look odd.

D = X + sum(Y,1);

Discard = find(D==0)+thresh; %//give you the index of the part that needs to be discarded

chunk = find(X==0); %//Segment the Vector into segments delimited by NaNs

seriesOfZero = circshift(chunk',-1)' - chunk;

bigchunk =[1 chunk( find(seriesOfZero ~= 1)) size(X,2)]; %//Convert series of NaNs into 1 chunk

[values,DiscardChunk] = intersect(bigchunk,Discard);
DiscardChunk =  sort(DiscardChunk,'descend')

for t = 1:size(DiscardChunk,2)
  X(bigchunk(DiscardChunk(t)-1):bigchunk(DiscardChunk(t))) = []; %//Discard the data
end
X(X == 0) = NaN
%//End of Code
  
    

8 10 11 NaN 9 14 6 1 4 23 24

  

When

X = [18 3 nan nan nan 8 10 11 nan nan 9 14 6 1 nan nan nan 4 23 24]; %// input array
thresh = 2;

the output is

8 10 11 NaN 4 23 24