Question

Is it possible to find the non-NaN values of a vector while still allowing n NaNs? For example, if I have the following data:

X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24]; %// input array
thres = 1; % this is the number of nans to allow

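I would like to keep only the longest sequence of non-NaN values, but allow 'n' NaNs to remain in the data. So, say I am willing to keep 1 NaN, I would get an output of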

X_out = [8 10 11 nan 9 14 6 1 4 23 24]; %// output array

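That is, the two NaNs at the beginning have been removed because together they exceed the value of 'thres' above, but the third NaN stands on its own and can therefore be kept in the data. I would like to develop a method where thres can be set to any value.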

I can find the non-NaN values with

Y = ~isnan(X); %// convert to zeros and ones

Any ideas?

Answer 1

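To find the longest sequence containing at most threshold NaNs, we must find the start and the end of that sequence.

To generate all possible starting points, we can use hankel: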

H = hankel(X)

H =

    18     3   NaN   NaN     8    10    11   NaN     9    14     6     1     4    23    24
     3   NaN   NaN     8    10    11   NaN     9    14     6     1     4    23    24     0
   NaN   NaN     8    10    11   NaN     9    14     6     1     4    23    24     0     0
   NaN     8    10    11   NaN     9    14     6     1     4    23    24     0     0     0
     8    10    11   NaN     9    14     6     1     4    23    24     0     0     0     0
    10    11   NaN     9    14     6     1     4    23    24     0     0     0     0     0
    11   NaN     9    14     6     1     4    23    24     0     0     0     0     0     0
   NaN     9    14     6     1     4    23    24     0     0     0     0     0     0     0
     9    14     6     1     4    23    24     0     0     0     0     0     0     0     0
    14     6     1     4    23    24     0     0     0     0     0     0     0     0     0
     6     1     4    23    24     0     0     0     0     0     0     0     0     0     0
     1     4    23    24     0     0     0     0     0     0     0     0     0     0     0
     4    23    24     0     0     0     0     0     0     0     0     0     0     0     0
    23    24     0     0     0     0     0     0     0     0     0     0     0     0     0
    24     0     0     0     0     0     0     0     0     0     0     0     0     0     0
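
To see what hankel does on a smaller input, here is a short sketch. The three-element vector is made up for illustration and is not part of the question; each row is the input shifted left by one more position and padded with zeros on the right.

```matlab
% Toy input (assumption, not from the question):
% row k of hankel(v) is v(k:end) followed by zero padding.
v = [1 2 3];
H_small = hankel(v);
disp(H_small)
```

The padding zeros are not NaN, so they look like valid data. That is why the true length of each row has to be taken into account later on.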

Now we need to find the last valid element in each row. To do so, we can use cumsum:

C = cumsum(isnan(H),2)

C =

     0     0     1     2     2     2     2     3     3     3     3     3     3     3     3
     0     1     2     2     2     2     3     3     3     3     3     3     3     3     3
     1     2     2     2     2     3     3     3     3     3     3     3     3     3     3
     1     1     1     1     2     2     2     2     2     2     2     2     2     2     2
     0     0     0     1     1     1     1     1     1     1     1     1     1     1     1
     0     0     1     1     1     1     1     1     1     1     1     1     1     1     1
     0     1     1     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0

The end point of each row is the last position where the corresponding element of C is at most threshold:

threshold = 1;

T = C<=threshold

T =

 1     1     1     0     0     0     0     0     0     0     0     0     0     0     0
 1     1     0     0     0     0     0     0     0     0     0     0     0     0     0
 1     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 1     1     1     1     0     0     0     0     0     0     0     0     0     0     0
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1

The last valid element is found with:

[~,idx]=sort(T,2);
lastone=idx(:,end)' % transposed to a row so it lines up with the row vector of lengths used below

lastone =

 3     2     1     4    15    15    15    15    15    15    15    15    15    15    15
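
As a possible alternative to the sort step (this is a sketch under my own assumption, not what the answer uses): because cumsum never decreases along a row, every row of T is a run of ones followed by zeros, so counting the ones gives the position of the last one. It also returns 0 for a row that contains no ones at all, which can happen with threshold = 0 when a row starts with NaN.

```matlab
X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];
threshold = 1;

% Each row of T is a prefix of ones, so the row sum is the index of the last one.
T = cumsum(isnan(hankel(X)), 2) <= threshold;
lastone = sum(T, 2)';   % row vector; an all-zero row would give 0 here
disp(lastone)
```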

We must make sure that the actual length of each row is respected:

lengths = length(X):-1:1;
real_length = min(lastone,lengths);
[max_length,max_idx] = max(real_length)


max_length =

     11


max_idx =

     5
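
A tiny example may make the clipping step clearer. The input and the variable names below are made up for illustration, and the last valid position is taken as the number of ones per row (my own shortcut, assumed equivalent here because each row of the mask is a prefix of ones).

```matlab
% Toy input (assumption, not from the question)
X_small = [1 nan 2];
threshold_small = 0;

H_small = hankel(X_small);                       % [1 NaN 2; NaN 2 0; 2 0 0]
T_small = cumsum(isnan(H_small), 2) <= threshold_small;

unclipped     = sum(T_small, 2)';                % last row counts the zero padding too
lengths_small = numel(X_small):-1:1;             % real length of each shifted row
clipped       = min(unclipped, lengths_small);   % padding no longer counts

disp([unclipped; clipped])
```

Without the min, the last row would appear to hold a valid sequence of length 3, although only its first element comes from the data.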

If there are several sequences of the same maximum length, we simply take the first one and display it:

selected_max_idx = max_idx(1);
H(selected_max_idx, 1:max_length)


ans =

 8    10    11   NaN     9    14     6     1     4    23    24

Full script

X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];

H = hankel(X);
C = cumsum(isnan(H),2);

threshold = 1;

T = C<=threshold;
[~,idx]=sort(T,2);
lastone=idx(:,end)';

lengths = length(X):-1:1;
real_length = min(lastone,lengths);
[max_length,max_idx] = max(real_length);

selected_max_idx = max_idx(1);
H(selected_max_idx, 1:max_length)

Answer 2

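Approach 1: convolution

One possible approach is to convolve Y = double(~isnan(X)); with a window of n ones, where n is decreased until an acceptable subsequence is found. "Acceptable" means that the subsequence contains at least n-thres ones, that is, the convolution gives at least n-thres.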

Y = double(~isnan(X));
for n = numel(Y):-1:1 %// try all possible sequence lengths
    w = find(conv(Y,ones(1,n),'valid')>=n-thres); %// is there any acceptable subsequence?
    if ~isempty(w)
        break
    end
end
result = X(w:w+n-1);
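
To illustrate what the convolution returns, here is a small sketch on a made-up mask (the vector and the names are assumptions for illustration only). With the 'valid' option, each output entry is the number of non-NaN values inside one full window of length n.

```matlab
% Toy mask (assumption): 1 = non-NaN, 0 = NaN
Y_small = [1 1 0 1 1 1];
n_small = 3;

% One count per full window: [1 1 0], [1 0 1], [0 1 1], [1 1 1]
counts = conv(Y_small, ones(1, n_small), 'valid');
disp(counts)
```

A window is acceptable when its count is at least n-thres, which is the test applied inside the loop.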

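Approach 2: cumulative sum

Convolving Y with a window of n ones (as in approach 1) is equivalent to computing the cumulative sum of Y and then taking differences at a spacing of n. This is more efficient in terms of the number of operations.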

Y = double(~isnan(X));
Z = cumsum(Y);
for n = numel(Y):-1:1
    w = find([Z(n) Z(n+1:end)-Z(1:end-n)]>=n-thres);
    if ~isempty(w)
        break
    end
end
result = X(w:w+n-1);
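
The equivalence can be checked numerically. The sketch below uses the data from the question and a single window length, n = 11, chosen only as an example.

```matlab
X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];
Y = double(~isnan(X));
Z = cumsum(Y);
n = 11;   % example window length

by_conv   = conv(Y, ones(1, n), 'valid');     % window counts as in approach 1
by_cumsum = [Z(n) Z(n+1:end)-Z(1:end-n)];     % same counts from the cumulative sum

same = max(abs(by_conv - by_cumsum)) < 1e-9;  % true when both agree
disp(by_cumsum)
```

The first entry is Z(n) because the window starting at index 1 has nothing to subtract.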

Approach 3: 2D convolution

This essentially computes all iterations of the loop in approach 1 at once.

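Y = double(~isnan(X));
N = numel(Y);
z = conv2(Y, tril(ones(N))); %// z(n,j) = number of non-NaNs in the window of length n ending at j
z = z(:, 1:N); %// drop windows that run past the end of Y
fullwin = triu(ones(N)) > 0; %// keep only full windows (j >= n); partial ones would give w < 1
[nn, ww] = find(fullwin & bsxfun(@ge, z, (1:N).'-thres)); %'
[n, ind] = max(nn);
w = ww(ind)-n+1;
result = X(w:w+n-1);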
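
As a check on how the rows of the 2D convolution are laid out, the sketch below compares one row with the corresponding iteration of the loop in approach 1. The mask and the names are made up for illustration.

```matlab
% Toy mask (assumption): 1 = non-NaN, 0 = NaN
Y_small = [1 1 0 1];
N_small = numel(Y_small);

% Row n, column j holds the sum of Y_small over the n positions ending at j.
z_small = conv2(Y_small, tril(ones(N_small)));

n = 2;
row_n  = z_small(n, n:N_small);                 % full windows of length n
loop_n = conv(Y_small, ones(1, n), 'valid');    % what approach 1 computes for this n
disp([row_n; loop_n])
```

Columns 1 to n-1 of row n, and columns beyond numel(Y_small), correspond to windows that hang off the start or the end of the data, so they count fewer than n elements.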

Answer 3

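Let's try my favorite tool: RLE. Matlab doesn't have a direct function for it, so use my "seqle", posted on the File Exchange. Seqle's default is to return the run-length encoding. So: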

>> foo = [ nan 1 2 3 nan nan 4 5 6 nan 5 5 5 ];

>> seqle(isnan(foo))
ans = 
    run: [1 3 2 3 1 3]
    val: [1 0 1 0 1 0]

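"run" is the length of the current run; "val" is its value. In this case, val==1 means the values are NaN and val==0 means they are numeric. You can see that it is relatively easy to extract the longest sequence of "run" values that satisfy the condition val==0 | run < 2, i.e. no more than one NaN in a row. Then just take the cumulative indices of that run, and that is the subset of foo you want.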
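
For readers who do not have seqle at hand, a run-length encoding of the NaN mask can be sketched with diff and find. This is a minimal stand-in under my own assumptions, not the seqle implementation, and it only reproduces the run and val fields for this example.

```matlab
foo = [nan 1 2 3 nan nan 4 5 6 nan 5 5 5];
v = double(isnan(foo));                    % 1 = NaN, 0 = numeric

% A run ends wherever the next element differs, and at the last element.
edges = [find(diff(v) ~= 0), numel(v)];
run = diff([0, edges]);                    % length of each run
val = v(edges);                            % value carried by each run
disp([run; val])
```

The start index of run k in foo is then sum(run(1:k-1))+1, which is the same cumulative-index idea described above.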

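EDIT: sadly, what is easy to extract by eye may not be so easy to extract in code. I suspect there is a much faster way to use the indices identified by longrun to get the desired subsequence.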

>> foo = [ nan 1 2 3 nan nan 4 5 6 nan nan 5 5 nan 5 nan 4 7 4 nan ];
>>  sfoo= seqle(isnan(foo))
sfoo = 
    run: [1 3 2 3 2 2 1 1 1 3 1]
    val: [1 0 1 0 1 0 1 0 1 0 1]
>> longrun = sfoo.run<2 |sfoo.val==0;
% longrun identifies which indices might be part of a run
% longlong identifies the longest sequence of valid run 
>> longlong = seqle(longrun)
longlong = 
    run: [2 1 1 1 6]
    val: [1 0 1 0 1]
>> lfoo = find(sfoo.run<2 |sfoo.val==0);
>> sbar = seqle(lfoo,1);
>> maxind=find(sbar.run==max(sbar.run),1,'first');
>> getlfoo = lfoo( sum(sbar.run(1:(maxind-1)))+1 ); 
% first value in longrun , which is  part of max run
% getbar finds end of run indices
>> getbar = getlfoo:(getlfoo+sbar.run(maxind)-1);
>> getsbar = sfoo.run(getbar);
% retrieve indices of input vector 
>> startit = sum(sfoo.run(1:(getbar(1)-1))) +1;
>> endit = startit+ ((sum(sfoo.run(getbar(1):getbar(end ) ) ) ) )-1;
>> therun = foo( startit:endit )
therun =
     5     5   NaN     5   NaN     4     7     4   NaN

Answer 4

Well, who doesn't like a challenge. My solution is not as good as m.s.'s, but it is an alternative.

X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24]; %// input array
thresh =1;
X(isnan(X))= 0 ;

for i = 1:thresh
    Y(i,:) = circshift(X',-i); %//circular shift
end

For some reason, Matlab's transpose operator "'" makes the formatting look weird.

D = X + sum(Y,1);

Discard = find(D==0)+thresh; %//give you the index of the part that needs to be discarded

chunk = find(X==0); %//Segment the Vector into segments delimited by NaNs

seriesOfZero = circshift(chunk',-1)' - chunk;

bigchunk =[1 chunk( find(seriesOfZero ~= 1)) size(X,2)]; %//Convert series of NaNs into 1 chunk

[values,DiscardChunk] = intersect(bigchunk,Discard);
DiscardChunk =  sort(DiscardChunk,'descend')

for t = 1:size(DiscardChunk,2)
  X(bigchunk(DiscardChunk(t)-1):bigchunk(DiscardChunk(t))) = []; %//Discard the data
end
X(X == 0) = NaN
%//End of Code

8 10 11 NaN 9 14 6 1 4 23 24

When:

X = [18 3 nan nan nan 8 10 11 nan nan 9 14 6 1 nan nan nan 4 23 24]; %// input array
thresh = 2;

8 10 11 NaN 4 23 24

Find the longest sequence of non-NaN values but allow for a threshold
