Question

I have an array A and a reference array B. A is at least as large as B, e.g.

A = [2,100,300,793,1300,1500,1810,2400]
B = [4,305,789,1234,1890]

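B is actually the position of the peaks in a signal at a specified time, and A contains the position of the peaks at a later time. But some of the elements in A are not the peaks I want (possibly due to noise, etc.), and I want to find the 'real' ones in A based on B. The 'real' elements in A should be close to those in B, and in the example given above, the 'real' ones in A should be A'=[2,300,793,1300,1810]. It should be obvious in this example that 100,1500,2400 are not the ones we want, as they are quite far from any of the elements in B. How can I code this in the most efficient/accurate way in python/matlab?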

Answer 1

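Approach #1: Using NumPy broadcasting, we can compute the absolute element-wise differences between the two input arrays and use an appropriate threshold to filter the unwanted elements out of A. For the given sample inputs, a threshold of 90 seems to work.

Thus, we would have an implementation like so -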

import numpy as np
# A and B are 1D NumPy arrays
thresh = 90
Aout = A[(np.abs(A[:,None] - B) < thresh).any(1)]

Sample run -

In [69]: A
Out[69]: array([   2,  100,  300,  793, 1300, 1500, 1810, 2400])

In [70]: B
Out[70]: array([   4,  305,  789, 1234, 1890])

In [71]: A[(np.abs(A[:,None] - B) < 90).any(1)]
Out[71]: array([   2,  300,  793, 1300, 1810])
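To see why a threshold of 90 works for this sample, it can help to look at the intermediate distances. The following is a minimal sketch that only reuses the sample arrays from the question; the variable names `dists` and `nearest` are illustrative and not part of the answer's code.

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])

# A[:, None] has shape (8, 1); subtracting B (shape (5,)) broadcasts to (8, 5),
# so row i holds the distances from A[i] to every element of B.
dists = np.abs(A[:, None] - B)

# Distance from each element of A to its nearest element of B.
nearest = dists.min(axis=1)
print(nearest.tolist())  # [2, 96, 5, 4, 66, 266, 80, 510]

# Keeping the elements whose nearest distance is below 90 gives the wanted peaks.
print(A[nearest < 90].tolist())  # [2, 300, 793, 1300, 1810]
```

For this particular sample, the largest distance among the wanted peaks is 80 and the smallest among the unwanted ones is 96, so any threshold above 80 and up to 96 would select the same elements. For other data the threshold has to be chosen from what is known about the signal.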

Approach #2: Based on this post, here's a memory-efficient approach using np.searchsorted, which could be crucial for large arrays -

def searchsorted_filter(a, b, thresh):
    choices = np.sort(b) # if b is already sorted, skip it
    lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size-1)
    ridx = (np.searchsorted(choices, a, 'right')-1).clip(min=0)
    cl = np.take(choices,lidx) # Or choices[lidx]
    cr = np.take(choices,ridx) # Or choices[ridx]
    return a[np.minimum(np.abs(a - cl), np.abs(a - cr)) < thresh]

Sample run -

In [95]: searchsorted_filter(A,B, thresh = 90)
Out[95]: array([   2,  300,  793, 1300, 1810])
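The steps inside searchsorted_filter can be traced on the sample inputs. This is only an illustrative walk-through under the assumption that B is already sorted (so the np.sort step is skipped); it prints the two candidate neighbours that the function compares for each element of A.

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])  # already sorted

# Index of the first element of B that is >= each element of A,
# clipped so that values beyond the last element of B stay in range.
lidx = np.searchsorted(B, A, side='left').clip(max=B.size - 1)
# Index of the last element of B that is <= each element of A,
# clipped so that values before the first element of B stay in range.
ridx = (np.searchsorted(B, A, side='right') - 1).clip(min=0)

print(lidx.tolist())     # [0, 1, 1, 3, 4, 4, 4, 4]
print(ridx.tolist())     # [0, 0, 0, 2, 3, 3, 3, 4]
print(B[lidx].tolist())  # [4, 305, 305, 1234, 1890, 1890, 1890, 1890]
print(B[ridx].tolist())  # [4, 4, 4, 789, 1234, 1234, 1234, 1890]

# The nearest element of B is whichever of the two candidates is closer.
nearest = np.minimum(np.abs(A - B[lidx]), np.abs(A - B[ridx]))
print(nearest.tolist())  # [2, 96, 5, 4, 66, 266, 80, 510]
```

Every temporary here has the same length as A, which is where the memory saving over the broadcasting approach comes from.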

Runtime test

In [104]: A = np.sort(np.random.randint(0,100000,(1000)))

In [105]: B = np.sort(np.random.randint(0,100000,(400)))

In [106]: out1 = A[(np.abs(A[:,None] - B) < 10).any(1)]

In [107]: out2 = searchsorted_filter(A,B, thresh = 10)

In [108]: np.allclose(out1, out2)  # Verify results
Out[108]: True

In [109]: %timeit A[(np.abs(A[:,None] - B) < 10).any(1)]
100 loops, best of 3: 2.74 ms per loop

In [110]: %timeit searchsorted_filter(A,B, thresh = 10)
10000 loops, best of 3: 85.3 µs per loop
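The memory argument can be made concrete with a rough estimate. The sketch below assumes 8-byte integers, and the helper name `broadcast_temp_bytes` is hypothetical; it counts only one of the temporaries that the broadcasting approach creates (np.abs and the comparison allocate more).

```python
import numpy as np

def broadcast_temp_bytes(n_a, n_b, itemsize=8):
    # Size of one (n_a, n_b) temporary such as A[:, None] - B,
    # assuming 8-byte integers unless itemsize says otherwise.
    return n_a * n_b * itemsize

small = broadcast_temp_bytes(1000, 400)      # sizes used in the runtime test above
large = broadcast_temp_bytes(100000, 40000)  # hypothetical larger inputs
print(small / 1e6, "MB")  # 3.2 MB
print(large / 1e9, "GB")  # 32.0 GB

# Cross-check the small case against an actual broadcasted temporary.
A = np.arange(1000, dtype=np.int64)
B = np.arange(400, dtype=np.int64)
temp = A[:, None] - B
print(temp.shape, temp.nbytes == small)  # (1000, 400) True
```

The searchsorted-based version only allocates arrays of length len(A), so its memory use grows linearly rather than with the product of the two sizes.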

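Jan 2018 update with a further performance boost

We can avoid the second call, np.searchsorted(..., 'right'), by reusing the indices obtained from np.searchsorted(..., 'left'), and we can also avoid the absolute-value computations, like so -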

def searchsorted_filter_v2(a, b, thresh):
    N = len(b)

    choices = np.sort(b) # if b is already sorted, skip it

    l = np.searchsorted(choices, a, 'left')
    l_invalid_mask = l==N
    l[l_invalid_mask] = N-1
    left_offset = choices[l]-a
    left_offset[l_invalid_mask] *= -1    

    r = (l - (left_offset!=0))
    r_invalid_mask = r<0
    r[r_invalid_mask] = 0
    r += l_invalid_mask
    right_offset = a-choices[r]
    right_offset[r_invalid_mask] *= -1

    out = a[(left_offset < thresh) | (right_offset < thresh)]
    return out

Updated timings to test the further speedup -

In [388]: np.random.seed(0)
     ...: A = np.random.randint(0,1000000,(100000))
     ...: B = np.unique(np.random.randint(0,1000000,(40000)))
     ...: np.random.shuffle(B)
     ...: thresh = 10
     ...: 
     ...: out1 = searchsorted_filter(A, B, thresh)
     ...: out2 = searchsorted_filter_v2(A, B, thresh)
     ...: print(np.allclose(out1, out2))
True

In [389]: %timeit searchsorted_filter(A, B, thresh)
10 loops, best of 3: 24.2 ms per loop

In [390]: %timeit searchsorted_filter_v2(A, B, thresh)
100 loops, best of 3: 13.9 ms per loop

Digging deeper -

In [396]: a = A; b = B

In [397]: N = len(b)
     ...: 
     ...: choices = np.sort(b) # if b is already sorted, skip it
     ...: 
     ...: l = np.searchsorted(choices, a, 'left')

In [398]: %timeit np.sort(B)
100 loops, best of 3: 2 ms per loop

In [399]: %timeit np.searchsorted(choices, a, 'left')
100 loops, best of 3: 10.3 ms per loop

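It seems that searchsorted and sort take up almost all of the runtime (about 12.3 ms of the 13.9 ms measured), and both appear essential to this method. So it doesn't look like it can be improved much further while staying with this sort-based approach.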

Answer 2

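You could find the distance from each point in A to each value in B using bsxfun, and then find the index of the point in A that is closest to each value in B using min.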

[dists, ind] = min(abs(bsxfun(@minus, A, B.')), [], 2)

If you're on R2016b or later, bsxfun can be dropped thanks to automatic broadcasting

[dists, ind] = min(abs(A - B.'), [], 2);

If you suspect that some values in B aren't real peaks, you can set a threshold and discard any matches whose distance exceeds it.

threshold = 90;
ind = ind(dists < threshold);

Then we can use ind to index into A

output = A(ind);
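For readers working in Python, the same steps might be written in NumPy roughly as follows. This is a sketch of a direct translation, not code from the answer; the names `d`, `ind`, `dists` and `output` simply mirror the MATLAB variables.

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])
threshold = 90

# Row j holds the distances from B[j] to every element of A: shape (len(B), len(A)).
d = np.abs(A - B[:, None])

# Index in A of the element nearest to each B[j], and the matching distances.
ind = d.argmin(axis=1)
dists = d[np.arange(B.size), ind]

# Drop the matches that are too far away, then index into A.
output = A[ind[dists < threshold]]
print(ind.tolist())     # [0, 2, 3, 4, 6]
print(dists.tolist())   # [2, 5, 4, 66, 80]
print(output.tolist())  # [2, 300, 793, 1300, 1810]
```

Note the difference from Approach #1 in Answer 1: this returns at most one element of A per element of B (its nearest neighbour), whereas Approach #1 keeps every element of A that lies within the threshold of any element of B.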

Answer 3

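You can use MATLAB's interp1 function, which does exactly what you want. The nearest option finds the nearest points, and there is no need to specify a threshold.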

out = interp1(A, A, B, 'nearest', 'extrap');
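A comparable nearest-neighbour lookup can be sketched in NumPy with np.searchsorted. This is not a port of interp1; the helper name `nearest_in_sorted` is hypothetical, it assumes the first array is sorted and has at least two elements, and its tie-breaking (ties go to the smaller value) may differ from MATLAB's.

```python
import numpy as np

def nearest_in_sorted(a_sorted, b):
    # For each value in b, return the closest value in a_sorted.
    # Assumes a_sorted is sorted ascending with at least two elements.
    idx = np.searchsorted(a_sorted, b).clip(1, len(a_sorted) - 1)
    left = a_sorted[idx - 1]   # neighbour at or below (or the first element)
    right = a_sorted[idx]      # neighbour at or above (or the last element)
    # Pick whichever neighbour is closer; ties go to the left neighbour.
    return np.where(b - left <= right - b, left, right)

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])
out = nearest_in_sorted(A, B)
print(out.tolist())  # [2, 300, 793, 1300, 1810]
```

Because no threshold is involved, every element of B receives a match, even when its nearest element of A is far away. Values of B outside the range of A are matched to the first or last element of A.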

Comparison with the other method:

A = sort(randi([0,1000000],1,10000));

B = sort(randi([0,1000000],1,4000));

disp('---interp1----------------')
tic
    out = interp1(A, A, B, 'nearest', 'extrap');
toc
disp('---subtraction with threshold------')
%numpy version is the same
tic
    [dists, ind] = min(abs(bsxfun(@minus, A, B.')), [], 2);
toc

Result:

---interp1----------------
Elapsed time is 0.00778699 seconds.
---subtraction with threshold------
Elapsed time is 0.445485 seconds.

interp1 can handle inputs larger than 10000 and 4000, whereas the subtraction method runs into an out-of-memory error.

Select close matches from one array based on another reference array
