Question

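Suppose you have a sorted array of numbers, sorted_array, and a threshold, threshold. What is the fastest way to find the longest subarray in which all values lie within the threshold of each other? In other words, find indices i and j such that:

sorted_array[j] - sorted_array[i] <= threshold
j - i is maximal

In case of a tie, return the pair with the smallest i.

I already have a loop-based solution, which I will post as an answer, but I am curious whether there is a better way, or a way that avoids the loop by using a vector-capable language or library such as NumPy.

Example input and output: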

>>> sorted_array = [0, 0.7, 1, 2, 2.5]
>>> longest_subarray_within_threshold(sorted_array, 0.2)
(0, 0)
>>> longest_subarray_within_threshold(sorted_array, 0.4)
(1, 2)
>>> longest_subarray_within_threshold(sorted_array, 0.8)
(0, 1)
>>> longest_subarray_within_threshold(sorted_array, 1)
(0, 2)
>>> longest_subarray_within_threshold(sorted_array, 1.9)
(1, 4)
>>> longest_subarray_within_threshold(sorted_array, 2)
(0, 3)
>>> longest_subarray_within_threshold(sorted_array, 2.6)
(0, 4)
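
For checking faster implementations against the examples above, a brute-force reference can be handy. The following is only a sketch: the name longest_subarray_brute_force is hypothetical, and it assumes a non-negative threshold. It tries every pair (i, j) in O(n^2) time and keeps the first longest one, so ties resolve to the smallest i.

```python
def longest_subarray_brute_force(sorted_array, threshold):
    """Quadratic reference implementation (hypothetical helper, for testing only)."""
    best = (0, 0)
    n = len(sorted_array)
    for i in range(n):
        for j in range(i, n):
            within = sorted_array[j] - sorted_array[i] <= threshold
            # Strict '>' keeps the earliest (smallest i) pair on ties.
            if within and j - i > best[1] - best[0]:
                best = (i, j)
    return best


sorted_array = [0, 0.7, 1, 2, 2.5]
for threshold in (0.2, 0.4, 0.8, 1, 1.9, 2, 2.6):
    print(threshold, longest_subarray_brute_force(sorted_array, threshold))
```

It is far too slow for large inputs, but it restates the specification directly, which makes it a reasonable oracle for randomized tests.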

Answer 1

Here is a simple loop-based solution in Python:

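def longest_subarray_within_threshold(sorted_array, threshold):
    result = (0, 0)
    longest = 0
    i = j = 0  # two pointers: i is the window start, j the candidate end
    end = len(sorted_array)
    while i < end:
        if j < end and sorted_array[j] - sorted_array[i] <= threshold:
            # The window [i, j] fits; record it only if strictly longer,
            # so that ties keep the smallest i.
            current_distance = j - i
            if current_distance > longest:
                longest = current_distance
                result = (i, j)
            j += 1
        else:
            # Window too wide (or j ran off the end): advance the start.
            i += 1
    return result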

Answer 2

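Most likely, the OP's own answer is the best possible algorithm, since it is O(n). However, pure-Python overhead makes it very slow. That overhead can easily be reduced by compiling the algorithm with numba: with the current version (0.28.1 at the time of writing), no manual type annotations are needed, and decorating your function with numba's JIT decorator is enough.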
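
As a rough sketch of what the numba route might look like, the loop from the OP's answer can be decorated as follows. This assumes numba.njit as the decorator, which may not be the exact one the answer had in mind, and the function name longest_subarray_numba is hypothetical. The import is guarded so that the sketch still runs (uncompiled) when numba is not installed.

```python
try:
    from numba import njit  # assumption: numba is installed
except ImportError:
    def njit(func):
        # Fallback: leave the function as plain Python if numba is missing.
        return func


@njit
def longest_subarray_numba(sorted_array, threshold):
    # Same two-pointer loop as the pure-Python answer.
    result = (0, 0)
    longest = 0
    i = 0
    j = 0
    end = len(sorted_array)
    while i < end:
        if j < end and sorted_array[j] - sorted_array[i] <= threshold:
            if j - i > longest:
                longest = j - i
                result = (i, j)
            j += 1
        else:
            i += 1
    return result
```

With numba, the first call includes compilation time, so timings are only meaningful from the second call on. Passing a NumPy float array rather than a Python list is the usual choice for compiled functions.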

However, if you would rather not depend on numba, there is a numpy algorithm that runs in O(n log n):

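import numpy as np

def algo_burnpanck(sorted_array, thresh):
    sorted_array = np.asarray(sorted_array)  # also accept plain lists
    # limit[i] is one past the last index j with sorted_array[j] <= sorted_array[i] + thresh
    limit = np.searchsorted(sorted_array, sorted_array + thresh, 'right')
    distance = limit - np.arange(limit.size)
    best = np.argmax(distance)  # first maximum, so ties keep the smallest i
    return best, limit[best] - 1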
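
The idea behind the searchsorted approach can also be sketched without NumPy: for each start index i, a binary search finds the last index whose value does not exceed sorted_array[i] + threshold. The version below uses the standard-library bisect module. The name longest_subarray_bisect is hypothetical, and a non-negative threshold is assumed.

```python
from bisect import bisect_right


def longest_subarray_bisect(sorted_array, threshold):
    """O(n log n): one binary search per start index (illustrative sketch)."""
    best = (0, 0)
    for i, value in enumerate(sorted_array):
        # bisect_right returns one past the last element <= value + threshold.
        j = bisect_right(sorted_array, value + threshold) - 1
        # Strict '>' keeps the smallest i on ties.
        if j - i > best[1] - best[0]:
            best = (i, j)
    return best


print(longest_subarray_bisect([0, 0.7, 1, 2, 2.5], 1.9))  # (1, 4)
```

One caveat applies to this sketch and to the searchsorted formulation alike: comparing against value + threshold is not always identical in floating point to testing sorted_array[j] - sorted_array[i] <= threshold, so results may differ from the loop-based answer in rare borderline cases.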

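I did a quick profiling run on my own machine of the two earlier answers (the OP's and Divakar's), together with my numpy algorithm and the numba version of the OP's algorithm: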

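# Run in IPython/Jupyter (%timeit is an IPython magic). Assumes numpy is
# imported as np and the four algo_* functions are already defined.
thresh = 1
for n in [100, 10000]:
    sorted_array = np.sort(np.random.randn(n,))
    for f in [algo_user1475412,algo_Divakar,algo_burnpanck,algo_user1475412_numba]:
        a,b = f(sorted_array, thresh)
        d = b-a
        diff = sorted_array[b]-sorted_array[a]
        # Smallest spread of any window one element longer than the result
        closestlonger = np.min(sorted_array[d+1:]-sorted_array[:-d-1])
        # The result must fit the threshold, and no longer window may fit
        assert sorted_array[b]-sorted_array[a]<=thresh
        assert closestlonger>thresh
        print('f=%s, n=%d thresh=%s:'%(f.__name__,n,thresh))#,d,a,b,diff,closestlonger)
        %timeit f(sorted_array, thresh)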
The results:

f=algo_user1475412, n=100 thresh=1:
10000 loops, best of 3: 111 µs per loop
f=algo_Divakar, n=100 thresh=1:
10000 loops, best of 3: 74.6 µs per loop
f=algo_burnpanck, n=100 thresh=1:
100000 loops, best of 3: 9.38 µs per loop
f=algo_user1475412_numba, n=100 thresh=1:
1000000 loops, best of 3: 764 ns per loop
f=algo_user1475412, n=10000 thresh=1:
100 loops, best of 3: 12.1 ms per loop
f=algo_Divakar, n=10000 thresh=1:
1 loop, best of 3: 1.76 s per loop
f=algo_burnpanck, n=10000 thresh=1:
1000 loops, best of 3: 308 µs per loop
f=algo_user1475412_numba, n=10000 thresh=1:
10000 loops, best of 3: 82.9 µs per loop

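At 100 elements, the O(n^2) solution using numpy barely beats the O(n) pure-Python solution, and its scaling soon makes that algorithm useless. The O(n log n) algorithm holds up even at 10000 elements, but the numba approach is unbeaten throughout.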

Answer 3

Here is a vectorized approach using broadcasting:

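import numpy as np

def longest_thresh_subarray(sorted_array, thresh):
    sorted_array = np.asarray(sorted_array)  # also accept plain lists
    # diffs[r, c] = sorted_array[r] - sorted_array[c] for every pair of indices
    diffs = sorted_array[:, None] - sorted_array
    r = np.arange(sorted_array.size)
    # Keep only pairs whose end index r lies strictly after the start index c
    valid_mask = r[:, None] > r
    mask = (diffs <= thresh) & valid_mask
    # Start index (column) with the most valid end indices
    bestcolID = mask.sum(0).argmax()
    idx = np.nonzero(mask[:, bestcolID])[0]
    if len(idx) == 0:
        out = (0, 0)
    else:
        out = idx[0] - 1, idx[-1]
    return out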
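
To make the broadcasting step more concrete, here is a small illustrative run on the array from the question with a threshold of 1. It is only a sketch of the intermediate arrays, not part of the original answer, and it assumes NumPy is available.

```python
import numpy as np

a = np.array([0.0, 0.7, 1.0, 2.0, 2.5])
thresh = 1.0

# a[:, None] has shape (5, 1); subtracting the (5,) array broadcasts to (5, 5),
# so diffs[r, c] = a[r] - a[c] for every pair of indices.
diffs = a[:, None] - a

r = np.arange(a.size)
# True where the end index r comes after the start index c and the pair fits.
mask = (diffs <= thresh) & (r[:, None] > r)

# Number of valid end indices for each start index (column).
counts = mask.sum(axis=0)
print(counts)           # [2 1 1 1 0]
print(counts.argmax())  # 0, i.e. the best window starts at index 0
```

The intermediate arrays are n by n, which fits the O(n^2) behaviour reported in the timings above: for 10000 float64 elements, diffs alone takes 800 MB.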

How to find the longest subarray within a threshold?
