Numpy vectorized algorithm to find the first future element greater than the current element

Time: 2011-10-25 22:52:31

Tags: python numpy vectorization

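I have a time series A. I want to generate another time series B such that

B[i] = j, where j is the first index greater than i such that A[j] > A[i].

Is there a fast way to do this in numpy?

Thanks.

[EDITED]: preferably using only O(n) space.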
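
To make the definition concrete, here is a minimal brute-force sketch. The function name `first_greater_naive`, the sample array, and the use of -1 for "no later element is greater" are assumptions made for illustration; the question itself does not say what B[i] should be in that case.

```python
import numpy as np

def first_greater_naive(A):
    """Reference implementation: O(n^2) time, O(n) extra space."""
    n = len(A)
    B = np.full(n, -1, dtype=int)  # -1 is an assumed "no such index" marker
    for i in range(n):
        for j in range(i + 1, n):
            if A[j] > A[i]:  # strict comparison, as in the definition
                B[i] = j
                break
    return B

A = np.array([3, 1, 4, 1, 5, 9, 2, 6])
print(first_greater_naive(A).tolist())  # [2, 2, 4, 4, 5, -1, 7, -1]
```

A slow version like this is mainly useful as a reference to check faster implementations against.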

2 answers:

Answer 0 (score: 9)

Insufficiently tested; use at your own risk.

import numpy

a = numpy.random.random(100)

# a_by_a[i,j] = a[j] > a[i]
a_by_a = a[numpy.newaxis,:] > a[:,numpy.newaxis]
# by taking the upper triangular, we ignore all cases where j < i
a_by_a = numpy.triu(a_by_a)
# argmax will give the first index with the highest value (1 in this case);
# note that it returns 0 for rows in which no later element is greater
print(numpy.argmax(a_by_a, axis = 1))
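
One caveat of the argmax approach is that a row containing no True value also yields 0, which cannot be told apart from a real index 0. A possible way to handle this, sketched below on a small made-up array, is to mask such rows with -1. The -1 marker and the variable names are choices made for this sketch, and the approach still needs O(n^2) memory.

```python
import numpy as np

a = np.array([3, 1, 4, 1, 5, 9, 2, 6])

# greater[i, j] is True when a[j] > a[i]; triu with k=1 keeps only j > i
greater = np.triu(a[np.newaxis, :] > a[:, np.newaxis], k=1)
first = greater.argmax(axis=1)
# argmax returns 0 for an all-False row, so mark those rows explicitly
b = np.where(greater.any(axis=1), first, -1)
print(b.tolist())  # [2, 2, 4, 4, 5, -1, 7, -1]
```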

Lower-memory version:

a = numpy.random.random(100)

@numpy.vectorize
def values(i):
    try:
        return (a[i:] > a[i]).nonzero()[0][0] + i
    except IndexError:
        return -1 # no valid values found

b = values(numpy.arange(100))

Faster version:

import numpy as np

# A is the input series (a 1-D numpy array)
@np.vectorize
def values(i):
    try:
        return next(idx for idx, value in enumerate(A[i+1:]) if value > A[i]) + i + 1
    except StopIteration:
        return -1 # no valid values found
b = values(np.arange(len(A)))

Even faster version:

def future6(A):
    # list of tuples (index into A, value in A)
    # invariant: indexes and values in sorted order
    known = []
    result = []
    for idx in range(len(A) - 1, -1, -1):
        value = A[idx]
        # since known is sorted a binary search could be applied here
        # I haven't bothered

        # anything lower than the current value
        # cannot possibly be used again, since this value will be the first index
        # instead of those
        known = [(x,y) for x,y in known if y > value]


        if known:
            # all values in known are > current value
            # they are reverse sorted by index
            # the value with the lowest index is last
            result.append(known[-1][0])
        else:
            # no values exist this high, report -1
            result.append(-1)
        # add to end of the list to maintain invariant
        known.append( (idx, value) )

    # let numpy worry about reversing the array
    return np.array(result)[::-1]
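
The comment in the code notes that a binary search could be applied because `known` is sorted. A minimal sketch of that idea follows. It is not the author's code: the name `future6_bisect`, the two parallel lists, and the trick of storing negated values (so the list is ascending, as the standard `bisect` module expects) are all assumptions made for illustration, and the negation assumes a signed or floating-point dtype.

```python
import bisect
import numpy as np

def future6_bisect(A):
    # Parallel lists instead of a list of tuples (a choice made for this sketch).
    # neg_values holds -A[k], so it is sorted in ascending order.
    idxs, neg_values = [], []
    result = []
    for idx in range(len(A) - 1, -1, -1):
        value = A[idx]
        # entries before `cut` satisfy -A[k] < -value, i.e. A[k] > value
        cut = bisect.bisect_left(neg_values, -value)
        # everything from `cut` on is <= value and can never be a target again
        del idxs[cut:]
        del neg_values[cut:]
        # the last surviving entry has the lowest index among the larger values
        result.append(idxs[-1] if idxs else -1)
        idxs.append(idx)
        neg_values.append(-value)
    return np.array(result)[::-1]

print(future6_bisect(np.array([3, 1, 4, 1, 5, 9, 2, 6])).tolist())
# [2, 2, 4, 4, 5, -1, 7, -1]
```

The search finds the cut point in O(log n), but the entries after it still have to be deleted, so this mainly saves comparisons rather than changing the overall amount of work.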

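Credit to cyborg for some of the ideas used here.

Algorithm differences

cyborg showed that there are significant differences between the algorithms depending on the data fed into them. I gathered some statistics from running the algorithms to see if I could figure out what is going on.

Random data: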

Average distance between value and its target: 9
Average length of ascends list: 24
Average length of segment in ascends list: 1.33
Average length of known list: 9.1

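Since the lists are really short, the ascends algorithm mostly degenerates into a linear search. It does clear out ascends that cannot be used in the future, so it is still noticeably better than a plain linear search.

Oscillating: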

Average distance between value and its target: 31.46
Average length of ascends list: 84
Average length of segment in ascends list: 1.70
Average length of known list: 57.98

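The oscillations tend to push the different pieces further apart. This naturally hampers the linear search algorithm. Both of the "smarter" algorithms have to keep track of additional data. My algorithm degrades considerably because it scans over its data each time. The ascends algorithm touches less of its data and does better.

Ascending: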

Average distance between value and its target: 2.57
Average length of ascends list: 40
Average length of segment in ascends list: 3.27
Average length of known list: 3037.97

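It is clear why my algorithm has problems: it has to track a huge number of ascending values. The short distance between a value and its target explains the good performance of the linear search. Ascends still is not getting very long segments to work with.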
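
The "average distance between value and its target" figure can be reproduced from any result array B. The sketch below is one plausible way to compute it. How the statistics above were actually collected is not shown, so the helper name `average_target_distance` and the decision to skip elements without a target (B[i] == -1) are assumptions.

```python
import numpy as np

def average_target_distance(B):
    """Mean of B[i] - i over the positions that have a target (B[i] != -1)."""
    B = np.asarray(B)
    idx = np.arange(len(B))
    has_target = B != -1  # assumption: positions without a target are skipped
    return (B[has_target] - idx[has_target]).mean()

# B for the made-up series A = [3, 1, 4, 1, 5, 9, 2, 6]
B = [2, 2, 4, 4, 5, -1, 7, -1]
print(average_target_distance(B))  # distances 2, 1, 2, 1, 1, 1 -> about 1.33
```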

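Better algorithms

There is no reason for my algorithm to do a linear search over the data. The list is already sorted, so we only need to remove the small values from its end.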

def future6(A):
    # list of tuples (index into A, value in A)
    # invariant: indexes and values in sorted order
    known = []
    result = []
    for idx in range(len(A) - 1, -1, -1):
        value = A[idx]
        # known is sorted, so the values to discard all sit at its end;
        # no scan over the whole list is needed

        # anything not greater than the current value
        # cannot possibly be used again, since this value will be the first index
        # instead of those
        while known and known[-1][1] <= value:
            known.pop()


        if known:
            # all values in known are > current value
            # they are reverse sorted by index
            # the value with the lowest index is last
            result.append(known[-1][0])
        else:
            # no values exist this high, report -1
            result.append(-1)
        # add to end of the list to maintain invariant
        known.append( (idx, value) )

    # let numpy worry about reversing the array
    return np.array(result)[::-1]

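But it occurs to me that we can reuse the previously computed values of B instead of building a new data structure. If j > i, A[i] >= A[j], and nothing between i and j is greater than A[i], then B[i] cannot lie before B[j], so the search for B[i] can jump straight from j to B[j].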

def future8(A):
    B = [-1] * len(A)
    for index in range(len(A)-2, -1, -1):
        target = index + 1
        value = A[index]
        # <= skips equal values too, since we need A[target] > A[index]
        while target != -1 and A[target] <= value:
            target = B[target]
        B[index] = target
    return np.array(B)
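
A quick way to gain confidence in a pointer-jumping implementation like this is to compare it with a brute-force scan on data that contains repeated values. The sketch below re-implements the loop with `<=`, so that equal values are skipped and the result matches the strict A[j] > A[i] from the question. The names `future8_strict` and `brute_force` and the test data are assumptions made for this check.

```python
import numpy as np

def future8_strict(A):
    B = [-1] * len(A)
    for index in range(len(A) - 2, -1, -1):
        target = index + 1
        value = A[index]
        # follow already-computed links while the candidate is not strictly greater
        while target != -1 and A[target] <= value:
            target = B[target]
        B[index] = target
    return np.array(B)

def brute_force(A):
    out = []
    for i in range(len(A)):
        later = np.flatnonzero(A[i + 1:] > A[i])  # offsets of larger later values
        out.append(i + 1 + later[0] if later.size else -1)
    return np.array(out)

rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=200)  # small value range forces many ties
assert np.array_equal(future8_strict(A), brute_force(A))
```

With random floats ties almost never occur, so a strict versus non-strict comparison makes no visible difference there; integer data such as prices in ticks is where it matters.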

My benchmark results:

Random series:
future2 ascends  : 0.242569923401
future6 full list: 0.0363488197327
future7 vectorize: 0.129994153976
future8 reuse: 0.0299410820007
Oscillating series:
future2 ascends  : 0.233623981476
future6 full list: 0.0360488891602
future7 vectorize: 1.19140791893
future8 reuse: 0.0297570228577
Ascending trend series:
future2 ascends  : 0.120707035065
future6 full list: 0.0314049720764
future7 vectorize: 0.0640320777893
future8 reuse: 0.0246520042419

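Ascending segments

Cyborg had a very interesting idea: taking advantage of ascending segments. I don't think any of his test cases really exhibit the behaviour he was after, because the segments were not long enough to exploit. But I figure real data may well contain such segments, so exploiting them would be very useful.

But I don't think it will work. It takes O(n) time to prepare the data needed for the binary search. That would be fine if we performed the binary search many times, but once we find a value in the middle of an ascending segment, we never revisit anything to its left. As a result, even with a binary search we spend at most O(n) time processing the data.

It would pay off if building the necessary data were cheaper than scanning the ascending segment later. But the scan is already very cheap, and it is hard to come up with a way of handling ascending segments that costs less.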
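
For reference, the binary search over an ascending segment discussed here can be expressed with NumPy's `searchsorted`. The segment and the value `x` below are made up for illustration and are not taken from the benchmarks.

```python
import numpy as np

segment = np.array([1, 3, 3, 7, 9])  # an ascending run of A (illustrative values)
x = 3

# side='right' returns the first position whose element is strictly greater than x
j = segment.searchsorted(x, side='right')
print(j)  # 3, i.e. segment[3] == 7 is the first value greater than 3

# when nothing in the segment exceeds x, the result equals len(segment)
print(segment.searchsorted(9, side='right') == len(segment))  # True
```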

Answer 1 (score: 2)

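@Winston Ewert's future8 method is O(n) (!), and it beats all of our earlier proposals. To prove that it is O(n), observe that the inner while loop executes at most once for any value of B[target].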
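
One way to see this bound empirically is to count how often the inner loop runs. Every jump skips an index that no later search will visit again, so the total number of jumps stays below len(A). The instrumented function `future8_counting` below is a sketch written for this purpose (it uses `<=` so that equal values are skipped), and the three test series are made up.

```python
import numpy as np

def future8_counting(A):
    B = [-1] * len(A)
    jumps = 0  # total number of inner-loop iterations
    for index in range(len(A) - 2, -1, -1):
        target = index + 1
        while target != -1 and A[target] <= A[index]:
            target = B[target]
            jumps += 1
        B[index] = target
    return B, jumps

rng = np.random.default_rng(0)
series = [
    ("random", rng.random(10_000)),
    ("descending", np.arange(10_000, 0, -1)),
    ("ascending", np.arange(10_000)),
]
for name, A in series:
    _, jumps = future8_counting(A)
    # the last column should be True for every series
    print(name, jumps, jumps < len(A))
```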

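My answer was:

Here is a benchmark of three approaches (after a ping-pong between @Winston Ewert and me):

  1. Ascending list with binary search (future2).
  2. Full list (future6, by @Winston Ewert).
  3. numpy.vectorize (future7, an enhanced version of @Winston Ewert's future5).

Each of these is significantly faster than the others under different conditions. If the series is random, "full list" (future6) is the fastest. If the series oscillates, "ascending list" (future2) is the fastest. If the series tends to ascend, vectorize (future7) is the fastest.

If the data are stock quotes, I would go with "vectorize" (future7), because stocks have an upward trend, and because it is simple and performs reasonably under all conditions.

Output: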

    Random series:
    future2 ascends  : 0.210215095646
    future6 full list: 0.0920153693145
    future7 vectorize: 0.138747922771
    Oscillating series:
    future2 ascends  : 0.208349650159
    future6 full list: 0.940276050999
    future7 vectorize: 0.597290143496
    Ascending trend series:
    future2 ascends  : 0.131106233627
    future6 full list: 20.7201531342
    future7 vectorize: 0.0540951244451
    

Code:

    import numpy as np
    import time 
    import timeit
    
    def future2(A):    
        def reverse_enum(L):
            for index in reversed(range(len(L))):
                yield len(L)-index-1, L[index]
        def findnext(x, A, ascends): # find index of first future number greater than x
            for idx, segment in reverse_enum(ascends):
                joff=A[segment[0]:segment[1]+1].searchsorted(x,side='right') # binary search
                if joff < (segment[1]-segment[0]+1):
                    j=segment[0]+joff
                    [ascends.pop() for _ in range(idx)] # delete previous segments
                    segment[0]=j # cut beginning of segment 
                    return j
            return -1
        B = np.arange(len(A))+1
        # Note: B[i]=-1 where there is no greater value in the future.
        B[-1] = -1 # put -1 at the end
        ascends = [] # a list of pairs of indexes, ascending segments of A
        maximum = True
        for i in range(len(A)-2,-1,-1): # scan backwards
            #print(ascends)
            if A[i] < A[i+1]:
                if maximum:
                    ascends.append([i+1,i+1])
                    maximum = False
                else:
                    ascends[-1][0] = i+1
            else:# A[i] >= A[i+1]
                B[i] = findnext(A[i], A, ascends)
                maximum = True
        return B
    
    
    def future6(A):
        # list of tuples (index into A, value in A)
        # invariant: indexes and values in sorted order
        known = []
        result = []
        for idx in range(len(A) - 1, -1, -1):
            value = A[idx]
            # since known is sorted a binary search could be applied here
            # I haven't bothered
    
            # anything lower than the current value
            # cannot possibly be used again, since this value will be the first index
            # instead of those
            known = [(x,y) for x,y in known if y > value]
    
    
            if known: 
                # all values in known are > current value
                # they are reverse sorted by index
                # the value with the lowest index is last
                result.append(known[-1][0])
            else:
                # no values exist this high, report -1
                result.append(-1)
            # add to end of the list to maintain invariant
            known.append( (idx, value) )
    
        # let numpy worry about reversing the array
        return np.array(result)[::-1]
    
    
    def future7(A):
        @np.vectorize
        def values(i):
            for idx, v in enumerate(A[i+1:]): # loop is faster than genexp with exception
                if A[i]<v:
                    return idx+i+1
            return -1
        return values(np.arange(len(A)))
    
    if __name__ == '__main__':
        print('Random series:')
        tsetup = """import future; import numpy; A = numpy.random.random(10000)"""
        t = timeit.timeit('future.future2(A)', tsetup, number=3)
        print('future2 ascends  : '+str(t))
        t = timeit.timeit('future.future6(A)', tsetup, number=3)
        print('future6 full list: '+str(t))
        t = timeit.timeit('future.future7(A)', tsetup, number=3)
        print('future7 vectorize: '+str(t))
    
        print('Oscillating series:')
        tsetup = """import future; import numpy; A = numpy.random.randint(100000,size=10000)-50000; A = A.cumsum()"""
        t = timeit.timeit('future.future2(A)', tsetup, number=3)
        print('future2 ascends  : '+str(t))
        t = timeit.timeit('future.future6(A)', tsetup, number=3)
        print('future6 full list: '+str(t))
        t = timeit.timeit('future.future7(A)', tsetup, number=3)
        print('future7 vectorize: '+str(t))
    
        print('Ascending trend series:')
        tsetup = """import future; import numpy; A = numpy.random.randint(100000,size=10000)-30000; A = A.cumsum()"""
        t = timeit.timeit('future.future2(A)', tsetup, number=3)
        print('future2 ascends  : '+str(t))
        t = timeit.timeit('future.future6(A)', tsetup, number=3)
        print('future6 full list: '+str(t))
        t = timeit.timeit('future.future7(A)', tsetup, number=3)
        print('future7 vectorize: '+str(t))