Rolling comparison between a value and a past window, with percentile/quantile

Date: 2018-11-03 10:26:42

Tags: python arrays numpy data-analysis moving-average

I would like to compare each value x of an array with a rolling window of the n previous values. More precisely, I would like to see at which percentile this new value x would sit if we added it to the previous window:

import numpy as np
A = np.array([1, 4, 9, 28, 28.5, 2, 283, 3.2, 7, 15])
print(A)
n = 4  # window width
for i in range(len(A)-n):
    W = A[i:i+n]
    x = A[i+n]
    q = sum(W <= x) * 1.0 / n
    print('Value:', x, ' Window before this value:', W, ' Quantile:', q)
  

[1. 4. 9. 28. 28.5 2. 283. 3.2 7. 15.]
  Value: 28.5  Window before this value: [1. 4. 9. 28.]  Quantile: 1.0
  Value: 2.0  Window before this value: [4. 9. 28. 28.5]  Quantile: 0.0
  Value: 283.0  Window before this value: [9. 28. 28.5 2.]  Quantile: 1.0
  Value: 3.2  Window before this value: [28. 28.5 2. 283.]  Quantile: 0.25
  Value: 7.0  Window before this value: [28.5 2. 283. 3.2]  Quantile: 0.5
  Value: 15.0  Window before this value: [2. 283. 3.2 7.]  Quantile: 0.75

Question: what is the name of this computation? Is there a clever NumPy way to compute it more efficiently on arrays of millions of items (n can be ~5000)?


Note: here is a simulation with 1M items and n = 5000, but it takes about 2 hours:

import numpy as np
A = np.random.random(1000*1000)  # the following is not very interesting with a [0,1]
n = 5000                         # uniform random variable, but anyway...
Q = np.zeros(len(A)-n)
for i in range(len(Q)):
    Q[i] = sum(A[i:i+n] <= A[i+n]) * 1.0 / n
    if i % 100 == 0: 
        print("%.2f %% already done." % (i * 100.0 / len(A)))

print(Q)

Note: this is not the same question as How to compute moving (or rolling, if you will) percentile/quantile for a 1d array in numpy?

5 Answers:

Answer 0 (score: 2)

Your code is so slow because you are using Python's built-in sum() instead of numpy.sum() or numpy.array.sum(); Python's sum() has to convert all the raw values into Python objects before doing the calculation, which is really slow. Just change sum(...) to np.sum(...) or (...).sum() and the runtime drops to under 20 seconds.
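A minimal sketch of that change on the small example from the question (same loop, with only sum swapped for np.sum):

```python
import numpy as np

A = np.array([1, 4, 9, 28, 28.5, 2, 283, 3.2, 7, 15])
n = 4  # window width
Q = np.zeros(len(A) - n)
for i in range(len(Q)):
    # np.sum keeps the boolean comparison inside NumPy instead of
    # converting each element to a Python object
    Q[i] = np.sum(A[i:i+n] <= A[i+n]) / n
print(Q)  # → [1.   0.   1.   0.25 0.5  0.75]
```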

Answer 1 (score: 1)

You can use np.lib.stride_tricks.as_strided, as in the accepted answer of the linked question. With the first example you give, it is quite easy to understand:

A = np.array([1, 4, 9, 28, 28.5, 2, 283, 3.2, 7, 15])
n=4
print(np.lib.stride_tricks.as_strided(A, shape=(n, A.size-n),
                                      strides=(A.itemsize, A.itemsize)))
# you get the A.size-n columns of the n rolling elements
array([[  1. ,   4. ,   9. ,  28. ,  28.5,   2. ],
       [  4. ,   9. ,  28. ,  28.5,   2. , 283. ],
       [  9. ,  28. ,  28.5,   2. , 283. ,   3.2],
       [ 28. ,  28.5,   2. , 283. ,   3.2,   7. ]])

Now to do the computation, you can compare that array with A[n:], sum over the rows, and divide by n:

print ((np.lib.stride_tricks.as_strided(A, shape=(n,A.size-n),
                                        strides=(A.itemsize,A.itemsize)) 
          <= A[n:]).sum(0)/(1.*n))
[1.   0.   1.   0.25 0.5  0.75]  # same answer

The problem now is the size of your data (several million items, with n around 5000); I'm not sure you can use this method directly on the full array. One possibility is to chunk the data. Let's define a function:

def compare_strides (arr, n):
   return (np.lib.stride_tricks.as_strided(arr, shape=(n,arr.size-n),
                                           strides=(arr.itemsize,arr.itemsize)) 
            <= arr[n:]).sum(0)

then do the computation per chunk and assemble the results with np.concatenate, not forgetting to divide by n:

nb_chunk = 1000  # this number depends on the capacity of your computer;
                 # not sure how to optimize it
Q = np.concatenate([compare_strides(A[chunk*nb_chunk:(chunk+1)*nb_chunk+n],n) 
                    for chunk in range(0, A[n:].size//nb_chunk + 1)])/(1.*n)

I can't run the 1M / n=5000 test, but on a 5000 / n=100 one, see the difference with timeit:

A = np.random.random(5000)
n = 100

%%timeit
Q = np.zeros(len(A)-n)
for i in range(len(Q)):
    Q[i] = sum(A[i:i+n] <= A[i+n]) * 1.0 / n

#1 loop, best of 3: 6.75 s per loop

%%timeit
nb_chunk = 100
Q1 = np.concatenate([compare_strides(A[chunk*nb_chunk:(chunk+1)*nb_chunk+n],n) 
                    for chunk in range(0, A[n:].size//nb_chunk + 1)])/(1.*n)

#100 loops, best of 3: 7.84 ms per loop

# check for equality
print ((Q == Q1).all())
Out[33]: True

See the time difference, from 6750 ms down to 7.84 ms. Hopefully it scales to the bigger data.
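As a side note not in the original answer: NumPy 1.20+ ships np.lib.stride_tricks.sliding_window_view, which builds the same windows without manual stride arithmetic (the comparison still materializes a large boolean array, so chunking as above would still be needed at 1M × 5000). A sketch on the question's small example:

```python
import numpy as np

A = np.array([1, 4, 9, 28, 28.5, 2, 283, 3.2, 7, 15])
n = 4
# One row per window A[i:i+n]; drop the last window, which has no following value.
W = np.lib.stride_tricks.sliding_window_view(A, n)[:-1]
# Compare each window (row) against the value that follows it.
Q = (W <= A[n:, None]).sum(axis=1) / n
print(Q)  # → [1.   0.   1.   0.25 0.5  0.75]
```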

Answer 2 (score: 1)

Using np.sum instead of sum was already mentioned, so my only remaining suggestion is to additionally consider using pandas and its rolling-window functions, to which you can apply any arbitrary function:

import numpy as np
import pandas as pd

A = np.random.random(1000*1000)
df = pd.DataFrame(A)
n = 5000

def fct(x):
    return np.sum(x[:-1] <= x[-1]) * 1.0 / (len(x)-1)

percentiles = df.rolling(n+1).apply(fct)
print(percentiles)
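A hedged tweak to the snippet above: passing raw=True to rolling.apply (available in pandas 0.23+) hands fct a plain ndarray instead of a Series, which typically speeds up the per-window calls considerably. Shown here on the question's small example:

```python
import numpy as np
import pandas as pd

A = np.array([1, 4, 9, 28, 28.5, 2, 283, 3.2, 7, 15])
n = 4
s = pd.Series(A)

def fct(x):
    # with raw=True, x is a plain numpy array of length n+1
    return np.sum(x[:-1] <= x[-1]) / (len(x) - 1)

percentiles = s.rolling(n + 1).apply(fct, raw=True)
# the first n entries are NaN (incomplete windows)
print(percentiles.dropna().to_numpy())  # → [1.   0.   1.   0.25 0.5  0.75]
```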

Answer 3 (score: 1)

Another benchmark: a comparison between this solution and this solution:

import numpy as np, time

A = np.random.random(1000*1000)
n = 5000

def compare_strides (arr, n):
   return (np.lib.stride_tricks.as_strided(arr, shape=(n,arr.size-n), strides=(arr.itemsize,arr.itemsize)) <= arr[n:]).sum(0)

# Test #1: with strides ===> 11.0 seconds
t0 = time.time()
nb_chunk = 10*1000
Q = np.concatenate([compare_strides(A[chunk*nb_chunk:(chunk+1)*nb_chunk+n], n) for chunk in range(0, A[n:].size//nb_chunk + 1)])/(1.*n)
print(time.time() - t0, Q)

# Test #2: with just np.sum ===> 18.0 seconds
t0 = time.time()
Q2 = np.zeros(len(A)-n)
for i in range(len(Q2)):
    Q2[i] = np.sum(A[i:i+n] <= A[i+n])
Q2 *= 1.0 / n  # here the multiplication is vectorized; if instead, we move this multiplication to the previous line: np.sum(A[i:i+n] <= A[i+n]) * 1.0 / n, it is 6 seconds slower
print(time.time() - t0, Q2)

print(all(Q == Q2))

There is also another (better) way, using numba and its @jit decorator. Then it is much faster: only 5.4 seconds!

from numba import jit
import numpy as np

@jit  # if you remove this line, it is much slower (similar to Test #2 above)
def doit():
    A = np.random.random(1000*1000)
    n = 5000
    Q2 = np.zeros(len(A)-n)
    for i in range(len(Q2)):
        Q2[i] = np.sum(A[i:i+n] <= A[i+n])
    Q2 *= 1.0/n
    print(Q2)

doit()

When adding numba parallelization, it is even faster: 1.8 seconds!

import numpy as np
from numba import jit, prange

@jit(parallel=True)
def doit(A, Q, n):
    for i in prange(len(Q)):
        Q[i] = np.sum(A[i:i+n] <= A[i+n])

A = np.random.random(1000*1000)
n = 5000
Q = np.zeros(len(A)-n)    
doit(A, Q, n)

Answer 4 (score: -1)

You could use np.quantile instead of sum(A[i:i+n] <= A[i+n]) * 1.0 / n. That might be good enough. Not sure there is really a better approach for your problem.
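For what it's worth, the quantity the question computes is usually called the percentile rank of x within the window. Note that np.quantile maps the other way (quantile → value), so it does not directly replace the sum; an equivalent rank can instead be read off a sorted window with np.searchsorted. A sketch on one window from the question:

```python
import numpy as np

W = np.array([28.0, 28.5, 2.0, 283.0])  # window before x
x = 3.2
# fraction of window values <= x, via binary search in the sorted window;
# equivalent to np.sum(W <= x) / len(W)
q = np.searchsorted(np.sort(W), x, side='right') / len(W)
print(q)  # → 0.25
```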