Slow execution of pandas min/max apply over a variable rolling window

Date: 2017-04-10 19:38:04

Tags: python performance pandas numpy max

I compute values over a time series (represented by myvalues). The code below identifies where events occur (cross_indices) and then looks back over the last 8 events (n_crosses). The index of the 8th-most-recent cross, relative to each row, is stored in the series max_lookback.

The code takes only ~0.5 s to set max_lookback. However, when I run pd.apply() to get the min and max of myvalues from the current index back to max_lookback, it takes ~22 s to run.

I thought apply() was supposed to iterate over rows faster than a for loop. Why does this code take so long to execute, and how can it be sped up dramatically?
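For context, `Series.apply` invokes the supplied Python function once per element rather than vectorizing it; a minimal sketch (tiny hypothetical series) showing the per-element calls:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))

calls = []
def double(x):
    calls.append(x)  # record every element that apply visits
    return x * 2

result = s.apply(double)
print(len(calls))        # 5 -- one Python-level call per element
print(result.tolist())   # [0, 2, 4, 6, 8]
```

In the question's code, each of those per-row calls additionally pays for an `iloc` slice and a `min`/`max` over it, which is where the 22 seconds go.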

Program output:

total time of minmax is 22.469 seconds.
total runtime is 22.93 seconds.

import pandas as pd
import numpy as np
import timeit

complete_start = timeit.default_timer()
indices = pd.Series( range(20000), name='Index')
sample_from = np.append(np.zeros(9), 1) #10% odds of selecting 1
cross = pd.Series( np.random.choice( sample_from, size=len(indices) ), name='Cross' )
cross_indices = np.flatnonzero( cross )
n_crosses = 8

def set_max_lookback(index):
    #get integer indices where crosses occurred at or before this row
    sub = cross_indices[ cross_indices <= index ]

    if len( sub ) < n_crosses:
        return int( 0 )

    return int( sub[ len(sub) - n_crosses ] )

max_lookback = pd.Series( indices.apply( set_max_lookback ), name='MaxLookback' )

start = timeit.default_timer()
myvalues = pd.Series( np.random.randint(-100,high=100, size=len(indices) ), name='Random' )

def minmax_of_zero_crosses(index):

    sub = myvalues.iloc[ max_lookback[index] : index+1 ]
    return ( sub.min(), sub.max() )

minmax_as_tuple_series = pd.Series( indices.apply( minmax_of_zero_crosses ), name='Min' )
minmax_df = pd.DataFrame( minmax_as_tuple_series.tolist() )
minmax_df.columns = [ 'Min', 'Max' ]
maxz = minmax_df['Max']
minz = minmax_df['Min']
end = timeit.default_timer()
print('total time of minmax is ' + str(end-start) + ' seconds.')
complete_end = timeit.default_timer()
print('total runtime is ' + str(complete_end-complete_start) + ' seconds.')

Edit 1

Per Mitch's comment, I double-checked how max_lookback is set. With n_crosses = 3, you can see that the correct index, 19,981, is selected for row 19,995. The column labels not shown in the image are index, myvalues, cross, and max_lookback.

df = pd.DataFrame([myvalues, cross, max_lookback, maxz, minz ] ).transpose()
print(df.tail(n=60))

n_crosses=3

Using the image as an example: for row 19,999, I want to find the min/max of myvalues between row 19,981 (the max_lookback column) and row 19,999, which are -95 and +97.
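To make the window concrete, here is a toy version of the same lookup (hypothetical values chosen to mirror the -95/+97 case above):

```python
import pandas as pd

# small stand-in for myvalues, with made-up numbers
myvalues = pd.Series([5, -95, 40, 97, -3])

# for "row" 4 with a lookback index of 1, the window spans rows 1..4 inclusive
window = myvalues.iloc[1:4 + 1]
print(window.min(), window.max())  # -95 97
```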

2 Answers:

Answer 0 (score: 1)

apply is generally not a very efficient solution, since it is really just a for loop itself.
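To illustrate the point, `apply` produces exactly the same result as an explicit Python loop, and neither benefits from NumPy's C-level loops the way a true vectorized operation does (a small sketch; sizes are arbitrary):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1000))

applied = s.apply(lambda x: x + 1)        # Python-level call per element
looped = pd.Series([x + 1 for x in s])    # an explicit loop does the same work
vectorized = s + 1                        # pushes the loop into C

print(applied.equals(looped), applied.equals(vectorized))  # True True
```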

A vectorized approach:

indices = pd.Series(range(20000))
sample_from = np.append(np.zeros(9), 1) #10% odds of selecting 1
cross = pd.Series(np.random.choice(sample_from, size=indices.size))
myvalues = pd.DataFrame(dict(Random=np.random.randint(-100, 
                                                      100,                       
                                                      size=indices.size)))

n_crosses = 8
nonzeros = np.flatnonzero(cross)
diffs = (nonzeros-np.roll(nonzeros, n_crosses-1)).clip(0)
myvalues['lower'] = np.nan
myvalues.loc[nonzeros, 'lower'] = diffs
myvalues.lower = ((myvalues.index.to_series() - myvalues.lower)
                   .fillna(method='ffill')
                   .fillna(0).astype(int))
myvalues.loc[:(cross.cumsum() < n_crosses).sum()+1, 'lower'] = 0

reducer = np.empty((myvalues.shape[0]*2,), dtype=myvalues.lower.dtype)
reducer[::2] = myvalues.lower.values
reducer[1::2] = myvalues.index.values + 1
myvalues.loc[myvalues.shape[0]] = [0,0]
minmax_df = pd.DataFrame(
    {'min':np.minimum.reduceat(myvalues.Random.values, reducer)[::2],
     'max':np.maximum.reduceat(myvalues.Random.values, reducer)[::2]}
)

This produces the same min/max DataFrame as your current solution. The basic idea is to generate the boundaries of the min/max window for each index in myvalues, and then use ufunc.reduceat to compute those mins/maxes.
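The interleaving trick can be shown on a toy array (names and values are illustrative): even positions of `reducer` hold each row's lower bound, odd positions hold `row index + 1`, and the trailing `[::2]` keeps only the reductions over the `[lower, upper)` windows:

```python
import numpy as np

a = np.array([3, -1, 4, -5, 2, 6])
lower = np.array([0, 0, 1, 2])  # window start per row
upper = np.array([1, 2, 3, 4])  # row index + 1 (exclusive end)

reducer = np.empty(lower.size * 2, dtype=int)
reducer[::2] = lower
reducer[1::2] = upper

# reduceat reduces a[indices[i]:indices[i+1]] for each i;
# the odd-position results are discarded by the [::2]
mins = np.minimum.reduceat(a, reducer)[::2]
print(mins.tolist())  # [3, -1, -1, -5]
```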

On my machine, your current solution takes ~8.1 s per loop, while the solution above takes ~7.9 ms per loop, roughly a 1025x speedup.

Answer 1 (score: 0)

This answer builds on Mitch's excellent work. I've added comments to the code, since it took me a significant amount of time to understand the solution. I also found a couple of small issues.

The solution depends on numpy's reduceat function.

import pandas as pd
import numpy as np

indices = pd.Series(range(20000))
sample_from = np.append(np.zeros(2), 1) #1-in-3 odds of selecting 1
cross = pd.Series(np.random.choice(sample_from, size=indices.size))
myvalues = pd.DataFrame(dict(Random=np.random.randint(-100, 
                                                      100,                       
                                                      size=indices.size)))

n_crosses = 3

#work only with the rows where a cross occurred
nonzeros = np.flatnonzero(cross)

#find the number of rows between each cross
diffs = (nonzeros-np.roll(nonzeros, n_crosses-1)).clip(0)

myvalues['lower'] = np.nan
myvalues.loc[nonzeros, 'lower'] = diffs

#set the index where a cross occurred
myvalues.lower = myvalues.index.to_series() - myvalues.lower

#fill the NA values with the previous cross index
myvalues.lower = myvalues.lower.fillna(method='ffill')
#fill the NaN values at the top of the series with 0
myvalues.lower = myvalues.lower.fillna(0).astype(int)

#set lower to 0 where crosses < n_crosses at the head of the Series
myvalues.loc[:(cross.cumsum() < n_crosses).sum()+1, 'lower'] = 0

#create a numpy array that lists the start and end index of events for each
# row in alternating order
reducer = np.empty((myvalues.shape[0]*2,), dtype=myvalues.lower.dtype)
reducer[::2] = myvalues.lower
reducer[1::2] = indices+1
reducer[len(reducer)-1] = indices[len(indices)-1]


myvalues['Cross'] = cross

#use reduceat to dramatically lower total execution time
myvalues['MinZ'] = np.minimum.reduceat( myvalues.iloc[:,0], reducer )[::2]
myvalues['MaxZ'] = np.maximum.reduceat( myvalues.iloc[:,0], reducer )[::2]

lastRow = len(myvalues)-1

#reduceat does not correctly identify the minimum and maximum on the last row
#if a new min/max occurs on that row. This is a manual override

if myvalues.loc[lastRow,'MinZ'] >= myvalues.iloc[lastRow, 0]:
    myvalues.loc[lastRow,'MinZ'] = myvalues.iloc[lastRow, 0]

if myvalues.loc[lastRow,'MaxZ'] <= myvalues.iloc[lastRow, 0]:
    myvalues.loc[lastRow,'MaxZ'] = myvalues.iloc[lastRow, 0]

print( myvalues.tail(n=60) )
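As a sanity check (not part of the original answer), the reduceat result can be compared against a naive per-row loop on small random data, including the manual override for the last row described above. The sizes and lower bounds here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
vals = rng.integers(-100, 100, size=n)
# hypothetical lower bound for each row, anywhere from 0 up to the row itself
lower = np.array([rng.integers(0, i + 1) for i in range(n)])

reducer = np.empty(n * 2, dtype=int)
reducer[::2] = lower
reducer[1::2] = np.arange(n) + 1
reducer[-1] = n - 1  # reduceat indices must be < n

fast_min = np.minimum.reduceat(vals, reducer)[::2]
fast_min[-1] = min(fast_min[-1], vals[-1])  # manual last-row override

# straightforward loop for comparison
naive_min = np.array([vals[lo:i + 1].min() for i, lo in enumerate(lower)])
print(np.array_equal(fast_min, naive_min))  # True
```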