Question

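I have a DataFrame with three columns representing a group, a time, and a value. I want to compute the rolling mean, standard deviation, and so on within each group. Currently I define a function and use apply, but on very large datasets this is very slow. Here is the function.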

def GetRollingMetrics(x, cols, windows, suffix):
    for col in cols:
        for win in windows:
            x[col + '_' + str(win) + 'D' + '_mean' + '_' + suffix] = x.shift(1).rolling(win)[col].mean()
            x[col + '_' + str(win) + 'D' + '_std' + '_' + suffix] = x.shift(1).rolling(win)[col].std()
            x[col + '_' + str(win) + 'D' + '_min' + '_' + suffix] = x.shift(1).rolling(win)[col].min()
            x[col + '_' + str(win) + 'D' + '_max' + '_' + suffix] = x.shift(1).rolling(win)[col].max()

    return x

Then I apply it. As an example, I use:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(1000000, 3)), columns=['Group','Time','Value'])
df.sort_values(by='Time', inplace=True)
df = df.groupby('Group').apply(lambda x: GetRollingMetrics(x, ['Value'], [7,14,28], 'my_suffix'))

Is there a more pandas-idiomatic or more efficient way to do this?

Answer 1

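I'm not sure about the speed, but you can definitely use pd.concat instead of df.apply. You can also compute the rolling statistics for all columns at once; you don't have to do it one column at a time.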

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(1000000, 3)), 
                     columns=['Group','Time','Value'])
df.sort_values(by='Time', inplace=True)

suffix = 'my_suffix'
windows = [7, 14, 28]
df = df.groupby('Group')

# For each statistic: compute it for every window, rename the columns
# to include the window and suffix, then concatenate side by side.
# axis is passed by keyword; recent pandas versions reject it positionally.
d1 = pd.concat([df.rolling(w).mean()
                  .rename(columns=lambda x: x + '_' + str(w) + 'D_mean_' + suffix)
                for w in windows], axis=1)
d2 = pd.concat([df.rolling(w).std()
                  .rename(columns=lambda x: x + '_' + str(w) + 'D_std_' + suffix)
                for w in windows], axis=1)
d3 = pd.concat([df.rolling(w).min()
                  .rename(columns=lambda x: x + '_' + str(w) + 'D_min_' + suffix)
                for w in windows], axis=1)
d4 = pd.concat([df.rolling(w).max()
                  .rename(columns=lambda x: x + '_' + str(w) + 'D_max_' + suffix)
                for w in windows], axis=1)

out = pd.concat([d1, d2, d3, d4], axis=1)

Performance

1 loop, best of 3: 9.9 s per loop
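
The question's function also shifts each group by one row before rolling, which the snippet above leaves out. Below is a minimal sketch of how that step might be folded into the same apply-free approach. The helper name grouped_rolling_stats and its parameters are hypothetical, and it assumes the DataFrame has a unique index so that results can be aligned back to the original rows by label. It has not been benchmarked against the timing above.

```python
import numpy as np
import pandas as pd


def grouped_rolling_stats(df, group_col, value_col, windows, suffix,
                          stats=("mean", "std", "min", "max")):
    """Hypothetical helper: per-group rolling stats over the previous `win` rows.

    Assumes `df` has a unique index, so results can be aligned back by label.
    """
    # Shift within each group so a row never contributes to its own window.
    shifted = df.groupby(group_col)[value_col].shift(1)
    new_cols = {}
    for win in windows:
        roller = shifted.groupby(df[group_col]).rolling(win)
        for stat in stats:
            # Result is indexed by (group, original label); drop the group level.
            result = getattr(roller, stat)().droplevel(0)
            new_cols[f"{value_col}_{win}D_{stat}_{suffix}"] = result
    # Building the frame on df.index realigns every column with the original rows.
    return df.join(pd.DataFrame(new_cols, index=df.index))


# Usage on a smaller version of the question's example data.
df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 3)),
                  columns=['Group', 'Time', 'Value'])
df = df.sort_values(by='Time')
out = grouped_rolling_stats(df, 'Group', 'Value', [7, 14, 28], 'my_suffix')
```

The column names follow the question's pattern (for example Value_7D_mean_my_suffix), so the output should be a drop-in replacement as long as the unique-index assumption holds.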

Answer 2

I refactored your function to use agg(), so that we can prepare all the data for each window in one go:

def GetRollingMetrics(x, cols, windows, suffix):
    for win in windows:
        aggs = {col: ['mean', 'std', 'min', 'max'] for col in cols}
        df = x.shift(1).rolling(win).agg(aggs)
        # the real work is done, just copy the columns into x
        for col in cols:
            prefix = col + '_' + str(win) + 'D'
            for stat in ('mean', 'std', 'min', 'max'):
                x['_'.join((prefix, stat, suffix))] = df[col][stat]
    return x

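This is faster if you have multiple columns. With only a single column it does not seem to be much faster. There is definitely room for improvement in the for stat loop: the copying takes about half the time. Perhaps you could rename the columns instead, and concatenate the results afterwards?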
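
The rename-and-concatenate idea could look roughly like the sketch below. The function name get_rolling_metrics_concat is hypothetical, and whether it actually removes the copying cost has not been measured here. It relies on DataFrame.rolling(...).agg(...) with a list of statistics returning columns as (column, statistic) pairs, which are then flattened into the question's naming scheme.

```python
import pandas as pd


def get_rolling_metrics_concat(x, cols, windows, suffix):
    """Hypothetical variant: rename aggregated columns, concatenate once."""
    stats = ['mean', 'std', 'min', 'max']
    # Shift once, and only the columns we need.
    shifted = x[cols].shift(1)
    frames = [x]
    for win in windows:
        rolled = shifted.rolling(win).agg(stats)
        # Flatten the (column, statistic) pairs into flat names.
        rolled.columns = [
            '_'.join((col + '_' + str(win) + 'D', stat, suffix))
            for col, stat in rolled.columns
        ]
        frames.append(rolled)
    # A single concat replaces the per-column assignments into x.
    return pd.concat(frames, axis=1)


# Small usage example on a single group.
x = pd.DataFrame({'Value': [1.0, 2.0, 3.0, 4.0, 5.0]})
result = get_rolling_metrics_concat(x, ['Value'], [2], 'my_suffix')
```

It can be passed to groupby('Group').apply in the same way as the original function. Unlike the original, it returns a new frame rather than modifying x in place.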

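If you are desperate to speed this up, you should consider Numba, which lets you implement a one-pass min/max/sum that you can then use for all of the rolling calculations. I have done this before, and you can get all four calculations done in not much more time than a single one (because the expensive part is loading the data into cache).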
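
As an illustration only, and not the answerer's actual implementation, here is a minimal sketch of a function that produces all four statistics while reading each window once. The name rolling_stats is hypothetical. It assumes a float64 array without NaN values, and it rescans every window, so the cost grows with the window size rather than being a true streaming update. If Numba is not installed, the sketch falls back to plain Python, which runs but is slow.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without Numba (much slower).
    def njit(func):
        return func


@njit
def rolling_stats(values, win):
    """Return an (n, 4) array of rolling mean, std, min, max.

    Assumes `values` is a 1-D float64 array with no NaN entries.
    Rows before the first full window are left as NaN.
    """
    n = values.shape[0]
    out = np.full((n, 4), np.nan)
    for i in range(win - 1, n):
        start = i - win + 1
        lo = values[start]
        hi = values[start]
        mean = 0.0
        m2 = 0.0
        count = 0
        # One read of each window element updates all four statistics.
        for j in range(start, i + 1):
            v = values[j]
            if v < lo:
                lo = v
            if v > hi:
                hi = v
            # Welford update for the running mean and squared deviations.
            count += 1
            delta = v - mean
            mean += delta / count
            m2 += delta * (v - mean)
        out[i, 0] = mean
        if win > 1:
            out[i, 1] = np.sqrt(m2 / (win - 1))  # sample std, ddof=1
        out[i, 2] = lo
        out[i, 3] = hi
    return out


stats = rolling_stats(np.array([1.0, 3.0, 2.0, 5.0, 4.0]), 3)
```

To reproduce the question's semantics, this would be applied per group to the values shifted by one row, and the four output columns would then be assigned under the desired names.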

Speeding up rolling mean/std computation in a grouped pandas DataFrame