Optimizing a DataFrame groupby-rolling-apply and adding the result back to the original DataFrame as a column

Asked: 2019-02-07 09:51:34

Tags: python pandas pandas-groupby

I have a DataFrame that holds several time series. A dummy example:

import pandas as pd

df = pd.DataFrame({
    'node': [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
    'seq': [0,1,2,3,4,5,6,7,8,9] * 2,
    'values': [1,2,4,8,16,32,64,128,256,512] * 2,
}).sample(frac=1).reset_index(drop=True)

The data is unordered, which is why I add .sample(...).reset_index(...) to shuffle it. The example DataFrame looks something like this:

    node  seq  values
0      2    0       1
1      1    5      32
2      2    3       8
3      1    9     512
4      1    4      16
5      2    2       4
6      1    2       4
7      1    7     128
8      1    6      64
9      1    0       1
10     2    9     512
11     2    1       2
12     1    8     256
13     1    1       2
14     2    5      32
15     2    7     128
16     1    3       8
17     2    6      64
18     2    8     256
19     2    4      16

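At the preprocessing stage, the two series in the example are independent of each other. Now, for example, I want to add a column with a rolling mean. This is how I do it at the moment: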

roll_mean = df.groupby('node', as_index=False) \
    .apply(lambda g: g.sort_values('seq')['values'].rolling(4).mean()) \
    .reset_index(level=0)['values']

# add column
df['rollMean4'] = roll_mean

Is there a better way?

1 Answer:

Answer 0 (score: 0)

You can call sort_values before groupby, which makes the apply unnecessary:

import numpy as np
import pandas as pd

np.random.seed(2019)

df = pd.DataFrame({
    'node': [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
    'seq': [0,1,2,3,4,5,6,7,8,9] * 2,
    'values': [1,2,4,8,16,32,64,128,256,512] * 2,
}).sample(frac=1).reset_index(drop=True) 
#print (df)

When sorting by a single column, sort_values uses quicksort by default, which is not a stable sort:

roll_mean = (df.groupby('node', as_index=False) 
               .apply(lambda g: g.sort_values('seq')['values'].rolling(4).mean()) 
               .reset_index(level=0)['values'])

If the sort column contains duplicate values, as real data may, set kind='mergesort' so that tied rows keep their original order:

roll_mean = (df.groupby('node', as_index=False) 
               .apply(lambda g: g.sort_values('seq', kind='mergesort')['values']
                                  .rolling(4).mean()) 
               .reset_index(level=0)['values'])
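
To make the point about ties concrete, here is a minimal sketch on a hypothetical toy frame (the names `toy`, `tag`, and `stable` are illustrative and not part of the original answer). It shows only the guarantee a stable sort gives: rows with equal `seq` keep their original relative order. The default quicksort makes no such guarantee, even though on a frame this small it may happen to produce the same order.

```python
import pandas as pd

# Hypothetical toy frame: 'seq' contains ties, 'tag' records the original row order.
toy = pd.DataFrame({
    'seq': [1, 0, 1, 0, 1, 0],
    'tag': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# A stable sort keeps tied rows in their original relative order.
stable = toy.sort_values('seq', kind='mergesort')

print(stable['tag'].tolist())  # ['b', 'd', 'f', 'a', 'c', 'e']
```

Because a rolling window depends on row order, which tied row ends up first decides which row receives which window result.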

When sorting by multiple columns, the kind argument is ignored and the sort is stable, so it behaves like mergesort:

roll_mean1 = (df.sort_values(['seq','node'])
               .groupby('node', as_index=False)['values']
               .rolling(4)
               .mean() 
               .reset_index(level=0, drop=True))


# add column
df['rollMean4'] = roll_mean
df['rollMean41'] = roll_mean1

print (df)
    node  seq  values  rollMean4  rollMean41
0      1    9     512     240.00      240.00
1      2    4      16       7.50        7.50
2      1    1       2        NaN         NaN
3      2    1       2        NaN         NaN
4      1    6      64      30.00       30.00
5      1    2       4        NaN         NaN
6      2    6      64      30.00       30.00
7      1    4      16       7.50        7.50
8      1    3       8       3.75        3.75
9      2    7     128      60.00       60.00
10     2    9     512     240.00      240.00
11     1    7     128      60.00       60.00
12     2    3       8       3.75        3.75
13     1    0       1        NaN         NaN
14     2    0       1        NaN         NaN
15     2    2       4        NaN         NaN
16     2    5      32      15.00       15.00
17     1    5      32      15.00       15.00
18     2    8     256     120.00      120.00
19     1    8     256     120.00      120.00
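
As a cross-check of the sort-then-group idea, the following is a small self-contained sketch rather than the answer's exact code. It assumes the default `as_index=True`, so the rolling result carries a (node, original index) MultiIndex, and it uses `random_state=0` instead of `np.random.seed` purely to make the shuffle reproducible. The spot check relies on the fact that for node 1 the first full window covers the values 1, 2, 4, 8.

```python
import pandas as pd

df = pd.DataFrame({
    'node': [1] * 10 + [2] * 10,
    'seq': list(range(10)) * 2,
    'values': [2 ** i for i in range(10)] * 2,
}).sample(frac=1, random_state=0).reset_index(drop=True)

# Sort once, then compute the rolling mean per node.
# The result has a (node, original index) MultiIndex; dropping the node
# level leaves the original row labels, so the assignment aligns by index.
rolled = (df.sort_values(['node', 'seq'])
            .groupby('node')['values']
            .rolling(4)
            .mean()
            .reset_index(level=0, drop=True))

df['rollMean4'] = rolled

# Spot check: node 1, seq 3 averages 1, 2, 4 and 8.
check = df.loc[(df['node'] == 1) & (df['seq'] == 3), 'rollMean4'].item()
print(check)  # 3.75
```

The assignment works regardless of row order because pandas aligns the Series to the DataFrame on index labels, not on position.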

A new sample, in which each seq value occurs several times per node:

np.random.seed(2019)

df = pd.DataFrame({
    'node': [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2] * 10,
    'seq': [0,1,2,3,4,5,6,7,8,9] * 20,
    'values': [1,2,4,8,16,32,64,128,256,512] * 20,
}).sample(frac=1).reset_index(drop=True) 
print (df)

roll_mean1 = (df.groupby('node', as_index=False) 
               .apply(lambda g: g.sort_values('seq')['values'].rolling(4).mean()) 
               .reset_index(level=0)['values'])

roll_mean2 = (df.groupby('node', as_index=False) 
               .apply(lambda g: g.sort_values('seq', kind='mergesort')['values'].rolling(4).mean()) 
               .reset_index(level=0)['values'])

roll_mean3 = (df.sort_values(['seq','node'])
               .groupby('node', as_index=False)['values']
               .rolling(4)
               .mean() 
               .reset_index(level=0, drop=True))

# add column
df['rollMean41'] = roll_mean1
df['rollMean42'] = roll_mean2
df['rollMean43'] = roll_mean3
print (df.head())
   node  seq  values  rollMean41  rollMean42  rollMean43
0     2    3       8         5.0         5.0         5.0
1     2    7     128       128.0        80.0        80.0
2     2    0       1         1.0         NaN         NaN
3     2    8     256       160.0       160.0       160.0
4     1    5      32        20.0        20.0        20.0
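
In this sample `values` is fully determined by `seq`, so the sorted sequence of values is the same whichever sort is used. The columns appear to differ only in which of the tied rows receives which window result. The sketch below is one possible alternative, not taken from the answer: a stable sort followed by `transform`, which returns a result already aligned to the row labels, so no `reset_index` is needed. The sample is rebuilt with `random_state=0` (an assumption made for reproducibility), so its row order differs from the output above.

```python
import pandas as pd

# Each node contains every seq value ten times, as in the sample above.
df = pd.DataFrame({
    'node': ([1] * 10 + [2] * 10) * 10,
    'seq': list(range(10)) * 20,
    'values': [2 ** i for i in range(10)] * 20,
}).sample(frac=1, random_state=0).reset_index(drop=True)

# Stable sort by seq, then a per-node rolling mean via transform.
# transform keeps the row labels, so the assignment aligns by index.
df['rollMean4'] = (df.sort_values('seq', kind='mergesort')
                     .groupby('node')['values']
                     .transform(lambda s: s.rolling(4).mean()))

# Each node has exactly three incomplete windows, hence three NaN rows.
print(df.groupby('node')['rollMean4'].apply(lambda s: s.isna().sum()))
```

With the stable sort, the three NaN rows in each node are the three seq == 0 rows that come first in the original row order, which makes the result reproducible from run to run.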