I have a DataFrame that contains several time series. A dummy example:
import pandas as pd

df = pd.DataFrame({
'node': [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
'seq': [0,1,2,3,4,5,6,7,8,9] * 2,
'values': [1,2,4,8,16,32,64,128,256,512] * 2,
}).sample(frac=1).reset_index(drop=True)
The data is unordered, which is why I added .sample(...) and .reset_index(...) to shuffle it. The example DataFrame looks something like this:
    node  seq  values
0      2    0       1
1      1    5      32
2      2    3       8
3      1    9     512
4      1    4      16
5      2    2       4
6      1    2       4
7      1    7     128
8      1    6      64
9      1    0       1
10     2    9     512
11     2    1       2
12     1    8     256
13     1    1       2
14     2    5      32
15     2    7     128
16     1    3       8
17     2    6      64
18     2    8     256
19     2    4      16
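At the preprocessing stage, the two series in the example are unrelated to each other. Now, for example, I want to add a column with a rolling mean. This is how I currently do it: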
roll_mean = df.groupby('node', as_index=False) \
.apply(lambda g: g.sort_values('seq')['values'].rolling(4).mean()) \
.reset_index(level=0)['values']
# add column
df['rollMean4'] = roll_mean
Is there a better way?
Answer 0 (score: 0)
You can call sort_values before groupby, so the apply can be omitted:
import numpy as np
import pandas as pd

np.random.seed(2019)
df = pd.DataFrame({
'node': [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
'seq': [0,1,2,3,4,5,6,7,8,9] * 2,
'values': [1,2,4,8,16,32,64,128,256,512] * 2,
}).sample(frac=1).reset_index(drop=True)
#print (df)
If you sort by a single column (a Series), the default algorithm, quicksort, is used:
roll_mean = (df.groupby('node', as_index=False)
.apply(lambda g: g.sort_values('seq')['values'].rolling(4).mean())
.reset_index(level=0)['values'])
If the real data contains duplicate values in the sort column, you need to set kind='mergesort', which is a stable sort:
roll_mean = (df.groupby('node', as_index=False)
.apply(lambda g: g.sort_values('seq', kind='mergesort')['values']
.rolling(4).mean())
.reset_index(level=0)['values'])
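To illustrate why the sort algorithm matters once keys repeat, here is a minimal sketch on a hypothetical three-row frame (the name demo and its contents are made up for illustration and are not part of the original data). A stable sort such as mergesort keeps rows with equal keys in their original relative order; quicksort makes no such promise.

```python
import pandas as pd

# Hypothetical toy frame: the key seq == 1 appears twice (index 0 and 2).
demo = pd.DataFrame({'seq': [1, 0, 1], 'values': [10, 20, 30]})

# mergesort is stable, so the two rows with seq == 1 keep their
# original relative order: index 0 still comes before index 2.
stable = demo.sort_values('seq', kind='mergesort')
stable_index = stable.index.tolist()
print(stable_index)  # [1, 0, 2]
```

With unique keys, as in the first example of this answer, every sort algorithm yields the same order, so the choice only matters when (node, seq) pairs repeat.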
If you sort by multiple columns, the default is mergesort:
roll_mean1 = (df.sort_values(['seq','node'])
.groupby('node', as_index=False)['values']
.rolling(4)
.mean()
.reset_index(level=0, drop=True))
# add column
df['rollMean4'] = roll_mean
df['rollMean41'] = roll_mean1
print (df)
    node  seq  values  rollMean4  rollMean41
0      1    9     512     240.00      240.00
1      2    4      16       7.50        7.50
2      1    1       2        NaN         NaN
3      2    1       2        NaN         NaN
4      1    6      64      30.00       30.00
5      1    2       4        NaN         NaN
6      2    6      64      30.00       30.00
7      1    4      16       7.50        7.50
8      1    3       8       3.75        3.75
9      2    7     128      60.00       60.00
10     2    9     512     240.00      240.00
11     1    7     128      60.00       60.00
12     2    3       8       3.75        3.75
13     1    0       1        NaN         NaN
14     2    0       1        NaN         NaN
15     2    2       4        NaN         NaN
16     2    5      32      15.00       15.00
17     1    5      32      15.00       15.00
18     2    8     256     120.00      120.00
19     1    8     256     120.00      120.00
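Both approaches rely on pandas aligning on the index when the result is assigned back, so the sorted order never has to be undone by hand. As a possible alternative (not taken from the answer above), the same idea can be sketched with groupby(...).transform. The use of random_state instead of np.random.seed and the lambda are choices made for this illustration only.

```python
import pandas as pd

# Same dummy data as above, shuffled reproducibly.
df = pd.DataFrame({
    'node': [1] * 10 + [2] * 10,
    'seq': list(range(10)) * 2,
    'values': [2 ** i for i in range(10)] * 2,
}).sample(frac=1, random_state=2019).reset_index(drop=True)

# Sort once, then compute the rolling mean inside each node.
# transform returns a Series that keeps the index of the sorted frame.
roll = (df.sort_values(['node', 'seq'])
          .groupby('node')['values']
          .transform(lambda s: s.rolling(4).mean()))

# Assignment aligns on the index, so every value lands on its original row.
df['rollMean4'] = roll

# Inspect one node in sequence order to verify the result.
node1 = df[df['node'] == 1].sort_values('seq')
print(node1)
```

For node 1 the first three rows in sequence order are NaN because the window of 4 is not yet full; the fourth is (1 + 2 + 4 + 8) / 4 = 3.75.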
A new sample:
np.random.seed(2019)
df = pd.DataFrame({
'node': [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2] * 10,
'seq': [0,1,2,3,4,5,6,7,8,9] * 20,
'values': [1,2,4,8,16,32,64,128,256,512] * 20,
}).sample(frac=1).reset_index(drop=True)
print (df)
roll_mean1 = (df.groupby('node', as_index=False)
.apply(lambda g: g.sort_values('seq')['values'].rolling(4).mean())
.reset_index(level=0)['values'])
roll_mean2 = (df.groupby('node', as_index=False)
.apply(lambda g: g.sort_values('seq', kind='mergesort')['values'].rolling(4).mean())
.reset_index(level=0)['values'])
roll_mean3 = (df.sort_values(['seq','node'])
.groupby('node', as_index=False)['values']
.rolling(4)
.mean()
.reset_index(level=0, drop=True))
# add column
df['rollMean41'] = roll_mean1
df['rollMean42'] = roll_mean2
df['rollMean43'] = roll_mean3
print (df.head())
   node  seq  values  rollMean41  rollMean42  rollMean43
0     2    3       8         5.0         5.0         5.0
1     2    7     128       128.0        80.0        80.0
2     2    0       1         1.0         NaN         NaN
3     2    8     256       160.0       160.0       160.0
4     1    5      32        20.0        20.0        20.0
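The difference between rollMean41 and the other two columns comes from how ties are ordered: when (node, seq) pairs repeat, the order among equal keys decides which original row receives which window. The sketch below makes this visible on a hypothetical three-row frame by spelling out two tie orders explicitly instead of relying on quicksort's actual behaviour; the helper rolling_by_order and the window size of 2 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame for a single node: seq == 0 appears twice (index 0 and 2).
demo = pd.DataFrame({'seq': [0, 1, 0], 'values': [1, 2, 1]})

def rolling_by_order(frame, order):
    """Rolling mean (window 2) over rows taken in the given index order."""
    return frame.loc[order, 'values'].rolling(2).mean()

# Two valid orderings by seq that differ only in how the tie is broken.
first = rolling_by_order(demo, [0, 2, 1])   # stable: index 0 before index 2
second = rolling_by_order(demo, [2, 0, 1])  # tie broken the other way

# The first row of each ordering has an incomplete window, so the NaN
# lands on a different original row in each case.
print(first.sort_index())
print(second.sort_index())
```

In the first ordering the NaN ends up on index 0, in the second on index 2, even though the sorted sequence of values is identical. This matches the pattern in row 2 of the output above, where the stable variants give NaN and the quicksort variant gives 1.0.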