I have a DataFrame that looks like this:
import pandas as pd
df = pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
                   'order_start': [1,2,3,1,2,3,1,2,3,1],
                   'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})
df
Out[40]:
   category  order_start  time
0         1            1     1
1         1            2     4
2         1            3     3
3         2            1     6
4         2            2     8
5         2            3    17
6         3            1    14
7         3            2    12
8         3            3    13
9         4            1    16
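I want to create a new column containing the mean of the previous times within the same category. How can I create it?
The new column should look like this: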
pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
              'order_start': [1,2,3,1,2,3,1,2,3,1],
              'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16],
              'mean': [np.nan, 1, 2.5, np.nan, 6, 7, np.nan, 14, 13, np.nan]})
Out[41]:
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0  = 1 / 1
2         1            3     3   2.5  = (4+1)/2
3         2            1     6   NaN
4         2            2     8   6.0  = 6 / 1
5         2            3    17   7.0  = (8+6) / 2
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN
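Note: for the first row of each category, the mean should be NaN.
Edit: As cs95 pointed out, my question is not quite the same as this one, because an expanding window is needed here.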
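Answer 0: (score: 2)
"Create a new column containing the mean of the previous times within the same category" sounds like a good use case for GroupBy.expanding (combined with a shift):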
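df['mean'] = (
    # group_keys=False keeps the original row index so the result aligns with df
    # (needed on pandas >= 2.0, where apply otherwise adds the group key to the index)
    df.groupby('category', group_keys=False)['time']
      .apply(lambda x: x.shift().expanding().mean()))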
df
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0
2         1            3     3   2.5
3         2            1     6   NaN
4         2            2     8   6.0
5         2            3    17   7.0
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN
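To make the shift-then-expand idea concrete, here is a minimal sketch that applies the two steps to a single group by hand. It uses only the three times of category 1 from the question; the variable names `times`, `shifted` and `prev_mean` are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

# Times for category 1 from the question's sample data
times = pd.Series([1, 4, 3], name="time")

# shift() moves every value down one row, so each row only "sees" earlier values
shifted = times.shift()                  # NaN, 1.0, 4.0

# expanding().mean() averages everything from the first row up to the current one,
# skipping NaN; the first row has no valid value yet, so it stays NaN
prev_mean = shifted.expanding().mean()   # NaN, 1.0, 2.5

print(pd.DataFrame({"time": times, "shifted": shifted, "prev_mean": prev_mean}))
```

Running the same two steps inside each category is what the grouped version does for the whole DataFrame.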
Another way to compute this, without apply, is to chain two groupby calls:
df['mean'] = (
    df.groupby('category')['time']
      .shift()
      .groupby(df['category'])
      .expanding()
      .mean()
      .to_numpy())  # replace to_numpy() with `.values` for pd.__version__ < 0.24
# Note: to_numpy() discards the index, so this assumes df is already sorted by category
df
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0
2         1            3     3   2.5
3         2            1     6   NaN
4         2            2     8   6.0
5         2            3    17   7.0
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN
Performance-wise, which of the two is faster really depends on the number and size of your groups.
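Since the answer depends on your data, one option is to time both variants on a frame shaped like your own. The following is only a rough sketch: the row and group counts are arbitrary assumptions, the helper names `with_apply` and `with_two_groupbys` are made up for illustration, and the second helper uses `reset_index(level=0, drop=True)` instead of `to_numpy()` so that its result keeps the original row index.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; the sizes are arbitrary assumptions, adjust them to resemble your data
rng = np.random.default_rng(0)
n_rows, n_groups = 20_000, 500
big = pd.DataFrame({
    "category": np.sort(rng.integers(0, n_groups, n_rows)),  # sorted, as in the question
    "time": rng.integers(1, 100, n_rows),
})

def with_apply(df):
    # group_keys=False keeps the original row index on the result
    return (df.groupby("category", group_keys=False)["time"]
              .apply(lambda x: x.shift().expanding().mean()))

def with_two_groupbys(df):
    out = (df.groupby("category")["time"].shift()
             .groupby(df["category"]).expanding().mean())
    # drop the group level that expanding() adds, leaving the original row index
    return out.reset_index(level=0, drop=True)

# Check that both variants agree before comparing their timings
same = np.allclose(with_apply(big).to_numpy(),
                   with_two_groupbys(big).to_numpy(),
                   equal_nan=True)
print("results agree:", same)

for fn in (with_apply, with_two_groupbys):
    print(fn.__name__, timeit.timeit(lambda: fn(big), number=3))
```

Try varying `n_groups` while holding `n_rows` fixed to see how the balance shifts between many small groups and a few large ones.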
Answer 1: (score: 0)
Inspired by my answer here, you can first define a function:
def mean_previous(df, Category, Order, Var):
    # Sort the dataframe first (note: this sorts the caller's df in place)
    df.sort_values([Category, Order], inplace=True)
    # Calculate the ordinary grouped cumulative sum
    # and then subtract the grouped cumulative sum of the last order
    csp = df.groupby(Category)[Var].cumsum() - df.groupby([Category, Order])[Var].cumsum()
    # Calculate the ordinary grouped cumulative count
    # and then subtract the grouped cumulative count of the last order
    ccp = df.groupby(Category)[Var].cumcount() - df.groupby([Category, Order]).cumcount()
    return csp / ccp
The desired column is then:
df['mean'] = mean_previous(df, 'category', 'order_start', 'time')
Performance-wise, I think it is very fast, since it relies only on built-in grouped cumulative operations.
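To see why the cumulative approach yields the mean of the previous rows, here is a simplified sketch on the question's sample data. It assumes the rows are already sorted and that each (category, order_start) pair occurs exactly once, in which case the second cumulative sum is just the row's own value and the second cumulative count is zero. The names `prev_sum` and `prev_count` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
                   'order_start': [1,2,3,1,2,3,1,2,3,1],
                   'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})

g = df.groupby('category')['time']

# Sum of the earlier rows in the group: running total minus the row's own value
prev_sum = g.cumsum() - df['time']
# Number of earlier rows in the group (cumcount starts at 0)
prev_count = g.cumcount()

# On the first row of each group this is 0 / 0, which pandas evaluates to NaN
df['mean'] = prev_sum / prev_count
print(df)
```

When several rows share the same order value, this shortcut no longer applies and the full function above, which subtracts the per-order cumulative terms, is the one to use.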