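I want to resample 15-minute data to 60-minute data using the pandas .resample function with the 'mean' method. By default, each hourly value is the average of the on-the-hour value and the three values that follow it. Is there a way to average the on-the-hour value with the three values that precede it instead?
Input data (the real input covers 365 days):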
Generated On CB_P
2019-01-01 08:15:00+00:00 0.187
2019-01-01 08:30:00+00:00 0.228
2019-01-01 08:45:00+00:00 0.242
2019-01-01 09:00:00+00:00 0.8270
2019-01-01 09:15:00+00:00 1.083
2019-01-01 09:30:00+00:00 3.022
2019-01-01 09:45:00+00:00 1.511
2019-01-01 10:00:00+00:00 1.568
2019-01-01 10:15:00+00:00 6.365
2019-01-01 10:30:00+00:00 8.23
2019-01-01 10:45:00+00:00 9.3
2019-01-01 11:00:00+00:00 14.311
2019-01-01 11:15:00+00:00 13.045
2019-01-01 11:30:00+00:00 11.05
2019-01-01 11:45:00+00:00 11.257
2019-01-01 12:00:00+00:00 13.367
2019-01-01 12:15:00+00:00 11.895
2019-01-01 12:30:00+00:00 9.245
2019-01-01 12:45:00+00:00 7.254
2019-01-01 13:00:00+00:00 15.773
2019-01-01 13:15:00+00:00 14.280
2019-01-01 13:30:00+00:00 17.258
2019-01-01 13:45:00+00:00 7.792
2019-01-01 14:00:00+00:00 6.893
2019-01-01 14:15:00+00:00 4.693
2019-01-01 14:30:00+00:00 4.271
2019-01-01 14:45:00+00:00 1.524
2019-01-01 15:00:00+00:00 1.495
2019-01-01 15:15:00+00:00 1.03
2019-01-01 15:30:00+00:00 0.364
2019-01-01 15:45:00+00:00 0.045
Expected output:
Generated On CB_P
2019-01-01 09:00:00+00:00 0.371
2019-01-01 10:00:00+00:00 1.796
2019-01-01 11:00:00+00:00 9.5515
2019-01-01 12:00:00+00:00 12.180
2019-01-01 13:00:00+00:00 11.04
2019-01-01 14:00:00+00:00 11.556
2019-01-01 15:00:00+00:00 2.996
Answer 0 (score: 0)
Try this:
# 'Generated On' must be a datetime column; a Series exposes the hour through the .dt accessor
df.groupby(df['Generated On'].dt.hour)[['CB_P']].mean()
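Grouping by the hour number alone pools the same hour from every day and places 09:00 with 09:15 to 09:45, not with the three values before it. The following is a minimal sketch of a groupby variant that rounds each timestamp up to the next full hour, so each group ends on the hour. The sample frame and the names hour_end and hourly are assumptions made for illustration; the values are the first eight rows of the question's input.

```python
import pandas as pd

# Small sample taken from the question's input (first two hours).
df = pd.DataFrame({
    "Generated On": pd.date_range("2019-01-01 08:15", periods=8, freq="15min", tz="UTC"),
    "CB_P": [0.187, 0.228, 0.242, 0.827, 1.083, 3.022, 1.511, 1.568],
})

# Round each timestamp up to the next full hour.
# A timestamp that is already on the hour is left unchanged.
hour_end = df["Generated On"].dt.ceil("60min")

# Average all rows that share the same hour end.
hourly = df.groupby(hour_end)["CB_P"].mean()
print(hourly)
```

Because the full date is part of the grouping key, this should also keep different days separate on the 365-day input.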
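Answer 1 (score: 0)
How about this? Basically, you shift the original datetime column by 15 minutes and then resample on the shifted column. You can also build multiple aggregations or custom functions. I am using pandas==1.1.3. df_Agg2 should be what you are after.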
import pandas as pd
import scipy.stats as stats
from datetime import timedelta
df = pd.read_csv(r't1.csv')
df['Generated On'] = pd.to_datetime( df['Generated On'] )
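# Shift every timestamp back 15 minutes so that 08:15 to 09:00 land in the same hourly bin
df['datetime_Adj'] = df['Generated On'] - timedelta(minutes=15)
# Custom aggregations: most frequent value, and range (max minus min)
lambda0 = lambda x: stats.mode(x)[0]
lambda1 = lambda x: x.max() - x.min()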
##########################################################################
df_Agg1 = df.resample(rule='1H', on='Generated On').apply({
'CB_P': ['sum', 'mean', 'min', 'max', lambda0, lambda1 ],
})
# Rename the columns
df_Agg1.columns = ['_'.join(pair) for pair in df_Agg1.columns]
df_Agg1.reset_index(inplace=True)
##########################################################################
df_Agg2 = df.resample(rule='1H', on='datetime_Adj').apply({
'CB_P': ['sum', 'mean', 'min', 'max', lambda0, lambda1 ],
})
# Rename the columns
df_Agg2.columns = ['_'.join(pair) for pair in df_Agg2.columns]
df_Agg2.reset_index(inplace=True)
##########################################################################
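One detail of the shifted-column approach is that each bin is labelled by its shifted start (for example 08:00), while the expected output labels each bin by the end of the hour (09:00). The following is a minimal sketch of that relabelling on a small sample; the in-memory frame (in place of t1.csv), the reduced set of aggregations, and the name agg are assumptions made for illustration.

```python
import pandas as pd

# Small sample taken from the question's input (first two hours).
df = pd.DataFrame({
    "Generated On": pd.date_range("2019-01-01 08:15", periods=8, freq="15min", tz="UTC"),
    "CB_P": [0.187, 0.228, 0.242, 0.827, 1.083, 3.022, 1.511, 1.568],
})

# Shift back 15 minutes so that 08:15..09:00 become 08:00..08:45 and share one hourly bin.
df["datetime_Adj"] = df["Generated On"] - pd.Timedelta(minutes=15)

# Resample on the shifted column with a few aggregations.
agg = df.resample("60min", on="datetime_Adj")["CB_P"].agg(["mean", "min", "max"])

# Bins are labelled by their shifted start (08:00, 09:00); move each label to the hour end.
agg.index = agg.index + pd.Timedelta(hours=1)
agg.index.name = "Generated On"
print(agg)
```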
Answer 2 (score: 0)
Tell resample to start each period 45 minutes earlier:
r = df.resample('1H', offset=pd.Timedelta("-00:45:00")).mean()
This produces the correct means, but the index is wrong (shifted). Fix it by moving the index forward by 45 minutes:
r.index += pd.Timedelta("00:45:00")
# CB_P
#Generated On
#2019-01-01 09:00:00+00:00 0.371000
#2019-01-01 10:00:00+00:00 1.796000
#2019-01-01 11:00:00+00:00 9.551500
#2019-01-01 12:00:00+00:00 12.179750
#2019-01-01 13:00:00+00:00 11.041750
#2019-01-01 14:00:00+00:00 11.555750
#2019-01-01 15:00:00+00:00 2.995750
#2019-01-01 16:00:00+00:00 0.479667
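For comparison, here is a minimal self-contained sketch that runs the offset approach on a small sample and checks it against an alternative that is not part of the answer above: passing closed='right' and label='right' to resample, which makes each bin include its right edge and carry that edge as its label. The sample frame and the names r and r2 are assumptions made for illustration, and the offset argument needs pandas 1.1 or later.

```python
import pandas as pd

# Small sample taken from the question's input, indexed by timestamp.
idx = pd.date_range("2019-01-01 08:15", periods=8, freq="15min", tz="UTC", name="Generated On")
df = pd.DataFrame({"CB_P": [0.187, 0.228, 0.242, 0.827, 1.083, 3.022, 1.511, 1.568]}, index=idx)

# Offset approach: start each bin 45 minutes early, then move the labels forward.
r = df.resample("60min", offset=pd.Timedelta(minutes=-45)).mean()
r.index = r.index + pd.Timedelta(minutes=45)

# Alternative (assumption, not from the answer): right-closed bins labelled by their right edge.
r2 = df.resample("60min", closed="right", label="right").mean()

print(r)
print(r2)
```

On this sample both variants give the same two rows, labelled 09:00 and 10:00, with means 0.371 and 1.796, which matches the first two rows of the expected output.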