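I'm trying to figure out how to compute a rolling average that takes both the date and the hour into account before calculating the statistic.
The file looks like this: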
date hour price
1/1/2016 1 a
1/1/2016 2 b
. . .
. . .
1/8/2016 1 c
1/8/2016 2 d
. . .
. . .
1/15/2016 1 e
1/15/2016 2 f
The output should have an extra column, like this:
date hour price ma
1/1/2016 1 a
1/1/2016 2 b
. . .
. . .
1/8/2016 1 c
1/8/2016 2 d
. . .
. . .
1/15/2016 1 e mean(a,c)
1/15/2016 2 f mean(b,d)
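Answer 0: (score: 1)
It's not 100% clear what you want, but here is the assumption I made...
You want, for each hour, the mean over all dates before a given date. That's what this code does...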
import pandas as pd
import numpy as np
import datetime
# build a sample table
np.random.seed(1)
values = np.random.choice(range(1, 11), 25)
dates = np.random.choice(pd.date_range(datetime.date(2016, 1, 1), datetime.date(2016, 1, 4)), 25)
hours = np.random.choice(range(4), 25)
df = pd.DataFrame({'date': dates, 'hour': hours, 'value': values})
df looks like this (first 10 rows)...
date hour value
0 2016-01-03 1 6
1 2016-01-01 2 9
2 2016-01-03 2 10
3 2016-01-02 0 6
4 2016-01-03 3 1
5 2016-01-01 3 1
6 2016-01-04 1 2
7 2016-01-01 1 8
8 2016-01-03 3 7
9 2016-01-01 2 10
Now for the transformation...
df.sort_values(['date', 'hour'], inplace=True)
groups = df.groupby('hour')
# per hour: the running total of 'value' minus the current row's value is the sum of all earlier rows;
# dividing by the count of earlier rows (cumcount is 0-based) gives their mean.
# The first row of each hour is 0 / 0, which pandas reports as NaN.
df['rolling_mean'] = (groups['value'].cumsum() - df['value']) / groups.cumcount()
# just to show the result
df.sort_values(['hour', 'date'])
The result (first 10 rows) is...
date hour value rolling_mean
3 2016-01-02 0 6 NaN
7 2016-01-01 1 8 NaN
0 2016-01-03 1 6 8.0
6 2016-01-04 1 2 7.0
1 2016-01-01 2 9 NaN
9 2016-01-01 2 10 9.0
2 2016-01-03 2 10 9.5
5 2016-01-01 3 1 NaN
4 2016-01-03 3 1 1.0
8 2016-01-03 3 7 1.0
What you do with the NaNs is up to you...
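For comparison, here is a minimal sketch of the same "mean of all earlier rows for this hour" idea written with pandas' expanding window, plus two possible ways of dealing with the NaNs. The tiny table and the names `with_history` and `filled` are made up for illustration, and falling back to the row's own value is only one option, not a recommendation.

```python
import pandas as pd

# Small hand-made table (hypothetical data, not the random sample above)
df = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-01', '2016-01-03', '2016-01-04',
                            '2016-01-01', '2016-01-03']),
    'hour': [1, 1, 1, 2, 2],
    'value': [8, 6, 2, 9, 10],
})
df = df.sort_values(['date', 'hour'])

# expanding().mean() is the running mean within each hour;
# shift(1) moves it down one row so a row never contributes to its own average
df['rolling_mean'] = df.groupby('hour')['value'].transform(
    lambda s: s.expanding().mean().shift(1)
)

# Option 1: drop the rows that have no history yet
with_history = df.dropna(subset=['rolling_mean'])

# Option 2: fall back to the row's own value when there is no history
filled = df['rolling_mean'].fillna(df['value'])
```

The first row of every hour has nothing before it, so it stays NaN until one of the two options is applied.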
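Answer 1: (score: 0)
Take this with a grain of salt, since I don't really know what I'm doing, but I think I ran into this same problem myself, and this is the best solution I could find. I'm sure there's a built-in function for it, but...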
# assumes the index contains the date info (mine had Hour, day_of_week, and a date field)
# and assumes rows are ordered by datetime timestamp
import pandas as pd

df_subset_for_rolling = df['Values to Avg'].groupby(level=['Hour', 'day_of_week', 'Timestamp Date']).mean().fillna(0)
rolling_pieces = []
# walk over each (Hour, day_of_week) subset that actually occurs in the data
for (hour, dow), subset in df_subset_for_rolling.groupby(level=['Hour', 'day_of_week']):
    # rolling mean over 4 dates, requiring at least 3 observations
    rolling_pieces.append(subset.rolling(4, min_periods=3).mean())
# collect the pieces and concatenate once (DataFrame.append no longer exists in pandas 2.x);
# the column name 'rolling_avg' avoids a clash with 'Values to Avg' in the join
rolling_avg = pd.concat(rolling_pieces).rename('rolling_avg')
df = df.join(rolling_avg)
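Basically, I just extract each subset, compute the rolling average on it, add the result to a separate df, and then join that back onto the original data.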
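The loop over subsets can probably be replaced by a grouped transform. The following is a rough sketch on the question's layout, under several assumptions: the prices are numeric (the values 10 to 60 stand in for a to f), the average should cover the two previous occurrences of the same weekday and hour, and the `day_of_week` column is derived here rather than taken from the original file.

```python
import pandas as pd

# Hypothetical numeric prices standing in for a..f from the question
df = pd.DataFrame({
    'date': pd.to_datetime(
        ['1/1/2016', '1/1/2016', '1/8/2016', '1/8/2016', '1/15/2016', '1/15/2016'],
        format='%m/%d/%Y',
    ),
    'hour': [1, 2, 1, 2, 1, 2],
    'price': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})
df = df.sort_values(['date', 'hour'])
df['day_of_week'] = df['date'].dt.dayofweek

# Within each (weekday, hour) group: shift(1) excludes the current row,
# then rolling(2) averages the two previous prices (window size is an assumption)
df['ma'] = df.groupby(['day_of_week', 'hour'])['price'].transform(
    lambda s: s.shift(1).rolling(2).mean()
)
```

With these made-up numbers, the 1/15/2016 rows get mean(a, c) and mean(b, d) as in the question, and earlier rows stay NaN because they lack two prior observations.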