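python3.5 / pandas - rolling mean by week and hour

Time: 2016-06-16 06:50:50

Tags: python pandas

I'm trying to figure out how to compute a rolling mean that takes the date and hour into account before calculating the statistic.

The file looks like this: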

 date       hour    price
 1/1/2016    1        a
 1/1/2016    2        b
    .        .        .
    .        .        .
 1/8/2016    1        c
 1/8/2016    2        d
    .        .        .
    .        .        .
 1/15/2016   1        e
 1/15/2016   2        f    

The output column, though, should look like this:

 date       hour    price    ma
 1/1/2016    1        a
 1/1/2016    2        b
    .        .        .
    .        .        .
 1/8/2016    1        c
 1/8/2016    2        d
    .        .        .
    .        .        .
 1/15/2016   1        e    mean(a,c)
 1/15/2016   2        f    mean(b,d) 

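2 answers:

Answer 0: (score: 1)

It's not 100% clear what you want, but here is the assumption I made...

You want, for each hour, the mean over all dates before a given date. This code does just that...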

import pandas as pd
import numpy as np
import datetime

# build a sample table
np.random.seed(1)
values = np.random.choice(range(1, 11), 25)
dates = np.random.choice(pd.date_range(datetime.date(2016, 1, 1), datetime.date(2016, 1, 4)), 25)
hours = np.random.choice(range(4), 25)
df = pd.DataFrame({'date': dates, 'hour': hours, 'value': values})

df looks like this...

        date  hour  value
0 2016-01-03     1      6
1 2016-01-01     2      9
2 2016-01-03     2     10
3 2016-01-02     0      6
4 2016-01-03     3      1
5 2016-01-01     3      1
6 2016-01-04     1      2
7 2016-01-01     1      8
8 2016-01-03     3      7
9 2016-01-01     2     10

Now for the transformation...

df.sort_values(['date', 'hour'], inplace=True)
groups = df.groupby(['hour'])

# per hour: take the cumulative sum of 'value', subtract the current row's value, then...
#     divide by the count of previous observations (works because cumcount is 0-based)
# (cumsum is applied to the 'value' column only, so the date column is never summed)
df['rolling_mean'] = (groups['value'].cumsum() - df['value']) / groups.cumcount()

# just to show result
df.sort_values(['hour', 'date'])

The result is...

        date  hour  value  rolling_mean
3 2016-01-02     0      6           NaN
7 2016-01-01     1      8           NaN
0 2016-01-03     1      6           8.0
6 2016-01-04     1      2           7.0
1 2016-01-01     2      9           NaN
9 2016-01-01     2     10           9.0
2 2016-01-03     2     10           9.5
5 2016-01-01     3      1           NaN
4 2016-01-03     3      1           1.0
8 2016-01-03     3      7           1.0

What you do with the NaNs is up to you...
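The same "mean of earlier observations per hour" can also be sketched with shift plus an expanding mean, which avoids the manual cumsum/cumcount arithmetic. The sample values and the column names prior_mean and prior_mean_filled below are made up for illustration, and the NaN fallback shown is only one possible choice, not something the answer prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data in the same shape as the answer's df
df = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-01', '2016-01-03', '2016-01-04',
                            '2016-01-01', '2016-01-03']),
    'hour': [1, 1, 1, 2, 2],
    'value': [8, 6, 2, 9, 10],
})

df = df.sort_values(['date', 'hour'])

# Per hour: shift by one so the current row is excluded,
# then take the expanding mean of everything seen so far
df['prior_mean'] = (
    df.groupby('hour')['value']
      .transform(lambda s: s.shift(1).expanding().mean())
)

# One possible NaN policy (an assumption): fall back to the row's own value
# when there is no earlier observation for that hour
df['prior_mean_filled'] = df['prior_mean'].fillna(df['value'])

print(df.sort_values(['hour', 'date']))
```

Replacing expanding() with rolling(n) in the lambda would restrict the mean to the previous n observations instead of all of them.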

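Answer 1: (score: 0)

Take this with a grain of salt, because I don't really know what I'm doing, but I think I ran into this problem myself, and this is the best solution I could find. I'm sure there's a built-in function for it, but...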

# assumes the index contains date info (mine had Hour, day_of_week, and a date field)
# and assumes the rows are ordered by datetime timestamp

df_subset_for_rolling = df['Values to Avg'].groupby(level=['Hour','day_of_week','Timestamp Date']).mean().fillna(0)

list_of_unique_hour = df_subset_for_rolling.index.get_level_values('Hour').unique().tolist()
list_of_unique_dow = df_subset_for_rolling.index.get_level_values('day_of_week').unique().tolist()

# (Hour, day_of_week) pairs, in the same order as the first two index levels
comb_hour_dow = [(h,d) for h in list_of_unique_hour for d in list_of_unique_dow]

rolling_parts = []
for h_d_tuple in comb_hour_dow:
    # rolling mean over the dates within one (Hour, day_of_week) subset
    df_append = df_subset_for_rolling.loc[h_d_tuple].rolling(4,min_periods=3).mean()
    # add the day_of_week and Hour index levels back (Hour ends up outermost)
    df_append = pd.concat([df_append],keys=[h_d_tuple[1]],names=['day_of_week'])
    df_append = pd.concat([df_append],keys=[h_d_tuple[0]],names=['Hour'])
    rolling_parts.append(df_append)

# DataFrame.append was removed in pandas 2.0, so concatenate the pieces instead;
# the result is renamed so it does not collide with the original column on join
rolling_avg = pd.concat(rolling_parts).rename('rolling_avg')

df = df.join(rolling_avg)

Basically, I just pull out each subset, take the rolling average over it, add it to a separate df, and then combine it back with the original data.

Reference (original image link not preserved): used for adding the index values back
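The per-subset loop above can probably be collapsed into a single groupby/transform. The following is a minimal sketch under several assumptions: flat columns named date, hour, and price as in the question, a made-up day_of_week helper column, placeholder prices, and a window of the two previous same-weekday observations (taken from the question's mean(a,c) example rather than from this answer's rolling(4, min_periods=3)).

```python
import pandas as pd

# Hypothetical data in the question's layout: same weekday, three consecutive weeks
df = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-01', '2016-01-01',
                            '2016-01-08', '2016-01-08',
                            '2016-01-15', '2016-01-15']),
    'hour': [1, 2, 1, 2, 1, 2],
    'price': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],  # stand-ins for a..f
})

# Helper column (assumed name): 0 = Monday ... 6 = Sunday
df['day_of_week'] = df['date'].dt.dayofweek
df = df.sort_values(['date', 'hour'])

# Within each (day_of_week, hour) group: drop the current row with shift(1),
# then average the two previous observations
df['ma'] = (
    df.groupby(['day_of_week', 'hour'])['price']
      .transform(lambda s: s.shift(1).rolling(2).mean())
)

print(df)
```

With these placeholder prices, the two 2016-01-15 rows get the mean of the corresponding hours on 2016-01-01 and 2016-01-08, and all earlier rows stay NaN because fewer than two prior observations exist.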