Pandas groupby on a rolling window

Date: 2019-10-07 18:00:08

Tags: python pandas

Sample data:

import random
import string
import pandas as pd

test1 = pd.DataFrame({
    'subID':[''.join(random.choice(string.ascii_letters[0:4]) for _ in range(3)) for n in range(100)],
    'ID':[''.join(random.choice(string.ascii_letters[5:9]) for _ in range(3)) for n in range(100)],
    'date':[pd.to_datetime(random.choice(['01-01-2018','02-01-2018','03-01-2018',
                                          '04-01-2018','05-01-2018','06-01-2018',
                                          '07-01-2018','08-01-2018','09-01-2018'])) for n in range(100)],
    'val':[random.choice([1,2,3,4]) for n in range(100)]
}).sort_values('date').drop_duplicates(subset=['subID','date'])

idxs = pd.period_range(min(test1.date), max(test1.date), freq='M')

test1['date'] = pd.to_datetime(test1.date, format='%m-%d-%Y').dt.to_period("M")

frames = []
for name, group in test1.groupby('subID'):
    # give every subID one row per month in the full range
    g_ = group.set_index('date').reindex(idxs).reset_index().rename(columns={'index': 'date'})
    g_['subID'] = g_.subID.bfill().ffill()
    g_['ID'] = g_.ID.bfill().ffill()
    g_['val'] = g_.val.fillna(0)  # months without data count as 0
    frames.append(g_)
df = pd.concat(frames, ignore_index=True)  # DataFrame.append was removed in pandas 2.0

Now, on df, I want to run a calculation (e.g. np.std) over every rolling 3-month window for each ID. So:

for name, group in df.groupby('ID'):
    ...

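Then, within each group, I want the standard deviation of all values that fall in a rolling 3-month window. An ID group may contain, say, 3 subID groups, each with its own set of dates and values. How do I get the standard deviation of the values of every subID in that 3-month window pooled together, save it, and then move on to the next 3-month window?

If the data looks like this: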

        date subID   ID  val
389  2018-03   dca  fff  0.0
407  2018-03   dcc  fff  0.0
390  2018-04   dca  fff  1.0
408  2018-04   dcc  fff  0.0
391  2018-05   dca  fff  3.0
409  2018-05   dcc  fff  0.0
392  2018-06   dca  fff  0.0
410  2018-06   dcc  fff  2.0
393  2018-07   dca  fff  0.0
411  2018-07   dcc  fff  0.0
394  2018-08   dca  fff  3.0
412  2018-08   dcc  fff  0.0
413  2018-09   dcc  fff  4.0

Then the windows would be:

[2018-03, 2018-04, 2018-05], computed as: np.std([0,0,1,0,3,0])

[2018-04, 2018-05, 2018-06], computed as: np.std([1,0,3,0,0,2])

[2018-05, 2018-06, 2018-07], computed as: np.std([3,0,0,2,0,0])

And so on...

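So in the end, the final dataset would hold one standard deviation per ID per month (except for the first two months, because of the window size).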
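The windows listed above can be checked directly against the example table. The following is only a minimal sketch, not part of the original question: the names sample, per_month and window_std are made up for illustration, and the months are kept as 'YYYY-MM' strings instead of Period values to keep it short.

```python
import numpy as np
import pandas as pd

# Rows copied from the example table; months kept as 'YYYY-MM' strings for brevity.
sample = pd.DataFrame({
    'date':  ['2018-03', '2018-03', '2018-04', '2018-04', '2018-05', '2018-05',
              '2018-06', '2018-06', '2018-07', '2018-07', '2018-08', '2018-08',
              '2018-09'],
    'subID': ['dca', 'dcc'] * 6 + ['dcc'],
    'ID':    ['fff'] * 13,
    'val':   [0.0, 0.0, 1.0, 0.0, 3.0, 0.0, 0.0, 2.0, 0.0, 0.0, 3.0, 0.0, 4.0],
})

# One list of values per month (all subIDs pooled), sorted by month.
per_month = sample.groupby('date')['val'].agg(list)

# For each month from the third one on, pool the values of that month and
# the two before it, then take the standard deviation of the pooled values.
window_std = {}
for i in range(2, len(per_month)):
    pooled = np.concatenate(per_month.iloc[i - 2:i + 1].tolist())
    window_std[per_month.index[i]] = np.std(pooled)

for month, value in window_std.items():
    print(month, round(value, 4))
```

Each key of window_std is the last month of its window. Note that np.std defaults to the population standard deviation (ddof=0), whereas pandas' own .std() defaults to ddof=1, so the two give different numbers unless ddof is set explicitly.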

1 answer:

Answer 0 (score: 1)

Try the following snippet:

import numpy as np

df['month']=df.date.dt.month # adding month column for simplicity

mdf=pd.DataFrame({'month':[1,2,3,4,5,6,7,8,9]}) # for zero filling

df=df.groupby('ID').apply(lambda x: x[['ID','month','val']].merge(mdf, on='month', how='right').fillna(
{'ID':x.ID.dropna().unique()[0], 'val':0})).reset_index(drop=True) # zero filling for each ID

df1=df.groupby(['ID', 'month']).apply(lambda x: x.val.values).reset_index().rename({0:'val'}, axis=1) # Aggregating values for each ID and Month combination for further computation

def customrolling(x):
    '''Take the rows of one group (i.e. one ID) and return a dataframe with column 'stdval': the std of all values in the last 3 months for that ID (0 for the first two months).'''
    stdval=[]
    for i in range(len(x)):
        if i>=2:
            # pool the value arrays of months i-2..i and take the std of the pooled values
            stdval.append(np.std(np.concatenate(x['val'].iloc[i-2:i+1].values, axis=0)))
        else:
            stdval.append(0) # window not full yet
    return pd.DataFrame({'ID':x.ID.values, 'month':x.month.values, 'stdval':stdval})

target_df=df1.groupby('ID').apply(lambda x: customrolling(x)).reset_index(drop=True)

This gives the desired target_df:

    ID  month  stdval
0  fff      1     0.0
1  fff      2     0.0
2  fff      3     0.0
3  fff      4     0.0
4  fff      5     0.0
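
As an alternative to looping over the rows, the pooled standard deviation can be rebuilt from rolling sums, because the population variance of a pooled set equals sum(x^2)/n - (sum(x)/n)^2. The sketch below is not from the original answer. The function name pooled_rolling_std is made up, and it assumes that df has the columns ID, date and val and that every ID has at least one row for every month (as in the zero-filled df above), so that consecutive rows of the monthly table are consecutive months.

```python
import numpy as np
import pandas as pd

def pooled_rolling_std(df, window=3):
    """Population std of all 'val' entries of an ID inside each rolling
    window of `window` months. Hypothetical helper, for illustration only."""
    tmp = df.assign(sq=df['val'] ** 2)
    # Per ID and month: number of values, their sum, and their sum of squares.
    monthly = tmp.groupby(['ID', 'date']).agg(
        n=('val', 'count'), s=('val', 'sum'), ss=('sq', 'sum'))
    # Rolling sums within each ID; the first window-1 months become NaN.
    rolled = monthly.groupby(level='ID').transform(
        lambda g: g.rolling(window).sum())
    mean = rolled['s'] / rolled['n']
    var = rolled['ss'] / rolled['n'] - mean ** 2
    # clip guards against tiny negative values caused by rounding
    return var.clip(lower=0).pow(0.5).rename('stdval').reset_index()

# Usage on the first four months of the example table from the question.
toy = pd.DataFrame({
    'date':  ['2018-03', '2018-03', '2018-04', '2018-04',
              '2018-05', '2018-05', '2018-06', '2018-06'],
    'subID': ['dca', 'dcc'] * 4,
    'ID':    ['fff'] * 8,
    'val':   [0.0, 0.0, 1.0, 0.0, 3.0, 0.0, 0.0, 2.0],
})
res = pooled_rolling_std(toy)
expected = np.std([0, 0, 1, 0, 3, 0])  # window 2018-03 .. 2018-05
print(res)
print(expected)
```

Unlike customrolling, this leaves the first two months of each ID as NaN rather than 0, which matches the question's remark that those months have no full window. It also avoids storing arrays inside dataframe cells, at the cost of relying on the sum-of-squares identity, which can lose precision when the values are very large.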