Sample data:
import random
import string
import pandas as pd

test1 = pd.DataFrame({
    'subID': [''.join(random.choice(string.ascii_letters[0:4]) for _ in range(3)) for n in range(100)],
    'ID': [''.join(random.choice(string.ascii_letters[5:9]) for _ in range(3)) for n in range(100)],
    'date': [pd.to_datetime(random.choice(['01-01-2018', '02-01-2018', '03-01-2018',
                                           '04-01-2018', '05-01-2018', '06-01-2018',
                                           '07-01-2018', '08-01-2018', '09-01-2018'])) for n in range(100)],
    'val': [random.choice([1, 2, 3, 4]) for n in range(100)]
}).sort_values('date').drop_duplicates(subset=['subID', 'date'])

# Full monthly range covered by the data
idxs = pd.period_range(min(test1.date), max(test1.date), freq='M')
test1['date'] = pd.to_datetime(test1.date, format='%m-%d-%Y').dt.to_period("M")

# Reindex each subID onto the full monthly range and fill the gaps
groups = []
for name, group in test1.groupby('subID'):
    g_ = group.set_index('date').reindex(idxs).reset_index().rename(columns={'index': 'date'})
    g_['subID'] = g_.subID.bfill().ffill()
    g_['ID'] = g_.ID.bfill().ffill()
    g_['val'] = g_.val.fillna(0)
    groups.append(g_)
# DataFrame.append was removed in pandas 2.0, so concatenate once instead
df = pd.concat(groups).reset_index(drop=True)
Now, on df, I want to run a calculation (e.g. np.std) over every rolling 3-month window for each ID. So:
for name, group in df.groupby('ID'):
    ...
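Then, within each group, I want the standard deviation of all values in a rolling 3-month window. So if an ID group contains 3 subID groups, each with its own set of dates and values, how do I get the standard deviation of all values across every subID in that 3-month window, save it, and move on to the next 3-month window?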
If the data looks like this:
        date subID   ID  val
389  2018-03   dca  fff  0.0
407  2018-03   dcc  fff  0.0
390  2018-04   dca  fff  1.0
408  2018-04   dcc  fff  0.0
391  2018-05   dca  fff  3.0
409  2018-05   dcc  fff  0.0
392  2018-06   dca  fff  0.0
410  2018-06   dcc  fff  2.0
393  2018-07   dca  fff  0.0
411  2018-07   dcc  fff  0.0
394  2018-08   dca  fff  3.0
412  2018-08   dcc  fff  0.0
413  2018-09   dcc  fff  4.0
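Then the windows would be:
[2018-03, 2018-04, 2018-05], calculated as np.std([0, 0, 1, 0, 3, 0])
[2018-04, 2018-05, 2018-06], calculated as np.std([1, 0, 3, 0, 0, 2])
[2018-05, 2018-06, 2018-07], calculated as np.std([3, 0, 0, 2, 0, 0])
and so on...
So in the end, the final dataset would hold one standard deviation per ID per month (except the first two months, because of the window size).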
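To make the expected numbers concrete, here is a minimal sketch that pools the values of every subID over each 3-month window for the `fff` excerpt above. The dictionary `vals_by_month` and the name `window_std` are illustrative only and are not part of the original data frame; `np.std` is used with its default population form (`ddof=0`).

```python
import numpy as np

# Values per month for ID 'fff', copied from the table above
# (two subIDs per month, except 2018-09, which has one row in the excerpt).
vals_by_month = {
    "2018-03": [0.0, 0.0],
    "2018-04": [1.0, 0.0],
    "2018-05": [3.0, 0.0],
    "2018-06": [0.0, 2.0],
    "2018-07": [0.0, 0.0],
    "2018-08": [3.0, 0.0],
    "2018-09": [4.0],
}

months = sorted(vals_by_month)
window_std = {}
# The first two months have no complete 3-month window, so start at index 2.
for i in range(2, len(months)):
    window = months[i - 2:i + 1]
    # Pool the values of all subIDs across the three months of the window.
    pooled = np.concatenate([vals_by_month[m] for m in window])
    window_std[months[i]] = float(np.std(pooled))

for month, value in window_std.items():
    print(month, round(value, 4))
```

Each result is keyed by the last month of its window, so the first entry (`2018-05`) corresponds to `np.std([0, 0, 1, 0, 3, 0])`.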
Answer 0 (score: 1)
Try the following snippet:
import numpy as np

df['month'] = df.date.dt.month  # adding month column for simplicity
mdf = pd.DataFrame({'month': [1, 2, 3, 4, 5, 6, 7, 8, 9]})  # for zero filling

# zero filling for each ID
df = df.groupby('ID').apply(
    lambda x: x[['ID', 'month', 'val']].merge(mdf, on='month', how='right').fillna(
        {'ID': x.ID.dropna().unique()[0], 'val': 0})).reset_index(drop=True)

# aggregating values for each ID and month combination for further computation
df1 = df.groupby(['ID', 'month']).apply(lambda x: x.val.values).reset_index().rename({0: 'val'}, axis=1)

def customrolling(x):
    '''Iterate over one group (i.e. one ID) and return a dataframe whose column
    'stdval' is the rolling std of the last 3 months for that ID.'''
    stdval = []
    for i in range(len(x)):
        if i >= 2:
            # std of all values in the last 3 months for the given ID and month
            stdval.append(np.std(np.concatenate(x.iloc[i-2:i+1]['val'].values, axis=0)))
        else:
            stdval.append(0)  # fewer than 3 months available so far
    return pd.DataFrame({'ID': x.ID.values, 'month': x.month.values, 'stdval': stdval})

target_df = df1.groupby('ID').apply(customrolling).reset_index(drop=True)
This gives the desired target_df:
    ID  month  stdval
0  fff      1     0.0
1  fff      2     0.0
2  fff      3     0.0
3  fff      4     0.0
4  fff      5     0.0
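As a possible alternative to the explicit loop, the pooled standard deviation can be rebuilt from rolling sums of three per-month statistics: the count, the sum, and the sum of squares. This is not the answer's method but a sketch of a different technique. It assumes every ID has one row group per consecutive month (as the reindexing in the question ensures), so a row-based window of 3 equals 3 months. The function name `pooled_rolling_std` and the `sample` frame are hypothetical. Unlike the snippet above, it drops the first two months instead of filling them with 0.

```python
import numpy as np
import pandas as pd

def pooled_rolling_std(df, window=3):
    """Population std of all 'val' entries per ID over a rolling window of months."""
    # Per (ID, date): number of values, their sum, and their sum of squares.
    stats = (
        df.assign(val_sq=df["val"] ** 2)
          .groupby(["ID", "date"], as_index=False)
          .agg(n=("val", "size"), s=("val", "sum"), ss=("val_sq", "sum"))
          .sort_values(["ID", "date"])
    )
    # Rolling sums within each ID; incomplete windows become NaN.
    roll = stats.groupby("ID")[["n", "s", "ss"]].transform(
        lambda col: col.rolling(window).sum()
    )
    mean = roll["s"] / roll["n"]
    # var = E[x^2] - E[x]^2, clipped at 0 to guard against rounding error.
    var = (roll["ss"] / roll["n"] - mean ** 2).clip(lower=0)
    out = stats[["ID", "date"]].copy()
    out["stdval"] = np.sqrt(var)
    return out.dropna(subset=["stdval"]).reset_index(drop=True)

# Small sample taken from the question's 'fff' excerpt (dates kept as strings here).
sample = pd.DataFrame({
    "date": ["2018-03", "2018-03", "2018-04", "2018-04",
             "2018-05", "2018-05", "2018-06", "2018-06"],
    "subID": ["dca", "dcc"] * 4,
    "ID": ["fff"] * 8,
    "val": [0.0, 0.0, 1.0, 0.0, 3.0, 0.0, 0.0, 2.0],
})
result = pooled_rolling_std(sample)
print(result)
```

For this sample, `result` has one row per complete window, dated 2018-05 and 2018-06, and the values match `np.std` applied to the pooled values of each window.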