Rolling correlation calculation with Pandas or NumPy

Time: 2018-01-09 09:39:35

Tags: python pandas numpy

I am trying to find an efficient way to solve the following problem:

I have two DataFrames, each containing the following columns:
First: id, date, value

Sample data:

id, date, value
f130,200701,0.016196
f130,200702,-0.027798
f130,200703,-0.014868
f130,200704,0.017801
f130,200705,-0.032700
f130,200706,0.049529
f130,200707,0.011610
f130,200708,-0.008145
f130,200709,-0.001493
f130,200710,0.009719
f130,200711,-0.007775
f130,200712,-0.007835
f131,200701,0.044754
f131,200702,0.004679
f131,200703,-0.011824
f131,200704,0.007252
f131,200705,0.029877
f131,200706,0.001748
f131,200707,0.001047
f131,200708,-0.003137
f131,200709,0.001748
f131,200710,0.006632
f131,200711,-0.012136
f131,200712,0.004914

Second: id_2, date, value

Sample data:

id_2, date, value
d_1,200701,0.026316
d_1,200702,-0.004487
d_1,200703,-0.027044
d_1,200704,-0.035076
d_1,200705,0.010288
d_1,200706,-0.031908
d_1,200707,-0.001403
d_1,200708,0.009831
d_1,200709,0.040334
d_1,200710,0.018048
d_1,200711,0.011819
d_1,200712,0.000000
d_2,200701,0.000000
d_2,200702,0.028553
d_2,200703,-0.037224
d_2,200704,0.041284
d_2,200705,0.151038
d_2,200706,0.061236
d_2,200707,0.001030
d_2,200708,-0.042203
d_2,200709,0.000000
d_2,200710,0.006986
d_2,200711,-0.018676
d_2,200712,0.001087

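What I need is the rolling-window correlation between the two value columns, for every date and for every combination of id and id_2.

Basically, my output should be: "id vs id_2", date, corr

So for d1 (d_1 in the sample data) vs f130 at 200706, I compute the correlation between the values of d1 and f130 over the 6 months ending at 200706, that is, 200701 through 200706. The same applies to every pair.

Expected output: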

id_pair, date, value
d1_f130,200706,-0.375238392
d1_f130,200707,-0.667154011
d1_f130,200708,-0.636064899
d1_f130,200709,-0.672029012
d1_f130,200710,-0.653719992
d1_f130,200711,-0.802893705
d1_f130,200712,-0.03120143
d1_f131,200706,0.870717009
d1_f131,200707,0.61076152
d1_f131,200708,0.400632396
d1_f131,200709,0.05064842
d1_f131,200710,0.087102168
d1_f131,200711,-0.012306865
d1_f131,200712,0.05170204
d2_f130,200706,-0.170979922
d2_f130,200707,-0.15363222
d2_f130,200708,-0.089709021
d2_f130,200709,-0.227564277
d2_f130,200710,-0.252391258
d2_f130,200711,0.94878745
d2_f130,200712,0.619029635
d2_f131,200706,0.358385975
d2_f131,200707,0.952074283
d2_f131,200708,0.930805345
d2_f131,200709,0.919101445
d2_f131,200710,0.904473885
d2_f131,200711,0.47080201
d2_f131,200712,0.640334152
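
To make the expected output concrete, here is a minimal sketch for a single pair built from the sample rows above. It assumes the window is 6 rows ending at the current month and that `d1` in the expected output is the `d_1` series from the sample data; the variable names are illustrative only.

```python
import pandas as pd

dates = [200701 + i for i in range(12)]  # 200701 .. 200712

f130 = pd.Series(
    [0.016196, -0.027798, -0.014868, 0.017801, -0.032700, 0.049529,
     0.011610, -0.008145, -0.001493, 0.009719, -0.007775, -0.007835],
    index=dates,
)
d_1 = pd.Series(
    [0.026316, -0.004487, -0.027044, -0.035076, 0.010288, -0.031908,
     -0.001403, 0.009831, 0.040334, 0.018048, 0.011819, 0.000000],
    index=dates,
)

# Rolling Pearson correlation over a 6-row window ending at each date.
# The first 5 dates are NaN because the window is not yet full.
corr = f130.rolling(6).corr(d_1)

print(corr.loc[200706])  # roughly -0.3752, in line with the d1_f130 row above
```

The open question is how to do this for every pair without looping over each one in Python.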

Iterating over ids and dates with for loops takes days... (roughly 15,000 id values, 300 id_2 values, and 300 dates)

Any ideas?

1 Answer:

Answer 0 (score: 2)

Suppose you have two DataFrames like this:

import pandas as pd

# Column names are changed to keep the example simple
df1 = pd.DataFrame({'id1':id1, 'date':d1,'value1':v1})
df2 = pd.DataFrame({'id2':id2, 'date':d1,'value2':v2})

Then you can merge the two into a single DataFrame:

df = df1.merge(df2,how='outer',on='date')

print(df)  # for example:
                   date    id1   value1    id2  value2
0    2008-01-01 13:30:00    0  59.727276    5  49.423527
1    2008-01-01 13:30:00    0  59.727276    4  49.659602
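
As a rough illustration of what this merge produces, the sketch below uses a few rows from the question's sample data (the frame contents are chosen only for the example). Because the join key is `date` alone, each date ends up with one row for every (id1, id2) combination.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id1': ['f130', 'f130', 'f131', 'f131'],
    'date': [200701, 200702, 200701, 200702],
    'value1': [0.016196, -0.027798, 0.044754, 0.004679],
})
df2 = pd.DataFrame({
    'id2': ['d_1', 'd_1', 'd_2', 'd_2'],
    'date': [200701, 200702, 200701, 200702],
    'value2': [0.026316, -0.004487, 0.000000, 0.028553],
})

merged = df1.merge(df2, how='outer', on='date')

# 2 dates x 2 id1 values x 2 id2 values = 8 rows
print(len(merged))
# Each (id1, id2) pair has one row per date
print(merged.groupby(['id1', 'id2']).size())
```

At the sizes given in the question (about 15,000 x 300 pairs over 300 dates) the merged frame would have on the order of 1.35 billion rows, so it may need to be built in chunks of ids rather than all at once.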

Now group by the id pair and apply the rolling correlation:

dfs = []  # collects one result frame per (id1, id2) pair
for n, g in df.groupby(['id1','id2']):
    _df = pd.DataFrame({'ids':[n]*len(g.date),'date':g.date})
    # rolling correlation between the two value series (window of 10 rows)
    _df['corr'] = g.value1.rolling(10).corr(g.value2)
    dfs.append(_df)

# concatenate the per-pair frames into a single one
final_df = pd.concat(dfs, ignore_index=True)


print(final_df)  # show the result, for example:
# Note: the first 9 rows of each id pair are NaN because the rolling window is 10 rows.
                   date     ids       corr
0    2008-01-01 13:34:00  (0, 0)       NaN
1    2008-01-01 13:34:00  (0, 0)       NaN
2    2008-01-01 13:35:00  (0, 0)       NaN
3    2008-01-01 13:37:00  (0, 0)       NaN
4    2008-01-01 13:37:00  (0, 0)       NaN
5    2008-01-01 13:37:00  (0, 0)       NaN
6    2008-01-01 13:38:00  (0, 0)       NaN
7    2008-01-01 13:38:00  (0, 0)       NaN
8    2008-01-01 13:40:00  (0, 0)       NaN
9    2008-01-01 13:41:00  (0, 0)  0.423877
10   2008-01-01 13:42:00  (0, 0)  0.555128
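
For reference, the same approach can be run end to end on the question's sample data. This is only a sketch under a few assumptions: a window of 6 rows (to match the question), an explicit sort by date inside each group so the window follows time order, and a hypothetical helper `to_long` that just builds the long-format frames.

```python
import pandas as pd

dates = [200701 + i for i in range(12)]  # 200701 .. 200712

values1 = {
    'f130': [0.016196, -0.027798, -0.014868, 0.017801, -0.032700, 0.049529,
             0.011610, -0.008145, -0.001493, 0.009719, -0.007775, -0.007835],
    'f131': [0.044754, 0.004679, -0.011824, 0.007252, 0.029877, 0.001748,
             0.001047, -0.003137, 0.001748, 0.006632, -0.012136, 0.004914],
}
values2 = {
    'd_1': [0.026316, -0.004487, -0.027044, -0.035076, 0.010288, -0.031908,
            -0.001403, 0.009831, 0.040334, 0.018048, 0.011819, 0.000000],
    'd_2': [0.000000, 0.028553, -0.037224, 0.041284, 0.151038, 0.061236,
            0.001030, -0.042203, 0.000000, 0.006986, -0.018676, 0.001087],
}

def to_long(values, id_col, value_col):
    # Hypothetical helper: turn {id: [values...]} into (id, date, value) rows
    rows = [(key, d, v) for key, vals in values.items() for d, v in zip(dates, vals)]
    return pd.DataFrame(rows, columns=[id_col, 'date', value_col])

df1 = to_long(values1, 'id1', 'value1')
df2 = to_long(values2, 'id2', 'value2')
df = df1.merge(df2, how='outer', on='date')

parts = []
for (id1, id2), g in df.groupby(['id1', 'id2']):
    g = g.sort_values('date')  # rolling windows depend on row order
    part = pd.DataFrame({
        'id_pair': f'{id2}_{id1}',  # id_2 first, as in the expected output
        'date': g['date'],
        'corr': g['value1'].rolling(6).corr(g['value2']),
    })
    parts.append(part.dropna(subset=['corr']))  # drop the incomplete windows

result = pd.concat(parts, ignore_index=True)
print(result.head(7))
```

Dropping the NaN rows leaves 7 dates (200706 through 200712) for each of the 4 pairs.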

Notes:

Update
You can rename the columns of your sample data to match this answer like so:

df1.columns  = ['id1','date','value1']
df2.columns  = ['id2','date','value2']

To make the ids match your expected output, you can replace
'ids':[n]*len(g.date)
with
'ids':['_'.join(n)]*len(g.date)
for example (this assumes both ids are strings).
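
For the sizes mentioned in the question, the per-pair loop above may still be slow. One possible alternative, which is not part of the original answer, is to work on wide arrays (one row per date, one column per id, for example from `DataFrame.pivot`) and compute every pair for each window at once with NumPy. The sketch below assumes NumPy 1.20 or later for `sliding_window_view`, rows already in date order, and no missing values; the function name `rolling_corr_all_pairs` is made up for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_corr_all_pairs(a, b, window):
    """Rolling Pearson correlation between every column of a and every column of b.

    a: array of shape (T, n1), b: array of shape (T, n2), rows in date order.
    Returns an array of shape (T - window + 1, n1, n2); entry [t, i, j] is the
    correlation of a[:, i] and b[:, j] over rows t .. t + window - 1.
    """
    aw = sliding_window_view(a, window, axis=0)  # (T - window + 1, n1, window)
    bw = sliding_window_view(b, window, axis=0)  # (T - window + 1, n2, window)
    aw = aw - aw.mean(axis=-1, keepdims=True)    # center each window
    bw = bw - bw.mean(axis=-1, keepdims=True)
    cov = np.einsum('tiw,tjw->tij', aw, bw)      # cross products for all pairs
    norm_a = np.sqrt((aw ** 2).sum(axis=-1))
    norm_b = np.sqrt((bw ** 2).sum(axis=-1))
    # A window with constant values has zero norm and yields NaN here
    return cov / (norm_a[:, :, None] * norm_b[:, None, :])

# First 7 months of the sample data: columns are (f130, f131) and (d_1, d_2)
a = np.array([[0.016196, 0.044754], [-0.027798, 0.004679], [-0.014868, -0.011824],
              [0.017801, 0.007252], [-0.032700, 0.029877], [0.049529, 0.001748],
              [0.011610, 0.001047]])
b = np.array([[0.026316, 0.000000], [-0.004487, 0.028553], [-0.027044, -0.037224],
              [-0.035076, 0.041284], [0.010288, 0.151038], [-0.031908, 0.061236],
              [-0.001403, 0.001030]])

out = rolling_corr_all_pairs(a, b, window=6)
print(out.shape)     # (2, 2, 2): windows ending at 200706 and 200707
print(out[0, 0, 0])  # f130 vs d_1 for the window ending at 200706
```

With about 15,000 x 300 pairs and 300 dates the full result would take roughly 10 GB as float64, so in practice the columns of `a` would probably have to be processed in chunks.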