I am trying to find an efficient way to solve the following problem.
I have two data frames, each containing the following data:
First: id, date, value
Sample data:
id, date, value
f130,200701,0.016196
f130,200702,-0.027798
f130,200703,-0.014868
f130,200704,0.017801
f130,200705,-0.032700
f130,200706,0.049529
f130,200707,0.011610
f130,200708,-0.008145
f130,200709,-0.001493
f130,200710,0.009719
f130,200711,-0.007775
f130,200712,-0.007835
f131,200701,0.044754
f131,200702,0.004679
f131,200703,-0.011824
f131,200704,0.007252
f131,200705,0.029877
f131,200706,0.001748
f131,200707,0.001047
f131,200708,-0.003137
f131,200709,0.001748
f131,200710,0.006632
f131,200711,-0.012136
f131,200712,0.004914
Second: id_2, date, value
Sample data:
id_2, date, value
d_1,200701,0.026316
d_1,200702,-0.004487
d_1,200703,-0.027044
d_1,200704,-0.035076
d_1,200705,0.010288
d_1,200706,-0.031908
d_1,200707,-0.001403
d_1,200708,0.009831
d_1,200709,0.040334
d_1,200710,0.018048
d_1,200711,0.011819
d_1,200712,0.000000
d_2,200701,0.000000
d_2,200702,0.028553
d_2,200703,-0.037224
d_2,200704,0.041284
d_2,200705,0.151038
d_2,200706,0.061236
d_2,200707,0.001030
d_2,200708,-0.042203
d_2,200709,0.000000
d_2,200710,0.006986
d_2,200711,-0.018676
d_2,200712,0.001087
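What I need is a rolling-window correlation between the two value columns, matched on date, for every combination of id and id_2.
Basically, my output should be: "id vs id_2", date, corr
So for d1 vs f130, for 200706, I compute the correlation between the values of d1 and f130 over the 6 months going back from 200706 (200701 through 200706). The same goes for every pair.
Expected output: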
id_pair, date, value
d1_f130,200706,-0.375238392
d1_f130,200707,-0.667154011
d1_f130,200708,-0.636064899
d1_f130,200709,-0.672029012
d1_f130,200710,-0.653719992
d1_f130,200711,-0.802893705
d1_f130,200712,-0.03120143
d1_f131,200706,0.870717009
d1_f131,200707,0.61076152
d1_f131,200708,0.400632396
d1_f131,200709,0.05064842
d1_f131,200710,0.087102168
d1_f131,200711,-0.012306865
d1_f131,200712,0.05170204
d2_f130,200706,-0.170979922
d2_f130,200707,-0.15363222
d2_f130,200708,-0.089709021
d2_f130,200709,-0.227564277
d2_f130,200710,-0.252391258
d2_f130,200711,0.94878745
d2_f130,200712,0.619029635
d2_f131,200706,0.358385975
d2_f131,200707,0.952074283
d2_f131,200708,0.930805345
d2_f131,200709,0.919101445
d2_f131,200710,0.904473885
d2_f131,200711,0.47080201
d2_f131,200712,0.640334152
Iterating over ids and dates with for loops takes days (roughly 15,000 values of id, 300 of id_2, and 300 dates).
Any ideas?
Answer 0 (score: 2)
Suppose you have two data frames like the following:
import pandas as pd
# Column names are changed to keep the problem simple.
# id1, d1, v1, id2 and v2 are placeholder sequences of equal length.
df1 = pd.DataFrame({'id1':id1, 'date':d1,'value1':v1})
df2 = pd.DataFrame({'id2':id2, 'date':d1,'value2':v2})
Then you can merge the two into a single df, like this:
df = df1.merge(df2,how='outer',on='date')
print(df)  # for example:
                 date  id1     value1  id2     value2
0 2008-01-01 13:30:00    0  59.727276    5  49.423527
1 2008-01-01 13:30:00    0  59.727276    4  49.659602
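To make the effect of the merge concrete, here is a minimal sketch on hypothetical toy data (the ids 'a', 'x', 'y' and all values are made up for illustration). Merging on date pairs every id1 row with every id2 row that shares the same date, so each (id1, id2) combination ends up with one row per date.

```python
import pandas as pd

# Hypothetical toy frames: one id on the left, two ids on the right, two dates.
df1 = pd.DataFrame({'id1': ['a', 'a'],
                    'date': [200701, 200702],
                    'value1': [0.1, 0.2]})
df2 = pd.DataFrame({'id2': ['x', 'x', 'y', 'y'],
                    'date': [200701, 200702, 200701, 200702],
                    'value2': [1.0, 2.0, 3.0, 4.0]})

# Rows with the same date are combined, one row per (id1, id2, date).
df = df1.merge(df2, how='outer', on='date')
print(df)

# Count the rows that each (id1, id2) pair received.
pairs = df.groupby(['id1', 'id2']).size()
print(pairs)
```

One consequence worth keeping in mind: the merged frame has roughly (number of id1) x (number of id2) x (number of dates) rows. With the sizes quoted in the question (about 15,000 x 300 x 300) that would be on the order of a billion rows, so in practice it may be necessary to merge and process df1 in chunks of ids.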
Now group by the id pair and apply your rolling correlation:
dfs = []  # collect the result of each (id1, id2) group
for n, g in df.groupby(['id1','id2']):
    _df = pd.DataFrame({'ids':[n]*len(g.date),'date':g.date})
    # compute the rolling correlation between the two value series
    _df['corr'] = g.value1.rolling(10).corr(g.value2)
    dfs.append(_df)
# concatenate the per-group data frames into a single one
final_df = pd.concat(dfs, ignore_index=True)
print(final_df)  # show the result, for example:
# Note: the first 9 rows of each ids pair are NaN because the rolling window is 10.
                  date     ids      corr
0  2008-01-01 13:34:00  (0, 0)       NaN
1  2008-01-01 13:34:00  (0, 0)       NaN
2  2008-01-01 13:35:00  (0, 0)       NaN
3  2008-01-01 13:37:00  (0, 0)       NaN
4  2008-01-01 13:37:00  (0, 0)       NaN
5  2008-01-01 13:37:00  (0, 0)       NaN
6  2008-01-01 13:38:00  (0, 0)       NaN
7  2008-01-01 13:38:00  (0, 0)       NaN
8  2008-01-01 13:40:00  (0, 0)       NaN
9  2008-01-01 13:41:00  (0, 0)  0.423877
10 2008-01-01 13:42:00  (0, 0)  0.555128
Note: the ids column contains (int, int) tuples.
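The leading NaN values come from the window size: `rolling(w)` only produces a value once `w` observations are available, so the first `w - 1` rows of each group are empty. A minimal sketch with made-up numbers and a window of 3 (both are assumptions for illustration, not values from the question):

```python
import pandas as pd

# Made-up series, only to show where the NaN values appear.
s1 = pd.Series([1.0, 2.0, 4.0, 3.0, 5.0])
s2 = pd.Series([2.0, 1.0, 3.0, 5.0, 4.0])

window = 3
r = s1.rolling(window).corr(s2)
print(r)  # the first window - 1 = 2 entries are NaN
```

For the question's 6-month window the same reasoning gives 5 leading NaN rows per pair, which is why the expected output starts at 200706.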
Update:
You can rename the headers of your example to fit this answer, like so:
df1.columns = ['id1','date','value1']
df2.columns = ['id2','date','value2']
To make ids match your expected output, you can replace
'ids':[n]*len(g.date)
with
'ids':['_'.join(n)]*len(g.date)
for example (this works when the ids are strings).
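Putting the pieces together, here is a self-contained sketch of the approach above applied to part of the question's sample data (f130 against d_1 and d_2). The window of 6, the `id_pair` label format and the final `dropna` are my assumptions based on the expected output in the question; the answer itself uses a window of 10 and keeps the NaN rows.

```python
import pandas as pd

# Sample values copied from the question (200701 through 200712).
dates = [200701 + i for i in range(12)]
f130 = [0.016196, -0.027798, -0.014868, 0.017801, -0.032700, 0.049529,
        0.011610, -0.008145, -0.001493, 0.009719, -0.007775, -0.007835]
d_1 = [0.026316, -0.004487, -0.027044, -0.035076, 0.010288, -0.031908,
       -0.001403, 0.009831, 0.040334, 0.018048, 0.011819, 0.000000]
d_2 = [0.000000, 0.028553, -0.037224, 0.041284, 0.151038, 0.061236,
       0.001030, -0.042203, 0.000000, 0.006986, -0.018676, 0.001087]

df1 = pd.DataFrame({'id1': 'f130', 'date': dates, 'value1': f130})
df2 = pd.DataFrame({'id2': ['d_1'] * 12 + ['d_2'] * 12,
                    'date': dates * 2,
                    'value2': d_1 + d_2})

WINDOW = 6  # assumption: 6 months, as described in the question

# Merge on date, then sort so each group is in chronological order.
df = df1.merge(df2, how='outer', on='date')
df = df.sort_values(['id1', 'id2', 'date'])

dfs = []
for (i1, i2), g in df.groupby(['id1', 'id2']):
    out = pd.DataFrame({'id_pair': f'{i2}_{i1}', 'date': g['date']})
    # Rolling correlation of the two value columns within this pair.
    out['corr'] = g['value1'].rolling(WINDOW).corr(g['value2'])
    dfs.append(out)

# Drop the first WINDOW - 1 rows of each pair, which are NaN.
final_df = pd.concat(dfs, ignore_index=True).dropna(subset=['corr'])
print(final_df)
```

For d_1 vs f130 at 200706 this gives roughly -0.3752, in line with the first row of the expected output. The loop still visits every pair, so for the full problem size it is a starting point rather than a guaranteed fix for the run time.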