I have two DataFrames that hold different data measured at different frequencies, as in these CSV samples:
df1:
i,m1,m2,t
0,0.556529,6.863255,43564.844
1,0.5565576199999884,6.86327749999999,43564.863999999994
2,0.5565559400000003,6.8632764,43564.884
3,0.5565699799999941,6.863286799999996,43564.903999999995
4,0.5565570200000007,6.863277200000001,43564.924
5,0.5565316400000097,6.863257100000007,43564.944
...
df2:
i,m3,m4,t
0,306.81162500000596,-1.2126870045404683,43564.878125
1,306.86175000000725,-1.1705838272666433,43564.928250000004
2,306.77552454544787,-1.1240195386446195,43564.97837499999
3,306.85900545454086,-1.0210345363692084,43565.0285
4,306.8354250000052,-1.0052431772666657,43565.078625
5,306.88397499999286,-0.9468344809917896,43565.12875
...
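I would like to obtain a single df that holds all the measurements from both dfs at the timestamps of the first df (the one that acquires data at the lower frequency).
I tried using a for loop to average the df2 measurements that fall between two consecutive timestamps of df1, but it is very slow.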
Answer 0 (score: 1)
IIUC, column i is the index, and you want to put df2['t'] into bins and average the other columns. So you can use pd.cut:
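import numpy as np
import pandas as pd

# bin df2.t into the half-open intervals [t_k, t_{k+1}) defined by df1.t
groups = pd.cut(df2.t, bins=list(df1.t) + [np.inf],
                right=False,
                labels=df1['t'])
# cols to copy
cols = [col for col in df2.columns if col != 't']
# group by bin and get the average; observed=False keeps empty bins as NaN rows
new_df = (df2[cols].groupby(groups, observed=False)
          .mean()
          .reset_index()
          )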
new_df is then:
           t          m3        m4
0  43564.844         NaN       NaN
1  43564.864  306.811625 -1.212687
2  43564.884         NaN       NaN
3  43564.904         NaN       NaN
4  43564.924  306.861750 -1.170584
5  43564.944  306.838482 -1.024283
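To see why only some rows receive values, it can help to look at the bin assignment on its own. This is a minimal sketch, not part of the original answer: the names edges, samples, and assigned are made up for illustration, and the timestamps are a few rounded values in the same range as the question's data.

```python
import numpy as np
import pandas as pd

# Three df1-style timestamps act as the left edges of the bins (illustrative values).
edges = [43564.844, 43564.864, 43564.884]
# df2-style timestamps to place into those bins (illustrative values).
samples = pd.Series([43564.80, 43564.85, 43564.878125, 43564.90])

# right=False makes every bin half-open: [edge_k, edge_{k+1}).
# Appending np.inf turns the last edge into an open-ended bin.
assigned = pd.cut(samples, bins=edges + [np.inf], right=False, labels=edges)

# A value before the first edge falls outside every bin and becomes NaN;
# every other value is labelled with the left edge of its bin.
print(assigned.tolist())  # [nan, 43564.844, 43564.864, 43564.884]
```

Each df2 row is therefore attributed to the most recent df1 timestamp at or before it, which is why df1 timestamps with no df2 sample in their interval end up as NaN after the groupby.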
You can then merge it with df1 on t:
df1.merge(new_df, on='t', how='left')
which gives:
         m1        m2        t          m3        m4
0  0.556529  6.863255  43564.8         NaN       NaN
1  0.556558  6.863277  43564.9  306.811625 -1.212687
2  0.556556  6.863276  43564.9         NaN       NaN
3  0.556570  6.863287  43564.9         NaN       NaN
4  0.556557  6.863277  43564.9  306.861750 -1.170584
5  0.556532  6.863257  43564.9  306.838482 -1.024283
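For anyone who wants to reproduce this, here is a self-contained sketch of the whole approach using only the six sample rows from the question. It is an illustration under a few assumptions rather than a definitive implementation: the CSV text is loaded from strings with i as the index, the names df1_csv, df2_csv, value_cols, and merged are my own, and the explicit cast of t back to float is an extra step I added so that the merge key has the same dtype on both sides.

```python
import io

import numpy as np
import pandas as pd

# Sample rows copied from the question.
df1_csv = """i,m1,m2,t
0,0.556529,6.863255,43564.844
1,0.5565576199999884,6.86327749999999,43564.863999999994
2,0.5565559400000003,6.8632764,43564.884
3,0.5565699799999941,6.863286799999996,43564.903999999995
4,0.5565570200000007,6.863277200000001,43564.924
5,0.5565316400000097,6.863257100000007,43564.944
"""
df2_csv = """i,m3,m4,t
0,306.81162500000596,-1.2126870045404683,43564.878125
1,306.86175000000725,-1.1705838272666433,43564.928250000004
2,306.77552454544787,-1.1240195386446195,43564.97837499999
3,306.85900545454086,-1.0210345363692084,43565.0285
4,306.8354250000052,-1.0052431772666657,43565.078625
5,306.88397499999286,-0.9468344809917896,43565.12875
"""

df1 = pd.read_csv(io.StringIO(df1_csv), index_col="i")
df2 = pd.read_csv(io.StringIO(df2_csv), index_col="i")

# Label every df2 row with the df1 timestamp that opens its interval.
groups = pd.cut(df2["t"], bins=list(df1["t"]) + [np.inf],
                right=False, labels=df1["t"])

# Average every df2 column except the timestamp within each interval.
value_cols = [c for c in df2.columns if c != "t"]
new_df = df2[value_cols].groupby(groups, observed=False).mean().reset_index()

# The bin labels come back as a categorical column; cast to float (my addition).
new_df["t"] = new_df["t"].astype(float)

merged = df1.merge(new_df, on="t", how="left")
print(merged)
```

With these sample rows, the last df1 timestamp collects every remaining df2 row because its bin is open-ended; on the full data each interval would only hold the df2 samples that fall before the next df1 timestamp.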