I have two dataframes, which can be represented by the following MWE:
import pandas as pd
from datetime import datetime
import numpy as np
df_1 = pd.DataFrame(np.random.randn(9), columns=['A'], index=[
datetime(2015,1,1,19,30,1,20),
datetime(2015,1,1,20,30,2,12),
datetime(2015,1,1,21,30,3,50),
datetime(2015,1,1,22,30,5,43),
datetime(2015,1,1,22,30,52,11),
datetime(2015,1,1,23,30,54,8),
datetime(2015,1,1,23,40,14,2),
datetime(2015,1,1,23,41,13,33),
datetime(2015,1,1,23,50,21,32),
])
df_2 = pd.DataFrame(np.random.randn(9), columns=['B'], index=[
datetime(2015,1,1,18,30,1,20),
datetime(2015,1,1,21,0,2,12),
datetime(2015,1,1,21,31,3,50),
datetime(2015,1,1,22,34,5,43),
datetime(2015,1,1,22,35,52,11),
datetime(2015,1,1,23,0,54,8),
datetime(2015,1,1,23,41,14,2),
datetime(2015,1,1,23,42,13,33),
datetime(2015,1,1,23,56,21,32),
])
I want to merge the two dataframes into one. I know I can do this with the following code:
In [21]: df_1.join(df_2, how='outer')
Out[21]:
A B
2015-01-01 18:30:01.000020 NaN -1.411907
2015-01-01 19:30:01.000020 0.109913 NaN
2015-01-01 20:30:02.000012 -0.440529 NaN
2015-01-01 21:00:02.000012 NaN -1.277403
2015-01-01 21:30:03.000050 -0.194020 NaN
2015-01-01 21:31:03.000050 NaN -0.042259
2015-01-01 22:30:05.000043 1.445220 NaN
2015-01-01 22:30:52.000011 -0.341176 NaN
2015-01-01 22:34:05.000043 NaN 0.905912
2015-01-01 22:35:52.000011 NaN -0.167559
2015-01-01 23:00:54.000008 NaN 1.289961
2015-01-01 23:30:54.000008 -0.929973 NaN
2015-01-01 23:40:14.000002 0.077622 NaN
2015-01-01 23:41:13.000033 -1.688719 NaN
2015-01-01 23:41:14.000002 NaN 0.178439
2015-01-01 23:42:13.000033 NaN -0.911314
2015-01-01 23:50:21.000032 -0.750953 NaN
2015-01-01 23:56:21.000032 NaN 0.092930
This is not what I want to achieve, though.
I want to merge df_2 with df_1 on df_1's time-series index, so that the value taken from column 'B' is the one whose timestamp is closest to the corresponding index value in df_1.
I have previously achieved this with iterrows and relativedelta, as follows:
for i, row in df_1.iterrows():
    df_2_temp = df_2.copy()
    df_2_temp['Timestamp'] = df_2_temp.index
    df_2_temp['Time Delta'] = abs(df_2_temp['Timestamp'] - row.name).apply(lambda x: x.seconds)
    closest_value = df_2_temp.sort_values('Time Delta').iloc[0]['B']
    df_1.loc[row.name, 'B'] = closest_value
This works, but it is slow, and the dataframes I want to run it on are very large.
Is there a faster solution, perhaps something built into pandas?
Answer 0 (score: 0):
This might be faster, even though apply is still a loop under the hood.

def find_idxmin(dt):
    return (df_2.index - dt).to_series().reset_index(drop=True).abs().idxmin()

df_1.apply(lambda row: df_2.iloc[find_idxmin(row.name)], axis=1)

I convert the DatetimeIndex to a Series so that I can apply abs and idxmin, and I reset the index so that idxmin returns a row number that I can feed to iloc.
Edit: this appears to be just as fast (5 ms) as the numpy-based answer linked in the comments.
Your solution, for comparison, runs in 30 ms (rather than in 5).
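The numpy-based answer referred to above is not reproduced here, but a fully vectorized nearest-timestamp lookup can be sketched along these lines (an illustration only, not the linked answer; it builds a full matrix of pairwise time differences, so it trades memory for speed):

import numpy as np

# For every df_1 timestamp, find the position of the df_2 timestamp with the
# smallest absolute time difference, then pull the matching 'B' values.
nearest_pos = np.abs(df_2.index.values[:, None] - df_1.index.values).argmin(axis=0)
df_1['B'] = df_2['B'].to_numpy()[nearest_pos]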
Answer 1 (score: 0):
Pandas now provides the functionality that I believe you are looking for:
pd.merge_asof(df1, df2, direction='nearest')
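Applied to the question's frames, which are keyed on their DatetimeIndex rather than on a column, this would look roughly like the sketch below (merge_asof requires both sides to be sorted on the join key, which these indexes already are):

# Match each df_1 row with the df_2 row whose index is nearest in time.
merged = pd.merge_asof(df_1, df_2,
                       left_index=True, right_index=True,
                       direction='nearest')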
Example: I have two devices, one DataFrame per device, each with a date column of type 'datetime64[ns, UTC]'.
t_df[['dt', 'mode', 'state']]:
dt mode state
0 2020-09-23 22:10:36.508000+00:00 1 0
1 2020-09-23 22:10:57.463000+00:00 1 0
2 2020-09-23 22:11:18.815000+00:00 1 0
3 2020-09-23 22:12:16.806000+00:00 1 0
4 2020-09-23 22:12:22.512000+00:00 1 0
5 2020-09-23 22:12:43.469000+00:00 1 0
6 2020-09-23 22:13:04.776000+00:00 1 0
7 2020-09-23 22:13:25.948000+00:00 1 0
8 2020-09-23 22:13:47.223000+00:00 1 0
v_df[['dt', 'temperature', 'pressure']]:
dt temperature pressure
0 2020-09-23 22:12:04.204000+00:00 74.85 1004.50
1 2020-09-23 22:12:18.203000+00:00 74.82 1004.67
2 2020-09-23 22:12:30.358000+00:00 74.85 1004.71
3 2020-09-23 22:12:44.601000+00:00 74.82 1004.46
4 2020-09-23 22:12:59.158000+00:00 74.82 1004.67
5 2020-09-23 22:13:10.443000+00:00 74.82 1004.67
6 2020-09-23 22:13:24.577000+00:00 74.82 1004.67
7 2020-09-23 22:13:37.544000+00:00 74.82 1004.67
8 2020-09-23 22:13:50.106000+00:00 74.78 1004.63
9 2020-09-23 22:14:03.377000+00:00 74.78 1004.42
I used:
new_df = pd.merge_asof(v_df[['dt', 'temperature', 'pressure']], t_df[['dt', 'mode', 'state']], direction='nearest')
and my result:
dt temperature pressure mode state
0 2020-09-23 22:12:04.204000+00:00 74.85 1004.50 1 0
1 2020-09-23 22:12:18.203000+00:00 74.82 1004.67 1 0
2 2020-09-23 22:12:30.358000+00:00 74.85 1004.71 1 0
3 2020-09-23 22:12:44.601000+00:00 74.82 1004.46 1 0
4 2020-09-23 22:12:59.158000+00:00 74.82 1004.67 1 0
5 2020-09-23 22:13:10.443000+00:00 74.82 1004.67 1 0
6 2020-09-23 22:13:24.577000+00:00 74.82 1004.67 1 0
7 2020-09-23 22:13:37.544000+00:00 74.82 1004.67 1 0
8 2020-09-23 22:13:50.106000+00:00 74.78 1004.63 1 0
9 2020-09-23 22:14:03.377000+00:00 74.78 1004.42 1 0
This example shows only the last 10 rows of each DataFrame; their starting points are only a couple of minutes apart. Below are the last 10 rows after running on the full DataFrames (note: the 'date' and 'time' columns were added from df1 and df2 respectively during the merge, for reference):
combo_df.iloc[-10:][['dt', 'date', 'time', 'pressure', 'temperature', 'mode', 'state']]
dt date time pressure temperature mode state
4440 2020-09-23 22:12:04.204000+00:00 2020-09-23T22:12:04.204Z 2020-09-23T22:12:16.806Z 1004.50 74.85 1 0
4441 2020-09-23 22:12:18.203000+00:00 2020-09-23T22:12:18.203Z 2020-09-23T22:12:16.806Z 1004.67 74.82 1 0
4442 2020-09-23 22:12:30.358000+00:00 2020-09-23T22:12:30.358Z 2020-09-23T22:12:22.512Z 1004.71 74.85 1 0
4443 2020-09-23 22:12:44.601000+00:00 2020-09-23T22:12:44.601Z 2020-09-23T22:12:43.469Z 1004.46 74.82 1 0
4444 2020-09-23 22:12:59.158000+00:00 2020-09-23T22:12:59.158Z 2020-09-23T22:13:04.776Z 1004.67 74.82 1 0
4445 2020-09-23 22:13:10.443000+00:00 2020-09-23T22:13:10.443Z 2020-09-23T22:13:04.776Z 1004.67 74.82 1 0
4446 2020-09-23 22:13:24.577000+00:00 2020-09-23T22:13:24.577Z 2020-09-23T22:13:25.948Z 1004.67 74.82 1 0
4447 2020-09-23 22:13:37.544000+00:00 2020-09-23T22:13:37.544Z 2020-09-23T22:13:47.223Z 1004.67 74.82 1 0
4448 2020-09-23 22:13:50.106000+00:00 2020-09-23T22:13:50.106Z 2020-09-23T22:13:47.223Z 1004.63 74.78 1 0
4449 2020-09-23 22:14:03.377000+00:00 2020-09-23T22:14:03.377Z 2020-09-23T22:14:08.981Z 1004.42 74.78 1 0
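As the last table shows, direction='nearest' will happily pair rows that are several seconds apart. If matches beyond a certain gap should be dropped instead, merge_asof also accepts a tolerance argument; a sketch using the example's v_df and t_df (the 30-second cutoff is an arbitrary illustration):

# Rows with no t_df timestamp within 30 seconds get NaN instead of a match.
new_df = pd.merge_asof(v_df[['dt', 'temperature', 'pressure']], t_df[['dt', 'mode', 'state']],
                       on='dt', direction='nearest',
                       tolerance=pd.Timedelta('30s'))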