I have the following two dataframes:
main_df:
value feed_id created_at
0 0.0 1010077.0 2019-03-06 07:38:18-05:00
1 1.0 1010077.0 2019-03-06 07:39:26-05:00
2 1.0 1010077.0 2019-03-06 07:40:33-05:00
3 1.0 1010077.0 2019-03-06 07:41:41-05:00
4 1.0 1010077.0 2019-03-06 07:42:49-05:00
5 1.0 1010077.0 2019-03-06 07:43:56-05:00
aux_df:
value feed_id created_at
0 20.298492 1009408.0 2019-03-06 07:35:33-05:00
1 20.315002 1009408.0 2019-03-06 07:36:34-05:00
2 20.315002 1009408.0 2019-03-06 07:37:36-05:00
3 20.359650 1009408.0 2019-03-06 07:38:36-05:00
4 20.359650 1009408.0 2019-03-06 07:39:37-05:00
5 20.383179 1009408.0 2019-03-06 07:40:38-05:00
6 20.383179 1009408.0 2019-03-06 07:41:38-05:00
7 20.449524 1009408.0 2019-03-06 07:42:39-05:00
8 20.449524 1009408.0 2019-03-06 07:43:40-05:00
9 20.521912 1009408.0 2019-03-06 07:44:41-05:00
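In this case I would like the following result (final_df): the "time axis" described by the 'created_at' column of aux_df should be fully incorporated into main_df, regardless of whether the two columns share any values. For timestamps common to both, I would compare the whole timestamp while ignoring the seconds part (note how all values line up on the same date, hour and minute, but not on the seconds).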
value feed_id created_at
0 nan nan 2019-03-06 07:35:33-05:00
1 nan nan 2019-03-06 07:36:34-05:00
2 nan nan 2019-03-06 07:37:36-05:00
3 0.0 1010077.0 2019-03-06 07:38:36-05:00
4 1.0 1010077.0 2019-03-06 07:39:37-05:00
5 1.0 1010077.0 2019-03-06 07:40:38-05:00
6 1.0 1010077.0 2019-03-06 07:41:38-05:00
7 1.0 1010077.0 2019-03-06 07:42:39-05:00
8 1.0 1010077.0 2019-03-06 07:43:40-05:00
9 nan nan 2019-03-06 07:44:41-05:00
Strategies I tried without success:
Using merge.
main_df['created_at_2'] = main_df.created_at.dt.round('min')
aux_df['created_at_2'] = aux_df.created_at.dt.round('min')
final_df = pd.merge(main_df, aux_df, on=['created_at_2'], how='inner')
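However, as this example shows, this approach is not reliable. Rounding a timestamp such as 2019-03-06 07:40:33-05:00 gives minute 41 rather than 40, and I need a continuous minute-by-minute series.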
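To make the rounding problem concrete, here is a minimal sketch using one timestamp from each frame that falls in the same minute (07:38:18 from main_df and 07:38:36 from aux_df). The variable names are illustrative only. Rounding sends the two timestamps to different minutes, while flooring keeps them together.

```python
import pandas as pd

# One timestamp from main_df and one from aux_df, both within minute 07:38.
ts = pd.Series(pd.to_datetime([
    "2019-03-06 07:38:18-05:00",  # from main_df
    "2019-03-06 07:38:36-05:00",  # from aux_df
]))

rounded = ts.dt.round("min")  # nearest minute: 18 s rounds down, 36 s rounds up
floored = ts.dt.floor("min")  # always truncates the seconds

print(rounded.dt.minute.tolist())  # [38, 39] -> the keys no longer match
print(floored.dt.minute.tolist())  # [38, 38] -> the keys match
```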
I could reformat the timestamp axis with the following commands:
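# Store the minute-resolution string as the merge key (the result of map must be assigned)
main_df['created_at_2'] = main_df.created_at.map(lambda t: t.strftime('%Y-%m-%d %H:%M'))
aux_df['created_at_2'] = aux_df.created_at.map(lambda t: t.strftime('%Y-%m-%d %H:%M'))
final_df = pd.merge(main_df, aux_df, on=['created_at_2'], how='inner')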
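But I am not sure whether this approach is robust, and I would still need to include the values of the 'created_at' column that are not common to both frames. So, is there a more appropriate way to achieve this?
Thanks!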
Answer 0 (score: 1)
One idea is to use merge_asof, but the last row differs from the desired output:
main_df['created_at'] = pd.to_datetime(main_df['created_at'])
aux_df['created_at'] = pd.to_datetime(aux_df['created_at'])
df = pd.merge_asof(aux_df[['created_at']], main_df, on=['created_at'])
print (df)
created_at value feed_id
0 2019-03-06 07:35:33-05:00 NaN NaN
1 2019-03-06 07:36:34-05:00 NaN NaN
2 2019-03-06 07:37:36-05:00 NaN NaN
3 2019-03-06 07:38:36-05:00 0.0 1010077.0
4 2019-03-06 07:39:37-05:00 1.0 1010077.0
5 2019-03-06 07:40:38-05:00 1.0 1010077.0
6 2019-03-06 07:41:38-05:00 1.0 1010077.0
7 2019-03-06 07:42:39-05:00 1.0 1010077.0
8 2019-03-06 07:43:40-05:00 1.0 1010077.0
9 2019-03-06 07:44:41-05:00 1.0 1010077.0
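The last row differs because merge_asof, with its default backward direction, pairs each aux_df timestamp with the most recent main_df row at or before it, however old that row is. The sketch below rebuilds the sample data and carries main_df's own timestamp through the merge so the pairing is visible. The `stamps` helper and the `matched_at` and `gap` columns are hypothetical names added for illustration.

```python
import pandas as pd

def stamps(times):
    # Hypothetical helper: build tz-aware timestamps for the sample day.
    return pd.to_datetime(["2019-03-06 " + t + "-05:00" for t in times])

main_df = pd.DataFrame({
    "value": [0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "created_at": stamps(["07:38:18", "07:39:26", "07:40:33",
                          "07:41:41", "07:42:49", "07:43:56"]),
})
aux_df = pd.DataFrame({
    "created_at": stamps(["07:35:33", "07:36:34", "07:37:36", "07:38:36",
                          "07:39:37", "07:40:38", "07:41:38", "07:42:39",
                          "07:43:40", "07:44:41"]),
})

# Copy main_df's timestamp into a second column so it survives the merge.
main_tagged = main_df.assign(matched_at=main_df["created_at"])

# Default direction is 'backward': last main row with created_at <= aux created_at.
df = pd.merge_asof(aux_df, main_tagged, on="created_at")
df["gap"] = df["created_at"] - df["matched_at"]
print(df[["created_at", "matched_at", "gap", "value"]])
```

On this data, the aux_df row at 07:41:38 is paired with the main_df row from 07:40:33 (a 65 s gap), and the final row at 07:44:41 is paired with 07:43:56 (a 45 s gap). The values only look aligned because most of them are 1.0. merge_asof also accepts a tolerance argument that caps the gap, but since the unwanted final match is closer than some earlier matches, no single tolerance reproduces the desired output here.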
Another approach is to use Series.dt.floor instead of round:
main_df['created_at'] = pd.to_datetime(main_df['created_at'])
aux_df['created_at'] = pd.to_datetime(aux_df['created_at'])
main_df['created_at_2'] = main_df.created_at.dt.floor('min')
aux_df['created_at_2'] = aux_df.created_at.dt.floor('min')
df = pd.merge(aux_df[['created_at_2']], main_df, on=['created_at_2'], how='left')
print (df)
created_at_2 value feed_id created_at
0 2019-03-06 07:35:00-05:00 NaN NaN NaT
1 2019-03-06 07:36:00-05:00 NaN NaN NaT
2 2019-03-06 07:37:00-05:00 NaN NaN NaT
3 2019-03-06 07:38:00-05:00 0.0 1010077.0 2019-03-06 07:38:18-05:00
4 2019-03-06 07:39:00-05:00 1.0 1010077.0 2019-03-06 07:39:26-05:00
5 2019-03-06 07:40:00-05:00 1.0 1010077.0 2019-03-06 07:40:33-05:00
6 2019-03-06 07:41:00-05:00 1.0 1010077.0 2019-03-06 07:41:41-05:00
7 2019-03-06 07:42:00-05:00 1.0 1010077.0 2019-03-06 07:42:49-05:00
8 2019-03-06 07:43:00-05:00 1.0 1010077.0 2019-03-06 07:43:56-05:00
9 2019-03-06 07:44:00-05:00 NaN NaN NaT
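The floor-based result above keeps main_df's timestamps, whereas the question asks for aux_df's time axis with columns value, feed_id, created_at. A possible variation is sketched below under the assumption that main_df has at most one row per minute (otherwise the left merge would duplicate aux_df rows). The `stamps` helper and the temporary `minute` column are hypothetical names.

```python
import pandas as pd

def stamps(times):
    # Hypothetical helper: build tz-aware timestamps for the sample day.
    return pd.to_datetime(["2019-03-06 " + t + "-05:00" for t in times])

main_df = pd.DataFrame({
    "value": [0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "feed_id": 1010077.0,
    "created_at": stamps(["07:38:18", "07:39:26", "07:40:33",
                          "07:41:41", "07:42:49", "07:43:56"]),
})
aux_df = pd.DataFrame({
    "created_at": stamps(["07:35:33", "07:36:34", "07:37:36", "07:38:36",
                          "07:39:37", "07:40:38", "07:41:38", "07:42:39",
                          "07:43:40", "07:44:41"]),
})

# Left side: aux_df's timestamps plus a minute-resolution join key.
left = aux_df[["created_at"]].assign(minute=aux_df["created_at"].dt.floor("min"))
# Right side: main_df's data columns plus the same kind of key.
right = main_df[["value", "feed_id"]].assign(
    minute=main_df["created_at"].dt.floor("min")
)

# Left merge keeps every aux_df timestamp; unmatched minutes become NaN.
final_df = left.merge(right, on="minute", how="left")
final_df = final_df[["value", "feed_id", "created_at"]]
print(final_df)
```

Here the first three rows and the last row come out as NaN in value and feed_id, while created_at retains aux_df's original timestamps, seconds included.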