Pandas: inserting the datetime timeline from one dataframe into another

Time: 2019-04-06 05:54:53

Tags: python pandas dataframe merge

I have the following two dataframes:

main_df:

    value    feed_id                created_at  
0     0.0  1010077.0 2019-03-06 07:38:18-05:00   
1     1.0  1010077.0 2019-03-06 07:39:26-05:00   
2     1.0  1010077.0 2019-03-06 07:40:33-05:00   
3     1.0  1010077.0 2019-03-06 07:41:41-05:00   
4     1.0  1010077.0 2019-03-06 07:42:49-05:00   
5     1.0  1010077.0 2019-03-06 07:43:56-05:00   

aux_df:

       value    feed_id                created_at
0  20.298492  1009408.0 2019-03-06 07:35:33-05:00
1  20.315002  1009408.0 2019-03-06 07:36:34-05:00
2  20.315002  1009408.0 2019-03-06 07:37:36-05:00
3  20.359650  1009408.0 2019-03-06 07:38:36-05:00
4  20.359650  1009408.0 2019-03-06 07:39:37-05:00
5  20.383179  1009408.0 2019-03-06 07:40:38-05:00
6  20.383179  1009408.0 2019-03-06 07:41:38-05:00
7  20.449524  1009408.0 2019-03-06 07:42:39-05:00
8  20.449524  1009408.0 2019-03-06 07:43:40-05:00
9  20.521912  1009408.0 2019-03-06 07:44:41-05:00

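From these I would like the following (final_df): the "timeline" described by the 'created_at' column of aux_df should be merged into main_df in full, whether or not a given timestamp has a counterpart in both frames. Two timestamps count as a common value when they agree on everything except the seconds (note how all values line up on the same date, hour, and minute, but not on the seconds).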

       value    feed_id                created_at
0        nan        nan 2019-03-06 07:35:33-05:00
1        nan        nan 2019-03-06 07:36:34-05:00
2        nan        nan 2019-03-06 07:37:36-05:00
3        0.0  1010077.0 2019-03-06 07:38:36-05:00
4        1.0  1010077.0 2019-03-06 07:39:37-05:00
5        1.0  1010077.0 2019-03-06 07:40:38-05:00
6        1.0  1010077.0 2019-03-06 07:41:38-05:00
7        1.0  1010077.0 2019-03-06 07:42:39-05:00
8        1.0  1010077.0 2019-03-06 07:43:40-05:00
9        nan        nan 2019-03-06 07:44:41-05:00

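The strategy I tried, without success:

  1. Create a new column named 'created_at_2' in both dataframes by rounding each timestamp to the minute, so that the seconds part of the timestamp is discarded before merging.
  2. Use merge.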

    main_df['created_at_2'] = main_df.created_at.dt.round('min')
    aux_df['created_at_2'] = aux_df.created_at.dt.round('min')
    final_df = pd.merge(main_df, aux_df, on=['created_at_2'], how='inner')

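However, as this example shows, the approach is not reliable. Rounding a timestamp such as 2019-03-06 07:40:33-05:00 gives minute 41 rather than minute 40, and I need a continuous minute-by-minute series.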
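The rounding problem can be reproduced on individual timestamps. This is a minimal sketch using values copied from the tables above; the variable names (`ts`, `main_ts`, `aux_ts`, `main_key`, `aux_key`) are only illustrative.

```python
import pandas as pd

# Timestamp quoted in the text above: 33 s rounds up to the next minute.
ts = pd.Timestamp("2019-03-06 07:40:33-05:00")
print(ts.round("min"))  # 2019-03-06 07:41:00-05:00

# Two timestamps from the same minute, one from main_df and one from aux_df.
main_ts = pd.Timestamp("2019-03-06 07:38:18-05:00")
aux_ts = pd.Timestamp("2019-03-06 07:38:36-05:00")

main_key = main_ts.round("min")  # 18 s rounds down to 07:38
aux_key = aux_ts.round("min")    # 36 s rounds up to 07:39

# The two rows belong to the same minute but end up with different merge keys.
print(main_key == aux_key)  # False
```

So rows that should pair up can receive different keys whenever their seconds fall on opposite sides of the 30-second mark.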

Alternatively, I could reformat the timestamps with the following:

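# Truncate each timestamp to the minute by formatting it as a string,
# and store the result so the merge key 'created_at_2' actually exists.
main_df['created_at_2'] = main_df.created_at.map(lambda t: t.strftime('%Y-%m-%d %H:%M'))
aux_df['created_at_2'] = aux_df.created_at.map(lambda t: t.strftime('%Y-%m-%d %H:%M'))
final_df = pd.merge(main_df, aux_df, on=['created_at_2'], how='inner')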

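But I am not sure that approach is robust, and I would still need to include the 'created_at' values that are not common to both frames. So, is there a more appropriate way to achieve this?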

Thanks!

1 Answer:

Answer 0: (score: 1)

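One idea is to use merge_asof, but the last row differs from the expected output, because by default merge_asof matches each row to the most recent earlier (or equal) key: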

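import pandas as pd

# Make sure both key columns are datetimes (merge_asof also needs them sorted).
main_df['created_at'] = pd.to_datetime(main_df['created_at'])
aux_df['created_at'] = pd.to_datetime(aux_df['created_at'])

# Keep the full aux_df timeline and attach the latest earlier main_df row.
df = pd.merge_asof(aux_df[['created_at']], main_df, on='created_at')
print(df)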
                 created_at  value    feed_id
0 2019-03-06 07:35:33-05:00    NaN        NaN
1 2019-03-06 07:36:34-05:00    NaN        NaN
2 2019-03-06 07:37:36-05:00    NaN        NaN
3 2019-03-06 07:38:36-05:00    0.0  1010077.0
4 2019-03-06 07:39:37-05:00    1.0  1010077.0
5 2019-03-06 07:40:38-05:00    1.0  1010077.0
6 2019-03-06 07:41:38-05:00    1.0  1010077.0
7 2019-03-06 07:42:39-05:00    1.0  1010077.0
8 2019-03-06 07:43:40-05:00    1.0  1010077.0
9 2019-03-06 07:44:41-05:00    1.0  1010077.0
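If the last row should stay empty, one possible variation is to match on the nearest timestamp and cap the allowed gap. The sketch below rebuilds the sample frames so it runs on its own. The `direction="nearest"` setting and the 30-second `tolerance` are assumptions chosen for this sample data, not part of the original answer, and nearest-timestamp matching is not the same rule as "same minute" matching.

```python
import pandas as pd

main_df = pd.DataFrame({
    "value": [0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "feed_id": [1010077.0] * 6,
    "created_at": pd.to_datetime([
        "2019-03-06 07:38:18-05:00", "2019-03-06 07:39:26-05:00",
        "2019-03-06 07:40:33-05:00", "2019-03-06 07:41:41-05:00",
        "2019-03-06 07:42:49-05:00", "2019-03-06 07:43:56-05:00",
    ]),
})
aux_df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2019-03-06 07:35:33-05:00", "2019-03-06 07:36:34-05:00",
        "2019-03-06 07:37:36-05:00", "2019-03-06 07:38:36-05:00",
        "2019-03-06 07:39:37-05:00", "2019-03-06 07:40:38-05:00",
        "2019-03-06 07:41:38-05:00", "2019-03-06 07:42:39-05:00",
        "2019-03-06 07:43:40-05:00", "2019-03-06 07:44:41-05:00",
    ]),
})

# Assumption: a main_df row is accepted only if it lies within 30 seconds
# of the aux_df timestamp, in either direction.
df = pd.merge_asof(
    aux_df[["created_at"]],
    main_df,
    on="created_at",
    direction="nearest",
    tolerance=pd.Timedelta("30s"),
)
print(df)
```

With this sample the first three rows and the last row have no main_df row within 30 seconds, so their value and feed_id stay NaN. A different sampling interval would need a different tolerance.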

Another approach is to use Series.dt.floor instead of round:

main_df['created_at'] = pd.to_datetime(main_df['created_at'])
aux_df['created_at'] = pd.to_datetime(aux_df['created_at'])
# floor drops the seconds, so timestamps from the same minute share a key.
main_df['created_at_2'] = main_df.created_at.dt.floor('min')
aux_df['created_at_2'] = aux_df.created_at.dt.floor('min')

# A left merge keeps every minute of the aux_df timeline.
df = pd.merge(aux_df[['created_at_2']], main_df, on=['created_at_2'], how='left')
print(df)
               created_at_2  value    feed_id                created_at
0 2019-03-06 07:35:00-05:00    NaN        NaN                       NaT
1 2019-03-06 07:36:00-05:00    NaN        NaN                       NaT
2 2019-03-06 07:37:00-05:00    NaN        NaN                       NaT
3 2019-03-06 07:38:00-05:00    0.0  1010077.0 2019-03-06 07:38:18-05:00
4 2019-03-06 07:39:00-05:00    1.0  1010077.0 2019-03-06 07:39:26-05:00
5 2019-03-06 07:40:00-05:00    1.0  1010077.0 2019-03-06 07:40:33-05:00
6 2019-03-06 07:41:00-05:00    1.0  1010077.0 2019-03-06 07:41:41-05:00
7 2019-03-06 07:42:00-05:00    1.0  1010077.0 2019-03-06 07:42:49-05:00
8 2019-03-06 07:43:00-05:00    1.0  1010077.0 2019-03-06 07:43:56-05:00
9 2019-03-06 07:44:00-05:00    NaN        NaN                       NaT
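To get the layout requested in the question, with the original aux_df timestamps in 'created_at', the floored key can be used only for the merge and dropped afterwards. This is a sketch under that assumption: the helper column name `minute` and the final column order are illustrative choices, and the sample frames are rebuilt so the block runs on its own.

```python
import pandas as pd

main_df = pd.DataFrame({
    "value": [0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "feed_id": [1010077.0] * 6,
    "created_at": pd.to_datetime([
        "2019-03-06 07:38:18-05:00", "2019-03-06 07:39:26-05:00",
        "2019-03-06 07:40:33-05:00", "2019-03-06 07:41:41-05:00",
        "2019-03-06 07:42:49-05:00", "2019-03-06 07:43:56-05:00",
    ]),
})
aux_df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2019-03-06 07:35:33-05:00", "2019-03-06 07:36:34-05:00",
        "2019-03-06 07:37:36-05:00", "2019-03-06 07:38:36-05:00",
        "2019-03-06 07:39:37-05:00", "2019-03-06 07:40:38-05:00",
        "2019-03-06 07:41:38-05:00", "2019-03-06 07:42:39-05:00",
        "2019-03-06 07:43:40-05:00", "2019-03-06 07:44:41-05:00",
    ]),
})

# Hypothetical helper column: the timestamp truncated to the minute.
main_df["minute"] = main_df["created_at"].dt.floor("min")
aux_df["minute"] = aux_df["created_at"].dt.floor("min")

# Left merge on the minute key; only aux_df contributes 'created_at'.
merged = aux_df[["created_at", "minute"]].merge(
    main_df[["value", "feed_id", "minute"]], on="minute", how="left"
)

# Drop the helper key and restore the column order used in the question.
final_df = merged.drop(columns="minute")[["value", "feed_id", "created_at"]]
print(final_df)
```

If main_df ever holds two rows within the same minute, the left merge repeats the matching aux_df row once per match, so duplicates would need to be handled before or after merging.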