Comparing dates against ranges in a DataFrame and joining the matches

Time: 2017-12-15 22:30:55

Tags: python pandas

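I'm trying to match up events that change a customer's invoices within very specific time periods. I'm tempted to load this data into a couple of local MySQL tables, but I'm curious how to solve this specifically in pandas.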

Here is some fake data:

import numpy as np
import pandas as pd

data = {
    'user_id':['123','456','789','333'],
    'first_invoice_date':['2017-10-01','2017-03-01','2017-02-01','2017-08-01'],
    'third_invoice_date':['2017-12-31','2017-05-31','2017-04-30','2017-10-31']
}
df = pd.DataFrame(data,columns=['user_id','first_invoice_date','third_invoice_date'])

events = {
    'user_id':['123','123','456','789','789','101'],
    'event_type':['downgrade','cancel','refund','downgrade','cancel','discount'],
    'event_date':['2017-11-15','2017-12-08','2017-01-23','2017-02-15','2017-02-28','2017-04-05']
}
event_df = pd.DataFrame(events,columns=['user_id','event_type','event_date'])

df['first_invoice_date'] = pd.to_datetime(df['first_invoice_date'])
df['third_invoice_date'] = pd.to_datetime(df['third_invoice_date'])
event_df['event_date'] = pd.to_datetime(event_df['event_date'])

The sample data produces the following DataFrames:

In [2]: df
Out[2]:
  user_id first_invoice_date third_invoice_date
0     123         2017-10-01         2017-12-31
1     456         2017-03-01         2017-05-31
2     789         2017-02-01         2017-04-30
3     333         2017-08-01         2017-10-31

In [3]: event_df
Out[3]:
  user_id event_type  event_date
0     123  downgrade  2017-11-15
1     123     cancel  2017-12-08
2     456     refund  2017-01-23
3     789  downgrade  2017-02-15
4     789     cancel  2017-02-28
5     101   discount  2017-04-05

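What I want: if an event 1) matches on user_id and 2) falls between the first_invoice_date and third_invoice_date of the invoice, I'd like to join the first such event to df, which would result in:

In [4]: df
Out[4]:
  user_id first_invoice_date third_invoice_date event_type  event_date
0     123         2017-10-01         2017-12-31  downgrade  2017-11-15
1     456         2017-03-01         2017-05-31     np.nan      np.nan
2     789         2017-02-01         2017-04-30  downgrade  2017-02-15
3     333         2017-08-01         2017-10-31     np.nan      np.nan

Note that events can happen at any time: some fall outside the df range (before or after it), and there may be multiple events inside one invoice range. There may also be users with no events at all, or events for users that do not appear in the main df.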

1 Answer:

Answer 0 (score: 0)

We use merge + boolean indexing + drop_duplicates + combine_first. (Note: New is a helper column I create here; you can remove it at the end with drop('New', axis=1).)

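# Pair each invoice row with every event for the same user
target = df.merge(event_df, on='user_id', how='left')
# Flag events that fall inside the invoice window (inclusive on both ends)
target['New'] = (target.event_date >= target.first_invoice_date) & (target.event_date <= target.third_invoice_date)
# Keep the first flagged event per user, then fill in the users without one
df.set_index('user_id').combine_first(
    target[target.New].drop_duplicates('user_id').set_index('user_id'))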
Out[531]: 
         New event_date event_type first_invoice_date third_invoice_date
user_id                                                                 
123      1.0 2017-11-15  downgrade         2017-10-01         2017-12-31
333      NaN        NaT        NaN         2017-08-01         2017-10-31
456      NaN        NaT        NaN         2017-03-01         2017-05-31
789      1.0 2017-02-15  downgrade         2017-02-01         2017-04-30
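
The result above is indexed by user_id, has its columns sorted alphabetically, and still carries the helper column. A variation that keeps the original row and column order of df might look like the sketch below. It is not part of the original answer, and it makes two assumptions: df has at most one row per user_id, and "first" means the earliest event_date inside the window. The names candidates, in_window, first_events and result are made up for illustration.

```python
import pandas as pd

# Same sample data as in the question, with dates parsed up front.
df = pd.DataFrame({
    'user_id': ['123', '456', '789', '333'],
    'first_invoice_date': pd.to_datetime(
        ['2017-10-01', '2017-03-01', '2017-02-01', '2017-08-01']),
    'third_invoice_date': pd.to_datetime(
        ['2017-12-31', '2017-05-31', '2017-04-30', '2017-10-31']),
})
event_df = pd.DataFrame({
    'user_id': ['123', '123', '456', '789', '789', '101'],
    'event_type': ['downgrade', 'cancel', 'refund',
                   'downgrade', 'cancel', 'discount'],
    'event_date': pd.to_datetime(
        ['2017-11-15', '2017-12-08', '2017-01-23',
         '2017-02-15', '2017-02-28', '2017-04-05']),
})

# Pair every invoice row with that user's events.
# An inner join drops users that have no events at all.
candidates = df.merge(event_df, on='user_id', how='inner')

# True where the event lies inside the invoice window (inclusive on both ends).
in_window = candidates['event_date'].between(
    candidates['first_invoice_date'], candidates['third_invoice_date'])

# Earliest in-window event per user (assumes one invoice row per user_id).
first_events = (
    candidates[in_window]
    .sort_values('event_date')
    .drop_duplicates('user_id')
    [['user_id', 'event_type', 'event_date']]
)

# Left-join back onto df so users without a match get NaN / NaT
# and the original row order of df is preserved.
result = df.merge(first_events, on='user_id', how='left')
print(result)
```

With the sample data, users 123 and 789 each pick up their downgrade event, while 456 and 333 end up with NaN in event_type and NaT in event_date. If df can hold several invoice windows per user, the de-duplication key would need to include the window columns as well.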