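I'm struggling to align events that alter a customer's invoices within a very specific time window. I'm tempted to load this data into a couple of local MySQL tables, but I'm curious how to solve this specifically in pandas.
Here's some fake data: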
import numpy as np
import pandas as pd
data = {
'user_id':['123','456','789','333'],
'first_invoice_date':['2017-10-01','2017-03-01','2017-02-01','2017-08-01'],
'third_invoice_date':['2017-12-31','2017-05-31','2017-04-30','2017-10-31']
}
df = pd.DataFrame(data,columns=['user_id','first_invoice_date','third_invoice_date'])
events = {
'user_id':['123','123','456','789','789','101'],
'event_type':['downgrade','cancel','refund','downgrade','cancel','discount'],
'event_date':['2017-11-15','2017-12-08','2017-01-23','2017-02-15','2017-02-28','2017-04-05']
}
event_df = pd.DataFrame(events,columns=['user_id','event_type','event_date'])
df['first_invoice_date'] = pd.to_datetime(df['first_invoice_date'])
df['third_invoice_date'] = pd.to_datetime(df['third_invoice_date'])
event_df['event_date'] = pd.to_datetime(event_df['event_date'])
The sample data produces the following dataframes:
In [2]: df
Out[2]:
user_id first_invoice_date third_invoice_date
0 123 2017-10-01 2017-12-31
1 456 2017-03-01 2017-05-31
2 789 2017-02-01 2017-04-30
3 333 2017-08-01 2017-10-31
In [3]: event_df
Out[3]:
user_id event_type event_date
0 123 downgrade 2017-11-15
1 123 cancel 2017-12-08
2 456 refund 2017-01-23
3 789 downgrade 2017-02-15
4 789 cancel 2017-02-28
5 101 discount 2017-04-05
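What I want is this: if an event 1) matches the user_id and 2) falls between the invoice dates (first_invoice_date and third_invoice_date), I'd like to join the first such event to df, which would result in:
In [4]: df
Out[4]:
user_id first_invoice_date third_invoice_date event_type event_date
0 123 2017-10-01 2017-12-31 downgrade 2017-11-15
1 456 2017-03-01 2017-05-31 np.nan np.nan
2 789 2017-02-01 2017-04-30 downgrade 2017-02-15
3 333 2017-08-01 2017-10-31 np.nan np.nan
Note that events can occur in any time frame: some fall outside the invoice range (before or after it), and there may be multiple events inside the invoice range. A user may also have any number of other events or no events at all, and some events are unrelated to the main df.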
Answer 0 (score: 0)
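We can use merge + boolean indexing + drop_duplicates + combine_first.
(Note: New is a helper column I create here; you can remove it at the end with drop(columns='New'). The positional form drop('New',1) only works on older pandas versions.)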
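# Left-merge so every invoice row is paired with each of that user's events.
target=df.merge(event_df,on='user_id',how='left')
# Flag the events that fall inside the invoice window (inclusive on both ends).
target['New']=(target.event_date>=target.first_invoice_date)&(target.event_date<=target.third_invoice_date)
# Keep the first flagged event per user, then fill it into df by user_id.
df.set_index('user_id').\
combine_first((target[target.New].drop_duplicates('user_id').\
set_index('user_id')))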
Out[531]:
New event_date event_type first_invoice_date third_invoice_date
user_id
123 1.0 2017-11-15 downgrade 2017-10-01 2017-12-31
333 NaN NaT NaN 2017-08-01 2017-10-31
456 NaN NaT NaN 2017-03-01 2017-05-31
789 1.0 2017-02-15 downgrade 2017-02-01 2017-04-30
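For anyone who wants the result in the original row order and without the helper column, here is a minimal sketch of the same merge-and-filter idea. It is a variation, not the code from the answer above: the function name first_event_in_window is hypothetical, it assumes each user_id appears only once in the invoice table, and it sorts events by event_date so that "first" means earliest by date rather than first by row position.

```python
import pandas as pd


def first_event_in_window(invoices, events):
    """Attach each user's earliest in-window event to the invoice table.

    Hypothetical helper; assumes one invoice row per user_id.
    """
    # Sort so that "first" means earliest by date, not first by row order.
    ordered = events.sort_values('event_date')
    # Pair every invoice row with each of that user's events.
    merged = invoices.merge(ordered, on='user_id', how='left')
    # Keep only events inside the invoice window (inclusive on both ends).
    in_window = merged['event_date'].between(
        merged['first_invoice_date'], merged['third_invoice_date'])
    first = merged.loc[in_window].drop_duplicates('user_id')
    # Left-merge back so users without a matching event get NaN/NaT.
    return invoices.merge(
        first[['user_id', 'event_type', 'event_date']],
        on='user_id', how='left')


# Same fake data as in the question.
invoices = pd.DataFrame({
    'user_id': ['123', '456', '789', '333'],
    'first_invoice_date': ['2017-10-01', '2017-03-01', '2017-02-01', '2017-08-01'],
    'third_invoice_date': ['2017-12-31', '2017-05-31', '2017-04-30', '2017-10-31'],
})
events = pd.DataFrame({
    'user_id': ['123', '123', '456', '789', '789', '101'],
    'event_type': ['downgrade', 'cancel', 'refund', 'downgrade', 'cancel', 'discount'],
    'event_date': ['2017-11-15', '2017-12-08', '2017-01-23',
                   '2017-02-15', '2017-02-28', '2017-04-05'],
})
for col in ('first_invoice_date', 'third_invoice_date'):
    invoices[col] = pd.to_datetime(invoices[col])
events['event_date'] = pd.to_datetime(events['event_date'])

result = first_event_in_window(invoices, events)
print(result)
```

The second merge back onto the invoice table is what keeps users 456 and 333 in the result with empty event columns, and it also preserves the row order of the question's df instead of sorting by user_id as combine_first does.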