Calculating time differences between certain events in R

Time: 2015-03-27 18:31:18

Tags: r

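This is a problem I have been trying to solve for a few days. I think I may need to drop some of the data or something along those lines, but honestly I'm not sure. I have some data that looks like this: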

email   Action  ActionType    TD    cnt     Date_Time
aaaa    Company trial          TD   1   10/12/14 19:17
aaaa    Task    Call           0    NA  10/13/14 17:00
bbbb    Task    Call           0    NA  12/9/14 16:17
bbbb    Task    Call           0    NA  12/9/14 16:17
bbbb    Task    Call           0    NA  12/10/14 16:31
bbbb    Task    Call           0    NA  12/12/14 16:45
bbbb    Company demo           TD   1   12/12/14 17:17
bbbb    Event   Demo           TD   2   2/9/15 15:09
cccc    Company trial          TD   1   8/18/14 14:28
cccc    Company demo           TD   2   8/20/14 13:21
cccc    Event   Demo           TD   3   2/9/15 15:08
dddd    Company trial          TD   1   12/14/14 0:09
eeee    Company demo           TD   1   8/27/14 21:57
eeee    Event   Demo           TD   2   2/9/15 15:08
eeee    Event   Demo           TD   3   2/9/15 15:08
ffff    Company trial          TD   1   3/19/14 21:15
gggg    Company trial          TD   1   7/30/14 18:06
hhhh    Company trial          TD   1   4/3/14 0:26
iiiii   Company trial          TD   1   5/29/14 20:10
iiiii   Task    Call           0    NA  5/29/14 22:01
jjjjj   Task    Call           0    NA  10/15/14 19:46
jjjjj   Company trial          TD   1   11/12/14 19:05
jjjjj   Task    Call           0    NA  11/12/14 19:16
jjjjj   Task    Call           0    NA  11/12/14 19:16
jjjjj   Task    Call           0    NA  11/12/14 19:31
jjjjj   Task    Call           0    NA  11/12/14 22:10
jjjjj   Task    Call           0    NA  11/13/14 19:46
jjjjj   Task    Call           0    NA  11/26/14 17:31
jjjjj   Task    Call           0    NA  11/26/14 17:31
jjjjj   Task    Call           0    NA  11/26/14 17:31
jjjjj   Task    Call           0    NA  11/26/14 17:31
kkkk    Company trial          TD   1   1/10/14 3:37
kkkk    Task    Call           0    NA  10/24/14 0:06
kkkk    Task    Call           0    NA  10/24/14 0:06
kkkk    Task    Call           0    NA  10/24/14 13:30
kkkk    Company trial          TD   2   10/27/14 12:45
kkkk    Task    Call           0    NA  1/23/15 14:31
kkkk    Task    Call           0    NA  1/26/15 21:15
kkkk    Company Trial          TD   3   1/27/15 21:15
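Before any differences can be computed, the Date_Time strings have to be parsed into a date-time class. A minimal sketch in base R, assuming the month/day/two-digit-year layout shown above; the data frame name `df` and the choice of UTC are assumptions, not part of the original data:

```r
# A few bbbb rows copied from the sample above, for illustration only
df <- data.frame(
  email      = rep("bbbb", 4),
  Action     = c("Task", "Task", "Company", "Event"),
  ActionType = c("Call", "Call", "demo", "Demo"),
  TD         = c("0", "0", "TD", "TD"),
  cnt        = c(NA, NA, 1L, 2L),
  Date_Time  = c("12/10/14 16:31", "12/12/14 16:45",
                 "12/12/14 17:17", "2/9/15 15:09"),
  stringsAsFactors = FALSE
)

# Parse the timestamps. tz = "UTC" is an assumption: in a zone that observes
# daylight saving time, differences spanning a clock change shift by an hour.
df$Date_Time <- as.POSIXct(df$Date_Time, format = "%m/%d/%y %H:%M", tz = "UTC")

# Minutes between the last call and the demo that follows it
difftime(df$Date_Time[3], df$Date_Time[2], units = "mins")
```

The last line corresponds to the 32-minute gap shown for bbbb in the desired output.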

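The goal is to calculate the time difference between a demo or trial and the calls that precede it. For example, for each email address I need to find the first demo/trial, then work backwards: calculate the difference between that demo/trial and the call before it, then the difference between that call and the call before it, and so on.

After the first demo/trial, I don't care about any calls unless another demo/trial follows some more calls. In that case the process should start over at the second demo/trial and calculate the differences between the second demo/trial and the calls preceding it. I have a column "TD" that flags rows containing a demo/trial. The "cnt" column numbers the TD occurrences within each email address. For example, if the same email has two trials back to back, the "cnt" column for that email address will show 1 and then 2.

So basically I would like the data to look like this: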

email   Action  ActionType  TD  cnt     Date_Time   Time_Diff
aaaa    Company trial       TD  1   10/12/14 19:17  
aaaa    Task    Call        0   NA  10/13/14 17:00  
bbbb    Task    Call        0   NA  12/9/14 16:17   
bbbb    Task    Call        0   NA  12/9/14 16:17   0
bbbb    Task    Call        0   NA  12/10/14 16:31  1 d 14 m
bbbb    Task    Call        0   NA  12/12/14 16:45  2 d 14 m
bbbb    Company demo        TD  1   12/12/14 17:17  32 m
bbbb    Event   Demo        TD  2   2/9/15 15:09    
cccc    Company trial       TD  1   8/18/14 14:28   
cccc    Company demo        TD  2   8/20/14 13:21   
cccc    Event   Demo        TD  3   2/9/15 15:08    
dddd    Company trial       TD  1   12/14/14 0:09   
eeee    Company demo        TD  1   8/27/14 21:57   
eeee    Event   Demo        TD  2   2/9/15 15:08    
eeee    Event   Demo        TD  3   2/9/15 15:08    
ffff    Company trial       TD  1   3/19/14 21:15   
gggg    Company trial       TD  1   7/30/14 18:06   
hhhh    Company trial       TD  1   4/3/14 0:26 
iiiii   Company trial       TD  1   5/29/14 20:10   
iiiii   Task    Call        0   NA  5/29/14 22:01   
jjjjj   Task    Call        0   NA  10/15/14 19:46  
jjjjj   Company trial       TD  1   11/12/14 19:05  27 d, 23 h, 19 m
jjjjj   Task    Call        0   NA  11/12/14 19:16  
jjjjj   Task    Call        0   NA  11/12/14 19:16  
jjjjj   Task    Call        0   NA  11/12/14 19:31  
jjjjj   Task    Call        0   NA  11/12/14 22:10  
jjjjj   Task    Call        0   NA  11/13/14 19:46  
jjjjj   Task    Call        0   NA  11/26/14 17:31  
jjjjj   Task    Call        0   NA  11/26/14 17:31  
jjjjj   Task    Call        0   NA  11/26/14 17:31  
jjjjj   Task    Call        0   NA  11/26/14 17:31  
kkkk    Company trial       TD  1   1/10/14 3:37    
kkkk    Task    Call        0   NA  10/24/14 0:06   
kkkk    Task    Call        0   NA  10/24/14 0:06   0
kkkk    Task    Call        0   NA  10/24/14 13:30  13 h, 24 m
kkkk    Company trial       TD  2   10/27/14 12:45  2 d, 23 h, 15 m
kkkk    Task    Call        0   NA  1/23/15 14:31   
kkkk    Task    Call        0   NA  1/26/15 21:15   3 d, 6 h, 44 m
kkkk    Company trial       TD  3   1/27/15 21:15   1 d

It doesn't matter to me how the time difference is formatted.
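Since the format is flexible, one option is to keep the differences numeric (for example, in hours) and build a label only for display. A minimal sketch, where `format_hours` is a hypothetical helper name introduced here for illustration:

```r
# Hypothetical helper: turn a number of hours into a "d, h, m" label
format_hours <- function(hours) {
  total_min <- round(hours * 60)               # work in whole minutes
  d <- total_min %/% (24 * 60)                 # full days
  h <- (total_min %% (24 * 60)) %/% 60         # remaining full hours
  m <- total_min %% 60                         # remaining minutes
  ifelse(is.na(hours),
         NA_character_,
         sprintf("%d d, %d h, %d m", as.integer(d), as.integer(h), as.integer(m)))
}

format_hours(c(13.4, 71.25, NA))
# "0 d, 13 h, 24 m" "2 d, 23 h, 15 m" NA
```

Here 71.25 hours corresponds to the "2 d, 23 h, 15 m" entry shown for kkkk in the desired output above.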

1 Answer:

Answer 0 (score: 2)

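This kind of data manipulation is probably easier in SQL. To do it in R, you'll want to use data.table rather than a plain data.frame.

The following solution isn't the most elegant, but it should scale. Maybe it will give you an idea of how to do this without creating a bunch of new columns like I did. Worst case, you could put it in a loop until all of the TDs are covered.

Basically, I just apply a series of conditional statements across the rows.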

library(data.table)
# Assumes dt is a data.table and Date_Time has already been parsed as POSIXct
setkey(dt,email)

dt[ActionType=="Call",call_times:=Date_Time] #Field with call times only for taking mins
dt[TD=="TD",TDtime:=Date_Time] # same thing with TD
dt[,first_call:=min(call_times,na.rm=TRUE),by=email] # date time of first call for all records from an email
legit<-unique(dt[TDtime>first_call,email]) # only keeping records for emails where there was a TD after the first call
dt<-dt[.(legit)] 
dt<-dt[Date_Time>first_call|ActionType=="Call"] # also removing TDs that happened before first call
dt[,first_TD:=min(TDtime,na.rm=TRUE),by=email] # same with TD
dt[call_times>first_TD,call_times_2:=Date_Time] #find all calls after the first TD
dt[,second_call:=min(call_times_2,na.rm = TRUE),by=email] #find the time of the first call after the first TD
dt[TDtime>second_call,TDtimes_2:=Date_Time] #find all TDs after the second group of calls
dt[,second_TD:=min(TDtimes_2,na.rm=TRUE),by=email] #find the first TD after second group of calls starts

dt[Date_Time<=first_TD,call_group:=1] # group calls
dt[Date_Time>first_TD&Date_Time<=second_TD&second_TD!=Inf,call_group:=2] 

dt[!is.na(call_group),time_diff:=c(0,(diff(as.numeric(Date_Time))/3600)),by=.(email,call_group)] #find lagged differences between the call times within each call group. (in hours)
dt[!is.na(time_diff),.(email,ActionType,Date_Time,time_diff)] 

At the end you can compute the time difference in whatever form you like. I just used hours for simplicity.

    email ActionType           Date_Time   time_diff
 1:  bbbb       Call 2014-12-09 16:17:00   0.0000000
 2:  bbbb       Call 2014-12-09 16:17:00   0.0000000
 3:  bbbb       Call 2014-12-10 16:31:00  24.2333333
 4:  bbbb       Call 2014-12-12 16:45:00  48.2333333
 5:  bbbb       demo 2014-12-12 17:17:00   0.5333333
 6: jjjjj       Call 2014-10-15 19:46:00   0.0000000
 7: jjjjj      trial 2014-11-12 19:05:00 672.3166667
 8:  kkkk       Call 2014-10-24 00:06:00   0.0000000
 9:  kkkk       Call 2014-10-24 00:06:00   0.0000000
10:  kkkk       Call 2014-10-24 13:30:00  13.4000000
11:  kkkk      trial 2014-10-27 12:45:00  71.2500000
12:  kkkk       Call 2015-01-23 14:31:00   0.0000000
13:  kkkk       Call 2015-01-26 21:15:00  78.7333333
14:  kkkk      Trial 2015-01-27 21:15:00  24.0000000
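The code above handles at most two groups of calls per email. As a possible generalization that avoids a loop, each row can be assigned to the block that ends at the next TD, and differences taken within each block. This is only a sketch under stated assumptions: the `block` column name is hypothetical, timestamps are parsed in UTC, and, unlike the code above, emails and rows without a preceding call are kept and simply get NA.

```r
library(data.table)

# The kkkk rows from the question, for illustration only
dt <- data.table(
  email      = "kkkk",
  ActionType = c("trial", "Call", "Call", "Call", "trial", "Call", "Call", "Trial"),
  TD         = c("TD", "0", "0", "0", "TD", "0", "0", "TD"),
  Date_Time  = as.POSIXct(
    c("1/10/14 3:37", "10/24/14 0:06", "10/24/14 0:06", "10/24/14 13:30",
      "10/27/14 12:45", "1/23/15 14:31", "1/26/15 21:15", "1/27/15 21:15"),
    format = "%m/%d/%y %H:%M", tz = "UTC")
)

setorder(dt, email, Date_Time)

# Count the TDs seen *before* each row, so a TD shares a block id
# with the calls that lead up to it
dt[, block := shift(cumsum(TD == "TD"), fill = 0L), by = email]

# Lagged differences in hours, only for blocks that end in a TD;
# the first row of each block has no predecessor and stays NA
dt[, time_diff := if (TD[.N] == "TD") {
       c(NA_real_, diff(as.numeric(Date_Time)) / 3600)
     } else {
       NA_real_
     },
   by = .(email, block)]

dt[, .(email, ActionType, Date_Time, time_diff)]
```

For these rows, time_diff comes out as NA, NA, 0, 13.4, 71.25, NA, 78.73 and 24 hours, which matches the kkkk rows of the output above except that the first row of each block is NA rather than 0.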