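This is a problem I've been trying to solve for days. I think I may need to drop some data or something, but honestly I'm not sure. I have some data that looks like this: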
email Action ActionType TD cnt Date_Time
aaaa Company trial TD 1 10/12/14 19:17
aaaa Task Call 0 NA 10/13/14 17:00
bbbb Task Call 0 NA 12/9/14 16:17
bbbb Task Call 0 NA 12/9/14 16:17
bbbb Task Call 0 NA 12/10/14 16:31
bbbb Task Call 0 NA 12/12/14 16:45
bbbb Company demo TD 1 12/12/14 17:17
bbbb Event Demo TD 2 2/9/15 15:09
cccc Company trial TD 1 8/18/14 14:28
cccc Company demo TD 2 8/20/14 13:21
cccc Event Demo TD 3 2/9/15 15:08
dddd Company trial TD 1 12/14/14 0:09
eeee Company demo TD 1 8/27/14 21:57
eeee Event Demo TD 2 2/9/15 15:08
eeee Event Demo TD 3 2/9/15 15:08
ffff Company trial TD 1 3/19/14 21:15
gggg Company trial TD 1 7/30/14 18:06
hhhh Company trial TD 1 4/3/14 0:26
iiiii Company trial TD 1 5/29/14 20:10
iiiii Task Call 0 NA 5/29/14 22:01
jjjjj Task Call 0 NA 10/15/14 19:46
jjjjj Company trial TD 1 11/12/14 19:05
jjjjj Task Call 0 NA 11/12/14 19:16
jjjjj Task Call 0 NA 11/12/14 19:16
jjjjj Task Call 0 NA 11/12/14 19:31
jjjjj Task Call 0 NA 11/12/14 22:10
jjjjj Task Call 0 NA 11/13/14 19:46
jjjjj Task Call 0 NA 11/26/14 17:31
jjjjj Task Call 0 NA 11/26/14 17:31
jjjjj Task Call 0 NA 11/26/14 17:31
jjjjj Task Call 0 NA 11/26/14 17:31
kkkk Company trial TD 1 1/10/14 3:37
kkkk Task Call 0 NA 10/24/14 0:06
kkkk Task Call 0 NA 10/24/14 0:06
kkkk Task Call 0 NA 10/24/14 13:30
kkkk Company trial TD 2 10/27/14 12:45
kkkk Task Call 0 NA 1/23/15 14:31
kkkk Task Call 0 NA 1/26/15 21:15
kkkk Company Trial TD 3 1/27/15 21:15
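The goal is to compute the time differences between a demo or trial and the calls that precede it. For example, for each email address I need to find the first demo/trial, then work backwards: compute the difference between that demo/trial and the call before it, then between that call and the call before that, and so on.
After the first demo/trial, I don't care about any calls unless another demo/trial follows a few more calls. In that case the process should start over at the second demo/trial and compute the differences between it and the calls before it. I have a column "TD" that flags rows that are a demo/trial. The "cnt" column is the running number of TDs for that email address. For example, if the same email has two trials back to back, the "cnt" column will show 1 and then 2 for that email address.
So basically I want the data to look like this: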
email Action ActionType TD cnt Date_Time Time_Diff
aaaa Company trial TD 1 10/12/14 19:17
aaaa Task Call 0 NA 10/13/14 17:00
bbbb Task Call 0 NA 12/9/14 16:17
bbbb Task Call 0 NA 12/9/14 16:17 0
bbbb Task Call 0 NA 12/10/14 16:31 1 d 14 m
bbbb Task Call 0 NA 12/12/14 16:45 2 d 14 m
bbbb Company demo TD 1 12/12/14 17:17 32 m
bbbb Event Demo TD 2 2/9/15 15:09
cccc Company trial TD 1 8/18/14 14:28
cccc Company demo TD 2 8/20/14 13:21
cccc Event Demo TD 3 2/9/15 15:08
dddd Company trial TD 1 12/14/14 0:09
eeee Company demo TD 1 8/27/14 21:57
eeee Event Demo TD 2 2/9/15 15:08
eeee Event Demo TD 3 2/9/15 15:08
ffff Company trial TD 1 3/19/14 21:15
gggg Company trial TD 1 7/30/14 18:06
hhhh Company trial TD 1 4/3/14 0:26
iiiii Company trial TD 1 5/29/14 20:10
iiiii Task Call 0 NA 5/29/14 22:01
jjjjj Task Call 0 NA 10/15/14 19:46
jjjjj Company trial TD 1 11/12/14 19:05 27 d, 23 h, 19 m
jjjjj Task Call 0 NA 11/12/14 19:16
jjjjj Task Call 0 NA 11/12/14 19:16
jjjjj Task Call 0 NA 11/12/14 19:31
jjjjj Task Call 0 NA 11/12/14 22:10
jjjjj Task Call 0 NA 11/13/14 19:46
jjjjj Task Call 0 NA 11/26/14 17:31
jjjjj Task Call 0 NA 11/26/14 17:31
jjjjj Task Call 0 NA 11/26/14 17:31
jjjjj Task Call 0 NA 11/26/14 17:31
kkkk Company trial TD 1 1/10/14 3:37
kkkk Task Call 0 NA 10/24/14 0:06
kkkk Task Call 0 NA 10/24/14 0:06 0
kkkk Task Call 0 NA 10/24/14 13:30 13 h, 24 m
kkkk Company trial TD 2 10/27/14 12:45 2 d, 23 h, 15 m
kkkk Task Call 0 NA 1/23/15 14:31
kkkk Task Call 0 NA 1/26/15 21:15 3 d, 6 h, 44 m
kkkk Company trial TD 3 1/27/15 21:15 1 d
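How the time difference is formatted doesn't matter to me.
Answer 0 (score: 2)
This kind of data manipulation is probably easier in SQL. To do it in R, you'll want to use data.table rather than a plain data frame.
The following solution isn't the most elegant, but it should scale. Maybe it will give you an idea of how to do this without creating a bunch of new columns the way I did. Worst case, you can put it in a loop until all the TDs are covered.
Basically I just applied a series of conditional statements across rows.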
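The answer's code assumes `dt` already exists as a data.table with `Date_Time` stored as a date-time rather than text. As a minimal sketch of that setup, the bbbb rows from the question could be typed in by hand and parsed as below. The `tz = "UTC"` choice is an assumption made here so that daylight-saving changes do not shift the differences; it is not stated in the question.

```r
library(data.table)

# A few rows from the question (the bbbb block), typed in by hand for illustration
dt <- data.table(
  email      = c("bbbb", "bbbb", "bbbb", "bbbb", "bbbb", "bbbb"),
  Action     = c("Task", "Task", "Task", "Task", "Company", "Event"),
  ActionType = c("Call", "Call", "Call", "Call", "demo", "Demo"),
  TD         = c("0", "0", "0", "0", "TD", "TD"),
  cnt        = c(NA, NA, NA, NA, 1L, 2L),
  Date_Time  = c("12/9/14 16:17", "12/9/14 16:17", "12/10/14 16:31",
                 "12/12/14 16:45", "12/12/14 17:17", "2/9/15 15:09")
)

# Parse the m/d/yy H:M strings into POSIXct (UTC is an assumption, see above)
dt[, Date_Time := as.POSIXct(Date_Time, format = "%m/%d/%y %H:%M", tz = "UTC")]
```

With a local time zone that observes daylight saving, differences spanning a clock change can come out one hour longer or shorter than the wall-clock arithmetic suggests.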
library(data.table)
# dt is assumed to be a data.table with Date_Time already parsed as POSIXct
setkey(dt,email)
dt[ActionType=="Call",call_times:=Date_Time] #Field with call times only for taking mins
dt[TD=="TD",TDtime:=Date_Time] # same thing with TD
dt[,first_call:=min(call_times,na.rm=TRUE),by=email] # date time of first call for all records from an email
legit<-unique(dt[TDtime>first_call,email]) # only keeping records for emails where there was a TD after the first call
dt<-dt[.(legit)]
dt<-dt[Date_Time>first_call|ActionType=="Call"] # also removing TDs that happened before first call
dt[,first_TD:=min(TDtime,na.rm=TRUE),by=email] # same with TD
dt[call_times>first_TD,call_times_2:=Date_Time] #find all calls after the first TD
dt[,second_call:=min(call_times_2,na.rm = TRUE),by=email] #find the time of the first call after the first TD
dt[TDtime>second_call,TDtimes_2:=Date_Time] #find all TDs after the second group of calls
dt[,second_TD:=min(TDtimes_2,na.rm=TRUE),by=email] #find the first TD after second group of calls starts
dt[Date_Time<=first_TD,call_group:=1] # group calls
dt[Date_Time>first_TD&Date_Time<=second_TD&second_TD!=Inf,call_group:=2]
dt[!is.na(call_group),time_diff:=c(0,(diff(as.numeric(Date_Time))/3600)),by=.(email,call_group)] #find lagged differences between the call times within each call group. (in hours)
dt[!is.na(time_diff),.(email,ActionType,Date_Time,time_diff)]
Finally, you can compute the time difference in whatever unit you need. I just used hours for simplicity.
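If a label like the `2 d, 23 h, 15 m` shown in the question is preferred over decimal hours, a small conversion helper could be used. The sketch below is only an illustration; `fmt_hours` is a hypothetical name and not part of data.table or base R.

```r
# Hypothetical helper: turn decimal hours into a "d, h, m" label
fmt_hours <- function(h) {
  mins <- round(h * 60)                 # whole minutes, avoids floating-point noise
  d  <- mins %/% (24 * 60)              # full days
  hr <- (mins %% (24 * 60)) %/% 60      # remaining full hours
  m  <- mins %% 60                      # remaining minutes
  sprintf("%d d, %d h, %d m", as.integer(d), as.integer(hr), as.integer(m))
}

fmt_hours(c(0.5333333, 13.4, 71.25))
# "0 d, 0 h, 32 m"  "0 d, 13 h, 24 m"  "2 d, 23 h, 15 m"
```

The result of the answer's code, with `time_diff` in decimal hours, is: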
email ActionType Date_Time time_diff
1: bbbb Call 2014-12-09 16:17:00 0.0000000
2: bbbb Call 2014-12-09 16:17:00 0.0000000
3: bbbb Call 2014-12-10 16:31:00 24.2333333
4: bbbb Call 2014-12-12 16:45:00 48.2333333
5: bbbb demo 2014-12-12 17:17:00 0.5333333
6: jjjjj Call 2014-10-15 19:46:00 0.0000000
7: jjjjj trial 2014-11-12 19:05:00 672.3166667
8: kkkk Call 2014-10-24 00:06:00 0.0000000
9: kkkk Call 2014-10-24 00:06:00 0.0000000
10: kkkk Call 2014-10-24 13:30:00 13.4000000
11: kkkk trial 2014-10-27 12:45:00 71.2500000
12: kkkk Call 2015-01-23 14:31:00 0.0000000
13: kkkk Call 2015-01-26 21:15:00 78.7333333
14: kkkk Trial 2015-01-27 21:15:00 24.0000000
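The code above hard-codes two call groups per email. As a sketch of one way the same idea might extend to any number of TDs without a loop, the group can be derived from a running count of TDs. This is a different technique from the answer's column-by-column approach, not the author's method. The columns `call_group` (reused name), `keep`, and the toy rows (the kkkk block from the question, parsed in UTC) are assumptions for illustration.

```r
library(data.table)

# Toy rows for one email, taken from the question's kkkk example
dt <- data.table(
  email = "kkkk",
  ActionType = c("trial", "Call", "Call", "Call", "trial", "Call", "Call", "Trial"),
  TD = c("TD", "0", "0", "0", "TD", "0", "0", "TD"),
  Date_Time = as.POSIXct(
    c("1/10/14 3:37", "10/24/14 0:06", "10/24/14 0:06", "10/24/14 13:30",
      "10/27/14 12:45", "1/23/15 14:31", "1/26/15 21:15", "1/27/15 21:15"),
    format = "%m/%d/%y %H:%M", tz = "UTC")
)

setorder(dt, email, Date_Time)

# Number of TDs seen strictly before each row, so every TD closes its own group
dt[, call_group := cumsum(TD == "TD") - (TD == "TD"), by = email]

# Keep only groups that contain at least one call and end in a TD
dt[, keep := any(TD != "TD") && TD[.N] == "TD", by = .(email, call_group)]

# Lagged differences in hours within each kept group (0 for the first row)
dt[keep == TRUE,
   time_diff := c(0, diff(as.numeric(Date_Time)) / 3600),
   by = .(email, call_group)]

dt[!is.na(time_diff), .(email, ActionType, Date_Time, time_diff)]
```

On these rows the sketch gives the same seven kkkk values as the output above (0, 0, 13.4, 71.25, 0, 78.73, 24). Groups made of a lone TD, or of calls with no TD after them, are dropped.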