I have the data table shown below, along with inputs for two scenarios.
DT1:
date item id
1: 2016-09-05 view 1
2: 2016-09-05 view 1
3: 2016-09-05 view 1
4: 2016-09-06 pv 1
5: 2016-09-06 pv 1
6: 2016-09-06 pv 1
7: 2016-09-06 check 1
8: 2016-09-06 check 1
9: 2016-09-06 check 1
10: 2016-09-06 check 1
dput1:
DT = setDT(structure(list(date = structure(c(17049, 17049, 17049, 17050,
17050, 17050, 17050, 17050, 17050, 17050), class = "Date"), item = c("view",
"view", "view", "pv", "pv", "pv", "check", "check", "check",
"check"), id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("date",
"item", "id"), row.names = c(NA, -10L), class = c("data.table",
"data.frame")))
DT2:
date item id
1: 2016-09-05 view 1
2: 2016-09-05 view 1
3: 2016-09-05 view 1
4: 2016-09-06 pv 1
5: 2016-09-06 pv1 1
6: 2016-09-06 pv2 1
7: 2016-09-06 check 1
8: 2016-09-06 check 1
9: 2016-09-06 check 1
10: 2016-09-06 check 1
dput2:
DT2 = setDT(structure(list(date = structure(c(17049, 17049, 17049, 17050,
17050, 17050, 17050, 17050, 17050, 17050), class = "Date"), item = c("view",
"view", "view", "pv", "pv1", "pv2", "check", "check", "check",
"check"), id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("date",
"item", "id"), row.names = c(NA, -10L), class = c("data.table",
"data.frame")))
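I am trying to find the first occurrence of 'pv' in the 'item' column and extract the corresponding date for that entry, then extract the date of the first occurrence of 'check' in the 'item' column. Then, by id, I want to get the difference in days and store it in a new variable.
There may be several values to check for 'pv'. For example, if 'pv' is not in the column, then 'pv1' or 'pv2' can be checked instead. The idea is to take the first occurrence: if 'pv', 'pv1' and 'pv2' are all present but 'pv2' occurs first, then the date corresponding to 'pv2' should be taken. Likewise, only 'pv2', only 'pv1', or only 'pv' might be present in the 'item' column. How can we perform the check so that it takes the first occurrence among the three possibilities and extracts its date? Any thoughts?
Using data.table or %>%.
I am looking for ideas and suggestions that accomplish the task with minimal code.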
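Answer 0 (score: 4)
If 'dt' is a 'data.table' object, then after grouping by 'id' we get the index of the first occurrence of 'pv' (which.max(item == "pv")) and of 'check', subset 'date' with each index, subtract one date from the other, and assign (:=) the result to a new variable 'Diff'.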
dt[, Diff := date[which.max(item == "pv")] - date[which.max(item == "check")], by = id]
Or, instead of which.max, use match to get the index:
dt[, Diff := date[match("pv", item)] - date[match("check", item)], by = id]
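For reference, here is a minimal runnable sketch of the match approach on the first sample table. It assumes the data.table package is installed; the table is rebuilt with data.table() from the values in the question rather than from the dput output, and the name 'dt' follows the answer.

```r
library(data.table)

# Rebuild the first sample table from the question (same values as DT1)
dt <- data.table(
  date = as.Date(c(rep("2016-09-05", 3), rep("2016-09-06", 7))),
  item = c(rep("view", 3), rep("pv", 3), rep("check", 4)),
  id   = 1
)

# Per id: date of the first 'pv' row minus date of the first 'check' row
dt[, Diff := date[match("pv", item)] - date[match("check", item)], by = id]

# In this sample the first 'pv' and the first 'check' fall on the same day
print(dt)
```

Because both events first occur on 2016-09-06 in this sample, the difference is zero days on every row.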
Note 1: This assumes that every 'id' has at least one 'pv' and at least one 'check'.
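The two lines above do not cover the 'pv1' / 'pv2' variants from the question, and they behave differently when an id has no match: which.max on an all-FALSE vector returns 1, so it silently picks the first row, whereas match returns NA. One possible sketch, not part of the original answer, uses %in% with which(...)[1]. The vector 'pv_items' and the extra id 2 (which has no 'pv' row) are hypothetical additions for illustration, and "first occurrence" is taken to mean first in row order.

```r
library(data.table)

# Second sample table (values as in dput2), plus a hypothetical id 2 without any 'pv' row
dt2 <- rbind(
  data.table(
    date = as.Date(c(rep("2016-09-05", 3), rep("2016-09-06", 7))),
    item = c(rep("view", 3), "pv", "pv1", "pv2", rep("check", 4)),
    id   = 1
  ),
  data.table(date = as.Date("2016-09-07"), item = "check", id = 2)
)

# Assumed set of values that count as a 'pv' event
pv_items <- c("pv", "pv1", "pv2")

# which(...)[1] is the first matching row within the group, or NA if nothing matches
dt2[, Diff := date[which(item %in% pv_items)[1]] - date[match("check", item)], by = id]

print(dt2)
```

If "first" should mean the earliest date rather than the first row, replacing the index lookup with min(date[item %in% pv_items]) would be one alternative, though it warns on groups with no match.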
Note 2: If we need the difference in specific units, use difftime and specify units.
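As a rough illustration of Note 2, the sketch below uses difftime with units = "hours". The three-row table is hypothetical (chosen so that the difference is non-zero), and wrapping the result in as.numeric to store a plain number is a design choice here, not something the answer prescribes.

```r
library(data.table)

# Hypothetical data: first 'pv' on 2016-09-06, first 'check' on 2016-09-08
dt <- data.table(
  date = as.Date(c("2016-09-05", "2016-09-06", "2016-09-08")),
  item = c("view", "pv", "check"),
  id   = 1
)

# difftime(a, b) computes a - b; same sign convention as the answer ('pv' minus 'check')
dt[, Diff := as.numeric(difftime(date[match("pv", item)],
                                 date[match("check", item)],
                                 units = "hours")),
   by = id]

print(dt)
```

Swapping the two arguments of difftime gives a positive value when 'check' comes after 'pv'.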