My data looks like this:
       ID1     ID2            Time              diff
1: 1958616 P209576  4/15/2016 7:46           NA mins
2: 1958493 P209580  3/23/2016 9:41 -33005.16793 mins
3: 1958493 P209580 3/25/2016 15:41   3240.09742 mins
4: 1958493 P209580 3/30/2016 10:22   6880.65360 mins
5: 1958492 P209580 3/30/2016 11:31     69.00078 mins
6: 1958493 P209580 4/11/2016 10:07  17196.62313 mins
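What I want to do is group all IDs that occur within an 8-hour time window of each other, and then return the number of distinct ID1 values in each group. In the example above, rows 4 and 5 would be grouped together because abs(diff) < 60*8.
I generated the diff column with data[, diff := Time - shift(Time)].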
My ideal output would look like this:
   num_of_unique_id1     ID2    Initial_time
1:                 1 P209576  4/15/2016 7:46
2:                 1 P209580  3/23/2016 9:41
3:                 1 P209580 3/25/2016 15:41
4:                 2 P209580 3/30/2016 10:22
5:                 1 P209580 4/11/2016 10:07
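I think num_of_unique_id1 can be created with .SD and length(unique(ID1)), but I don't know how to create the column to pass to the by = argument.
I know there will also be boundary cases (A is within 8 hours of B, and B is within 8 hours of C, but A and C are more than 8 hours apart). In those cases, I think I would want all of them grouped into a single row.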
dput(data)
structure(list(ID1 = c("1958616", "1958493", "1958493", "1958493",
"1958492", "1958493"), ID2 = c("P209576", "P209580", "P209580",
"P209580", "P209580", "P209580"), Time = structure(c(1460706387.438,
1458726077.362, 1458920483.207, 1459333322.423, 1459337462.47,
1460369259.858), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
diff = structure(c(NA, -33005.1679333329, 3240.09741666714,
6880.65360000133, 69.0007833321889, 17196.6231333335), units = "mins", class = "difftime")), .Names = c("ID1",
"ID2", "Time", "diff"), class = c("data.table", "data.frame"), row.names = c(NA,
-6L), .internal.selfref = <pointer: 0x1ce9a28>)
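Answer 0 (score: 4)
If the data is sorted by time, we can compute the gap between consecutive rows and then use cumsum over the "gap exceeds the window" flags to assign a unique group id to each run of nearby rows.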
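As a toy illustration of the idea (the gap values below are made up for this sketch and are not the data from the question): every gap larger than the window starts a new group, and cumsum turns those TRUE/FALSE flags into a running group id.

```r
# Hypothetical gaps (in hours) between consecutive, time-sorted events.
# The first event has no predecessor, so its gap is set to 0.
gaps <- c(0, 5, 5, 20, 1)
max_gap <- 8  # window size in hours (name chosen for this sketch)

# TRUE marks the first row of a new group; cumsum() counts the starts so far.
group <- cumsum(gaps > max_gap) + 1
group
## [1] 1 1 1 2 2
```

Because each event is compared only with the one just before it, the first three events end up in the same group even though the first and third are 10 hours apart. That is the chaining behaviour asked for in the question (A near B, B near C, all in one group).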
data <- data[order(Time)]
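data[ , diff := NULL] # drop the old column; we will re-compute the gaps in hours
# diff(Time) picks its own units, so request hours explicitly when converting
data[ , diff_hours := c(0, as.numeric(diff(Time), units = "hours"))]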
##        ID1     ID2                Time diff_hours
## 1: 1958493 P209580 2016-03-23 09:41:17   0.000000
## 2: 1958493 P209580 2016-03-25 15:41:23  54.001624
## 3: 1958493 P209580 2016-03-30 10:22:02 114.677560
## 4: 1958492 P209580 2016-03-30 11:31:02   1.150013
## 5: 1958493 P209580 2016-04-11 10:07:39 286.610386
## 6: 1958616 P209576 2016-04-15 07:46:27  93.646550
window <- 8 # the time window in hours
data[ , group := cumsum(diff_hours > window) + 1]
data[ , num_of_unique_id1 := uniqueN(ID1), by = group]
##        ID1     ID2                Time diff_hours group num_of_unique_id1
## 1: 1958493 P209580 2016-03-23 09:41:17   0.000000     1                 1
## 2: 1958493 P209580 2016-03-25 15:41:23  54.001624     2                 1
## 3: 1958493 P209580 2016-03-30 10:22:02 114.677560     3                 2
## 4: 1958492 P209580 2016-03-30 11:31:02   1.150013     3                 2
## 5: 1958493 P209580 2016-04-11 10:07:39 286.610386     4                 1
## 6: 1958616 P209576 2016-04-15 07:46:27  93.646550     5                 1
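Note that the two data points on 2016-03-30, which are a little over an hour apart, are assigned the same group, so num_of_unique_id1 (per group) is 2 for them, while every other data point ends up in a group of its own.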