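I've run into this problem twice in the past two weeks alone, so I figured it was worth posting. I'm trying to identify "runs" in a data.table, but I can't find an elegant way to do it.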
library(data.table)
set.seed(2016)
dt <- data.table(ID = 1:50, Char = sample(LETTERS, 50, replace=TRUE))
dt <- dt[order(Char, ID)]
ID Char
1: 9 A
2: 10 B
3: 20 C
4: 42 C
5: 2 D
6: 4 D
7: 6 D
8: 18 D
...
Here, I want to identify and group rows whose ID is within 2 of the ID in the row above or below. This is my current, ugly solution:
# Runs of 2 or more IDs within 2 of each other
dt[, `:=`(InRun = FALSE, InRunStart = FALSE)]
dt[abs(ID - shift(ID, type="lag")) <= 2 | abs(shift(ID, type="lead") - ID) <= 2, InRun := TRUE]
dt[InRun == TRUE & abs(ID - shift(ID, type="lag")) > 2 | is.na(shift(ID, type="lag")), InRunStart := TRUE]
dt[InRun == TRUE, RunID := cumsum(InRunStart)]
dt[, c("InRun", "InRunStart") := NULL]
dt
ID Char RunID
1: 9 A 1
2: 10 B 1
3: 20 C NA
4: 42 C NA
5: 2 D 2
6: 4 D 2
7: 6 D 2
8: 18 D NA
...
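Is there a better way?
Edit: There seems to be some confusion about how I define a "run". To be more explicit: row_i and row_i + 1 should have the same RunID if and only if their IDs are within 2 of each other.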
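To make that definition concrete, here is a minimal base R sketch using the first eight IDs shown in the output above. The names `id`, `new_run`, `run_id` and `same_run` are only illustrative. Unlike the RunID column above, this version also numbers runs of length 1 instead of leaving them NA.

```r
# First eight IDs from the output above, in displayed row order
id <- c(9, 10, 20, 42, 2, 4, 6, 18)

# TRUE where a row is more than 2 away from the row above it,
# i.e. where a new run starts (the first row always starts one)
new_run <- c(TRUE, abs(diff(id)) > 2)

# Running count of run starts gives every row a run ID
run_id <- cumsum(new_run)

# Adjacent rows share a run ID exactly when their IDs are within 2
same_run <- run_id[-1] == run_id[-length(run_id)]
stopifnot(identical(same_run, abs(diff(id)) <= 2))
```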
Answer 0 (score: 3)
I would stop after making this run ID:
dt[, run_id0 := 1L + cumsum(abs(ID - shift(ID, fill=ID[1L])) > 2)]
But to get the OP's run ID (which ignores runs of length 1), there are a few ways to go:
dt[duplicated(run_id0) | duplicated(run_id0, fromLast=TRUE), run_id1 := .GRP, by=run_id0 ]
# or
dt[, run_len := .N, by=run_id0 ][ run_len > 1L, run_id2 := .GRP, by=run_id0 ]
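As a quick check of the two steps above, the following sketch applies them to the first eight IDs from the question's output. The table name `ex` is hypothetical; the columns `run_id0`, `run_len` and `run_id2` follow the code above.

```r
library(data.table)

# First eight IDs from the question's output, in displayed row order
ex <- data.table(ID = c(9L, 10L, 20L, 42L, 2L, 4L, 6L, 18L))

# A new run starts wherever the gap to the previous row exceeds 2;
# fill = ID[1L] makes the first gap 0, so the first row lands in run 1
ex[, run_id0 := 1L + cumsum(abs(ID - shift(ID, fill = ID[1L])) > 2)]

# Number only the runs of length 2 or more; .GRP counts groups within
# the filtered rows, so the surviving runs are numbered consecutively
ex[, run_len := .N, by = run_id0][run_len > 1L, run_id2 := .GRP, by = run_id0]

print(ex)
```

Here `run_id0` comes out as 1, 1, 2, 3, 4, 4, 4, 5 and `run_id2` as 1, 1, NA, NA, 2, 2, 2, NA, which matches the RunID column in the question for these rows.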
Answer 1 (score: 1)
I don't know whether this is elegant, but how about:
dt <- data.table(ID = c(9, 10, 15, 18, 21, 22, 25))
run_ids <- abs(dt[1:(.N-1), ID] - dt[2:.N, ID]) <= 2
run_ids <- c(run_ids[1], run_ids)
foo <- with(rle(run_ids), rep(cumsum(values) * values, lengths))
foo[foo == 0] = foo[which(foo == 0) + 1]
dt[, RunID := foo]
dt[RunID == 0, RunID := NA]
# ID RunID
# 1: 9 1
# 2: 10 1
# 3: 15 NA
# 4: 18 NA
# 5: 21 2
# 6: 22 2
# 7: 25 NA