structure(list(group = c(NA, "A", "B", NA, "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", NA, NA, "B", "B", "A", "A", NA, NA, "B", "B", "B", NA, "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", NA, NA, "B", "B",
NA, "A"), seq_break = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)), .Names = c("group",
"seq_break"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-50L))
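In the data above, I need to define a column containing a run-length type ID for the group column (like the one data.table::rleid produces, but ignoring NA). As you can see, we also have a seq_break column, which should end a sequence. Usually seq_break = TRUE coincides with group = NA. But sometimes seq_break = TRUE while the group is A or B; in that case the sequence should still end and a new one should start, even if the next row refers to the same group. So, for example, rows 25:26 should get two different sequence IDs even though both events refer to group B. Overall, the expected output looks like this: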
structure(list(group = c(NA, "A", "B", NA, "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", NA, NA, "B", "B", "A", "A", NA, NA, "B", "B", "B", NA, "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", NA, NA, "B", "B",
NA, "A"), seq_break = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE), expected_output = c(NA,
1, 2, NA, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, NA, NA, 4, 5, 6, 6, NA, NA, 7, 7, 7, NA, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, NA, NA, 11, 11, NA, 12)), .Names = c("group", "seq_break",
"expected_output"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-50L))
How can I achieve this with tidyverse?
Answer 0 (score: 2)
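A solution using tidyverse and data.table. Assume dt1 is the example data frame and dt3 is the final output. Note that I think rows 47 to 48 of the expected output should be 9 and row 50 should be 10. I am not sure why rows 47 to 48 are 11 and row 50 is 12 in your expected output.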
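library(tidyverse)
library(data.table)  # for rleid()
# Keep the original row positions so the IDs can be joined back later
dt2 <- dt1 %>% rowid_to_column()
dt3 <- dt2 %>%
  # Label every run of identical (group, seq_break) pairs
  mutate(ID = rleid(group, seq_break)) %>%
  group_by(group, seq_break, ID) %>%
  # Within each run of NA/TRUE rows, keep only the first row
  filter(!(is.na(group) & seq_break & row_number() > 1)) %>%
  ungroup() %>%
  # Every remaining seq_break = TRUE row opens a new block
  mutate(ID2 = cumsum(seq_break)) %>%
  drop_na(group) %>%
  # Number the runs of (group, block) among the non-NA rows
  mutate(expected_output = rleid(group, ID2)) %>%
  select(rowid, expected_output) %>%
  # Join back to the full data; rows with NA group get NA
  left_join(dt2, ., by = "rowid") %>%
  select(-rowid)
dt3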
# # A tibble: 50 x 3
# group seq_break expected_output
# <chr> <lgl> <int>
# 1 NA TRUE NA
# 2 A FALSE 1
# 3 B FALSE 2
# 4 NA TRUE NA
# 5 B FALSE 3
# 6 B FALSE 3
# 7 B FALSE 3
# 8 B FALSE 3
# 9 B FALSE 3
# 10 B FALSE 3
# # ... with 40 more rows
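If you would rather avoid the data.table dependency, the same rule can probably be written with dplyr alone. The following is only a sketch under some assumptions: add_run_id, run_id and the small toy data frame are names made up for illustration, seq_break is assumed to contain no NA, and every row with group = NA is treated as ending the current run (which holds in the example data, where all NA rows have seq_break = TRUE, but is slightly stricter than the pipeline above).

```r
library(dplyr)

# Hypothetical helper (not part of any package): numbers runs of `group`,
# starting a new run whenever seq_break is TRUE, the group changes,
# or the previous row has group = NA.
add_run_id <- function(df) {
  df %>%
    mutate(
      prev_group = lag(group),
      # A non-NA row opens a new run if any of the three conditions holds
      starts_run = !is.na(group) &
        (seq_break | is.na(prev_group) | group != prev_group),
      # Count the run starts; rows with NA group keep an NA id
      run_id = if_else(is.na(group), NA_integer_, cumsum(starts_run))
    ) %>%
    select(-prev_group, -starts_run)
}

# Small toy data set (an assumption for illustration, not the question's data)
toy <- data.frame(
  group     = c(NA, "A", "B", NA, "B", "B", "B", "A"),
  seq_break = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

result <- add_run_id(toy)
result
```

On dt1 this would be called as add_run_id(dt1). The toy data repeats the situation from rows 25:26 of the question: two consecutive B rows with seq_break = TRUE each get their own ID.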