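I am working with data about different cases that go through different processes over a period of time. Each case has a unique ID number. A process can start in several ways and ends with "Finished" (except for processes that are still ongoing). A case can go through a process multiple times. The data looks like this: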
library(dplyr)
df1 <- structure(list(id = c("1", "1", "2", "2", "2", "2", "3", "3",
"3", "3", "3", "3", "3", "3", "3", "3"), time = structure(c(17453,
17458, 17453, 17462, 17727, 17735, 17453, 17484, 17568, 17665,
17665, 17709, 17727, 17727, 17757, 17819), class = "Date"), old_fase =
c(NA, "Fase 1", NA, "Fase 1", "Finished", "Fase 1", NA, "Fase 1", "Fase 2A",
"Finished", "Fase 2A", "Fase 2B", "Finished", "Fase 2B", "Fase 1",
"Fase 2A"), new_fase = c("Fase 1", "Finished", "Fase 1", "Finished",
"Fase 1", "Finished", "Fase 1", "Fase 2A", "Finished", "Fase 2A",
"Fase 2B", "Finished", "Fase 2B", "Fase 1", "Fase 2A", "Fase 2B"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -16L))
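For my analysis I want to create a new id based on the occurrence of each process within each id. Using group_by on "id" and "new_fase" followed by mutate produces the incorrect result below. This happens because "Fase 2B" occurs for the first time in row 11, so row_number() restarts at 1 there even though that row belongs to the second process of id 3.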
df1 %>%
group_by(id,new_fase) %>%
mutate(occurrence=row_number())
The correct solution should look like this:
df1 %>%
mutate(occurrence = c(rep(1, 4), 2, 2, rep(1, 3), rep(2, 3), rep(3, 4)))
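To make the discrepancy concrete, here is a minimal sketch that puts the grouped row_number() attempt next to the intended numbering and lists the rows where they differ. It rebuilds df1 compactly with tibble() so it runs on its own; the names expected, attempt and mismatches are illustrative only.

```r
library(dplyr)

# Same values as df1 above, rebuilt compactly so this block is self-contained.
df1 <- tibble(
  id = rep(c("1", "2", "3"), times = c(2, 4, 10)),
  time = as.Date(c(17453, 17458, 17453, 17462, 17727, 17735, 17453, 17484,
                   17568, 17665, 17665, 17709, 17727, 17727, 17757, 17819),
                 origin = "1970-01-01"),
  old_fase = c(NA, "Fase 1", NA, "Fase 1", "Finished", "Fase 1", NA, "Fase 1",
               "Fase 2A", "Finished", "Fase 2A", "Fase 2B", "Finished",
               "Fase 2B", "Fase 1", "Fase 2A"),
  new_fase = c("Fase 1", "Finished", "Fase 1", "Finished", "Fase 1", "Finished",
               "Fase 1", "Fase 2A", "Finished", "Fase 2A", "Fase 2B", "Finished",
               "Fase 2B", "Fase 1", "Fase 2A", "Fase 2B")
)

# Intended process number per row, as given in the question.
expected <- c(rep(1, 4), 2, 2, rep(1, 3), rep(2, 3), rep(3, 4))

# The attempt: count rows within each (id, new_fase) pair.
attempt <- df1 %>%
  group_by(id, new_fase) %>%
  mutate(occurrence = row_number()) %>%
  ungroup() %>%
  mutate(expected = expected, row = row_number())

# Rows where the grouped counter disagrees with the intended process number.
mismatches <- attempt %>% filter(occurrence != expected)
mismatches
```

With this data the mismatches are rows 11, 13 and 14, all within id 3, where a phase name appears fewer times than the number of processes started so far.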
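I have tried several approaches and read multiple Stack Overflow posts, but I cannot work out the right answer. Any help is appreciated, preferably a tidyverse solution.
Answer 0 (score: 3)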
We can use ave from base R
df2$occurrence <- with(df2, ave(seq_along(id), id, fase, FUN = seq_along))
Or with data.table
library(data.table)
setDT(df2)[, occurrence := seq_len(.N), .(id, fase)]
Answer 1 (score: 2)
df3<- df1 %>%
group_by(id,fase) %>%
mutate(occurrence=row_number())
df3
# A tibble: 18 x 4
# Groups:   id, fase [9]
      id fase  time       occurrence
   <dbl> <chr> <date>          <int>
 1     1 a     2018-01-01          1
 2     1 b     2018-01-02          1
 3     1 c     2018-01-03          1
 4     2 a     2018-01-01          1
 5     2 b     2018-01-02          1
 6     2 c     2018-01-03          1
 7     2 a     2018-01-04          2
 8     2 b     2018-01-05          2
 9     2 c     2018-01-06          2
10     2 a     2018-01-07          3
11     2 b     2018-01-08          3
12     2 c     2018-01-09          3
13     3 a     2018-01-01          1
14     3 b     2018-01-02          1
15     3 c     2018-01-03          1
16     3 a     2018-01-04          2
17     3 b     2018-01-05          2
18     3 c     2018-01-06          2
all(df2==df3)
[1] TRUE
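You split (group) df into parts that share the same id and fase, then simply number the rows within each part. Note that this assumes df is already sorted chronologically, as in the sample data. If it is not, you must sort it by time beforehand.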
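As a rough illustration of the sorting caveat, the sketch below shuffles a few rows and sorts them with arrange() before numbering. The data frame df_unsorted and its values are hypothetical; only the column names id, fase and time are taken from this answer.

```r
library(dplyr)

# Hypothetical, deliberately shuffled rows for a single id.
df_unsorted <- tibble(
  id   = c(2, 2, 2, 2),
  fase = c("a", "b", "a", "b"),
  time = as.Date(c("2018-01-04", "2018-01-02", "2018-01-01", "2018-01-05"))
)

df_numbered <- df_unsorted %>%
  arrange(id, time) %>%                  # chronological order within each id
  group_by(id, fase) %>%
  mutate(occurrence = row_number()) %>%  # count repeats of each fase per id
  ungroup()

df_numbered
```

Without the arrange() step, the first row (2018-01-04) would be numbered 1 even though it is the second time fase "a" occurs.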
Answer 2 (score: 0)
I found this temporary solution (thanks to iod's answer for the use of group_by and mutate in the first example).
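library(tidyr) # fill() comes from tidyr

df1 %>%
  filter(is.na(old_fase) | old_fase == "Finished") %>% # rows that mark the beginning of a new process
  group_by(id) %>%
  mutate(occurrence = row_number()) %>% # number the process starts within each id
  select(id, time, occurrence) %>%
  left_join(df1, ., by = c("id", "time")) %>%
  fill(occurrence) # carry each process number down to the rows that follow its start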
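A possibly shorter variant of the same idea, offered only as a sketch: instead of filtering, joining and filling, take a running count of the start rows within each id with cumsum(). It assumes, as the code above does, that a new process begins exactly where old_fase is NA or "Finished", and that rows are already in chronological order within each id. df1 is rebuilt here so the block runs on its own; df_cumsum is an illustrative name.

```r
library(dplyr)

# Same values as df1 in the question, rebuilt compactly so this block is self-contained.
df1 <- tibble(
  id = rep(c("1", "2", "3"), times = c(2, 4, 10)),
  time = as.Date(c(17453, 17458, 17453, 17462, 17727, 17735, 17453, 17484,
                   17568, 17665, 17665, 17709, 17727, 17727, 17757, 17819),
                 origin = "1970-01-01"),
  old_fase = c(NA, "Fase 1", NA, "Fase 1", "Finished", "Fase 1", NA, "Fase 1",
               "Fase 2A", "Finished", "Fase 2A", "Fase 2B", "Finished",
               "Fase 2B", "Fase 1", "Fase 2A"),
  new_fase = c("Fase 1", "Finished", "Fase 1", "Finished", "Fase 1", "Finished",
               "Fase 1", "Fase 2A", "Finished", "Fase 2A", "Fase 2B", "Finished",
               "Fase 2B", "Fase 1", "Fase 2A", "Fase 2B")
)

df_cumsum <- df1 %>%
  group_by(id) %>%
  # TRUE on each row that starts a process; the running sum is the process number.
  mutate(occurrence = cumsum(is.na(old_fase) | old_fase == "Finished")) %>%
  ungroup()

df_cumsum
```

Because nothing is joined on time, this version does not depend on (id, time) pairs being unique among the start rows, which the left_join approach relies on.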