Counting occurrences of combinations of values in R

Time: 2018-11-01 14:58:10

Tags: r dplyr tidyverse

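I am working with data on different cases that go through different phases over a period of time. Each case has a unique ID number. A process can start in several ways and ends with "Finished" (except for processes that are still in progress). A case can go through the process multiple times. The data looks like this: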

library(dplyr)
df1 <- structure(list(id = c("1", "1", "2", "2", "2", "2", "3", "3", 
"3", "3", "3", "3", "3", "3", "3", "3"), time = structure(c(17453, 
17458, 17453, 17462, 17727, 17735, 17453, 17484, 17568, 17665, 
17665, 17709, 17727, 17727, 17757, 17819), class = "Date"), old_fase = 
c(NA, "Fase 1", NA, "Fase 1", "Finished", "Fase 1", NA, "Fase 1", "Fase 2A", 
"Finished", "Fase 2A", "Fase 2B", "Finished", "Fase 2B", "Fase 1", 
"Fase 2A"), new_fase = c("Fase 1", "Finished", "Fase 1", "Finished", 
"Fase 1", "Finished", "Fase 1", "Fase 2A", "Finished", "Fase 2A", 
"Fase 2B", "Finished", "Fase 2B", "Fase 1", "Fase 2A", "Fase 2B"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -16L))

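For my analysis, I want to create a new id based on the occurrence of each process per id. Using group_by on "id" and "new_fase" followed by mutate produces the incorrect result below. This happens because "Fase 2B" occurs for the first time in row 11.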

df1 %>% 
group_by(id,new_fase) %>% 
mutate(occurrence=row_number())

The correct solution should look like this:

df1 %>% 
mutate(occurrence = c(rep(1, 4), 2, 2, rep(1, 3), rep(2, 3), rep(3, 4)))

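I have tried several approaches and read multiple Stack Overflow posts, but I cannot work out the right answer. Any help is appreciated, preferably with a tidyverse solution.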

3 Answers:

Answer 0: (score: 3)

We can use ave from base R

df2$occurrence <- with(df2, ave(seq_along(id), id, fase, FUN = seq_along))

Or with data.table

library(data.table)
setDT(df2)[, occurrence := seq_len(.N), .(id, fase)]

Answer 1: (score: 2)

df3<- df1 %>% 
  group_by(id,fase) %>% 
  mutate(occurrence=row_number())

df3
# A tibble: 18 x 4
# Groups:   id, fase [9]
      id fase  time       occurrence
   <dbl> <chr> <date>          <int>
 1     1 a     2018-01-01          1
 2     1 b     2018-01-02          1
 3     1 c     2018-01-03          1
 4     2 a     2018-01-01          1
 5     2 b     2018-01-02          1
 6     2 c     2018-01-03          1
 7     2 a     2018-01-04          2
 8     2 b     2018-01-05          2
 9     2 c     2018-01-06          2
10     2 a     2018-01-07          3
11     2 b     2018-01-08          3
12     2 c     2018-01-09          3
13     3 a     2018-01-01          1
14     3 b     2018-01-02          1
15     3 c     2018-01-03          1
16     3 a     2018-01-04          2
17     3 b     2018-01-05          2
18     3 c     2018-01-06          2

all(df2==df3)
[1] TRUE

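You split (group) df into parts that each share the same id and fase, then simply number the rows within each part. Note that this assumes df is already sorted in chronological order, as in the sample data. If it is not, you have to sort it by time beforehand.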
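As a rough illustration of the sorting caveat, here is a minimal sketch that orders the rows by time before numbering them within each id/fase group. The data frame `cases` and its values are hypothetical (made up for this example and deliberately out of order); only arrange, group_by, mutate and row_number from dplyr are used.

```r
library(dplyr)

# Hypothetical data, deliberately NOT in chronological order
cases <- data.frame(
  id   = c(1, 1, 1, 1),
  fase = c("a", "b", "a", "b"),
  time = as.Date(c("2018-01-04", "2018-01-02", "2018-01-01", "2018-01-05")),
  stringsAsFactors = FALSE
)

numbered <- cases %>%
  arrange(id, time) %>%                  # sort chronologically within each id first
  group_by(id, fase) %>%                 # one group per id/fase combination
  mutate(occurrence = row_number()) %>%  # number the rows inside each group
  ungroup()

numbered
```

Without the arrange step, the row dated 2018-01-04 would be numbered as the first occurrence of "a" even though it is chronologically the second.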

Answer 2: (score: 0)

I found this temporary solution (thanks to iod's solution using group_by and mutate in the first example).

library(tidyr) # fill() comes from tidyr

df1 %>% 
  filter(is.na(old_fase) | old_fase == "Finished") %>% # marks the beginning of a new process
  group_by(id) %>% 
  mutate(occurrence = row_number()) %>% 
  select(id, time, occurrence) %>% 
  left_join(df1, ., by = c("id", "time")) %>% 
  fill(occurrence)
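The same idea, that a row whose old_fase is NA or "Finished" marks the start of a new process, can possibly be expressed without the join and fill by taking a cumulative sum of that condition within each id. This is only a sketch under assumptions: `cases` is a hypothetical miniature of the question's data, and the rows are assumed to be in chronological order within each id.

```r
library(dplyr)

# Hypothetical miniature of the question's data: one id, two runs of the process
cases <- data.frame(
  id       = c("1", "1", "1", "1"),
  old_fase = c(NA, "Fase 1", "Finished", "Fase 1"),
  new_fase = c("Fase 1", "Finished", "Fase 1", "Finished"),
  stringsAsFactors = FALSE
)

numbered <- cases %>%
  group_by(id) %>%
  # TRUE on every row that starts a new process; cumsum turns the flags into a running counter
  mutate(occurrence = cumsum(is.na(old_fase) | old_fase == "Finished")) %>%
  ungroup()

numbered
```

Because the counter only increases on rows that start a process, every row in between inherits the number of the run it belongs to, which is what fill(occurrence) achieves in the version above.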