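Identifying missing observations within groups

Time: 2019-09-27 12:14:20

Tags: r data-manipulation

I am having some difficulty with my code and hope some of you can help.

The dataset looks like this: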

df <- data.frame("group" = c("A", "A", "A","A_1", "A_1", "B","B","B_1"), 
                 "id" = c("id1", "id2", "id3", "id2", "id3", "id5","id1","id1"), 
                 "time" = c(1,1,1,3,3,2,2,5),
                 "Val" = c(10,10,10,10,10,12,12,12))

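"group" is the group that the individual "id" belongs to. "A_1" indicates that a subject has left the group.

For example, subject "id1" leaves group "A", which becomes group "A_1", where only "id2" and "id3" are members. Similarly, "id5" leaves group "B", which becomes "B_1" with only "id1" as a member.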
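To make the membership description concrete, here is a minimal base R sketch. The helper names `members`, `left_A` and `left_B` are illustrative only, and `stringsAsFactors = FALSE` is an assumption made so that the ids stay plain strings. It lists the ids under each label and checks which id is in the original group but missing from the corresponding "_1" group:

```r
# Same example data as above, with ids kept as character (assumption)
df <- data.frame(
  group = c("A", "A", "A", "A_1", "A_1", "B", "B", "B_1"),
  id    = c("id1", "id2", "id3", "id2", "id3", "id5", "id1", "id1"),
  time  = c(1, 1, 1, 3, 3, 2, 2, 5),
  Val   = c(10, 10, 10, 10, 10, 12, 12, 12),
  stringsAsFactors = FALSE
)

# ids listed under each group label
members <- split(df$id, df$group)

# the subject who left is in the original group but not in the "_1" group
left_A <- setdiff(members[["A"]], members[["A_1"]])   # "id1"
left_B <- setdiff(members[["B"]], members[["B_1"]])   # "id5"
```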

What I would like in the final dataset is the opposite kind of group identification. It should look like this:

final <- data.frame("group" = c("A", "A", "A","A_1", "B","B","B_1"), 
                     "id" = c("id1", "id2", "id3", "id1", "id5","id1","id5"), 
                     "time" = c(1,1,1,3,2,2,5),
                     "Val" = c(10,10,10,10,12,12,12),
                     "groupid" = c("A", "A", "A","A", "B","B","B"))

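"A_1" and "B_1" should each identify only the subject that left the original group ("id1" and "id5" respectively), not the subjects that remain.

Does anyone have a suggestion for how I could do this systematically?

Thanks in advance for your help.


Follow-up:

My data is somewhat more complex than the example above: there are multiple "exit" events, and the group identifiers can have different character lengths (e.g. AAA and B). The data looks more like this: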

df2 <- data.frame("group" = c("AAA", "AAA", "AAA","AAA","AAA_1","AAA_1", "AAA_1","AAA_2","AAA_2","B","B","B_1"), 
                  "id" = c("id1", "id2", "id3","id4", "id2", "id3","id4", "id2","id3", "id5","id1","id1"), 
                  "time" = c(1,1,1,1,3,3,3,6,6,2,2,5),
                  "Val" = c(10,10,10,10,10,10,10,10,10,12,12,12))

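At time 3, id1 leaves group AAA, which becomes group AAA_1, and at time 6, id4 leaves group AAA, which becomes group AAA_2. As before, I would like the groups with "_" to identify the id that left the group rather than the ids that remain. The final dataset should therefore look like this: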

final2 <- data.frame("group" = c("A", "A", "A","A","A_1","A_2","B","B","B_1"), 
                  "id" = c("id1", "id2", "id3","id4", "id1", "id4", "id5","id1","id5"), 
                  "time" = c(1,1,1,1,3,6,2,2,5),
                  "Val" = c(10,10,10,10,10,10,12,12,12))
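The rule in the follow-up can be spelled out as follows: a label such as AAA_2 splits into a base name and an exit number, and the subject who left at exit k is the id present at exit k-1 but absent at exit k. Below is a minimal base R sketch of that check. The names `base_name`, `stage`, `keys` and `leavers` are illustrative and not part of the original post, and the labels are assumed to end in an underscore followed by digits whenever an exit has occurred.

```r
df2 <- data.frame(
  group = c("AAA", "AAA", "AAA", "AAA", "AAA_1", "AAA_1", "AAA_1",
            "AAA_2", "AAA_2", "B", "B", "B_1"),
  id    = c("id1", "id2", "id3", "id4", "id2", "id3", "id4",
            "id2", "id3", "id5", "id1", "id1"),
  time  = c(1, 1, 1, 1, 3, 3, 3, 6, 6, 2, 2, 5),
  Val   = c(10, 10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12),
  stringsAsFactors = FALSE
)

# Split each label into a base name and an exit number (0 = original group)
has_suffix <- grepl("_[0-9]+$", df2$group)
base_name  <- sub("_[0-9]+$", "", df2$group)
stage      <- integer(nrow(df2))
stage[has_suffix] <- as.integer(sub("^.*_", "", df2$group[has_suffix]))

# One row per (base name, exit number) with exit number > 0
keys <- unique(data.frame(base_name, stage, stringsAsFactors = FALSE))
keys <- keys[keys$stage > 0, ]

# For each exit, the leaver is in the previous stage but not in this one
leavers <- do.call(rbind, lapply(seq_len(nrow(keys)), function(r) {
  b <- keys$base_name[r]
  k <- keys$stage[r]
  before <- df2$id[base_name == b & stage == k - 1]
  after  <- df2$id[base_name == b & stage == k]
  gone   <- setdiff(before, after)
  data.frame(group = rep(paste0(b, "_", k), length(gone)), id = gone,
             stringsAsFactors = FALSE)
}))

# leavers pairs AAA_1 with id1, AAA_2 with id4, and B_1 with id5
```

Because the base name comes from stripping the suffix rather than from a fixed number of characters, the same code handles "AAA" and "B" alike.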

Thanks for your help.

1 Answer:

Answer 0 (score: 2)

Well, you can try dplyr along the following lines. It may not be elegant, but it gets the result. The idea is to first find the ids that are in a group ... but not in the corresponding ..._1 group and change their group label, then fetch the other rows, and then rbind the two together:

library(dplyr)

# first, find the ids that are missing from the ..._1 groups
# and change their group to ..._1
dups <- df %>%
  group_by(id, groupid = substr(group, 1, 1)) %>%
  filter(n() == 1) %>%
  mutate(group = paste0(group, '_1')) %>%
  left_join(df %>% select(group, time, Val) %>% distinct(), by = 'group') %>%
  select(group, id, time = time.y, Val = Val.y) %>%
  ungroup()

dups
#> # A tibble: 2 x 5
#>   groupid group id     time   Val
#>   <chr>   <chr> <fct> <dbl> <dbl>
#> 1 A       A_1   id1       3    10
#> 2 B       B_1   id5       5    12

# now select the rows of the original groups (one-character labels):
dups2 <- df %>%
  filter(nchar(as.character(group)) == 1) %>%
  mutate(groupid = substr(group, 1, 1))

dups2
#>   group  id time Val groupid
#> 1     A id1    1  10       A
#> 2     A id2    1  10       A
#> 3     A id3    1  10       A
#> 4     B id5    2  12       B
#> 5     B id1    2  12       B

Finally, rbind() them, arrange() them, and order() the columns (done with select() here):

rbind(dups, dups2) %>%
  arrange(group) %>%
  select(group, id, time, Val, groupid)

#> # A tibble: 7 x 5
#>   group id     time   Val groupid
#>   <chr> <fct> <dbl> <dbl> <chr>
#> 1 A     id1       1    10 A
#> 2 A     id2       1    10 A
#> 3 A     id3       1    10 A
#> 4 A_1   id1       3    10 A
#> 5 B     id5       2    12 B
#> 6 B     id1       2    12 B
#> 7 B_1   id5       5    12 B

Hope that helps!


EDIT

With some work you can generalize it. Here is my attempt, I hope it helps.

library(dplyr)
df3 <- df2

# you have to set up a couple of fields you need:
# relabel every group as <first letter>_<exit number>, using _0 for the original group
grp  <- as.character(df2$group)
last <- substr(grp, nchar(grp), nchar(grp))   # last character of each label
df3$group <- ifelse(
  last %in% c(0:9),
  paste0(substr(grp, 1, 1), "_", last),
  paste0(substr(grp, 1, 1), "_0")
  )

# util is the exit number plus one, so the original group has util 1
df3$util <- as.numeric(substr(df3$group, 3, 3)) + 1

# two empty lists to populate with a nested loop:
changed <- list()
final_changed <- list()

Now we first find who is changing and then apply the change. The idea is the same as in the previous part:

for (j in c("A", "B")) {
  df3_ <- df3[substr(df3$group, 1, 1) == j, ]
  # every stage except the last one (same elements as the original 1:length(...)-1 index)
  for (i in head(unique(df3_$util), -1)) {
    temp1 <- df3_[df3_$util == i, ]       # members at this stage
    temp2 <- df3_[df3_$util == i + 1, ]   # members at the next stage
    changes <- temp1[!temp1$id %in% temp2$id, ]   # ids that disappear
    changes$group <- paste0(j, '_', i)
    # take time and Val from the next stage; duplicates are dropped by distinct() later
    changes <- changes %>% left_join(temp2, by = 'group') %>%
      select(group, id = id.x, time = time.y, Val = Val.y)

    changed[[i]] <- changes
  }
  final_changed[[j]] <- changed
}

change <- do.call(rbind, (do.call(Map, c(f = rbind, final_changed)))) %>% distinct()
change
#>   group  id time Val
#> 1   A_1 id1    3  10
#> 2   B_1 id5    5  12
#> 3   A_2 id4    6  10

Then put the remaining rows together:
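One way this last step might look, as a base R sketch rather than the answer's own code: keep the rows of the original groups, relabel them with the same one-letter label used for `change`, and append `change`. To keep the block self-contained, `change` is typed in by hand from the printed result, and the names `originals` and `combined` are illustrative assumptions.

```r
df2 <- data.frame(
  group = c("AAA", "AAA", "AAA", "AAA", "AAA_1", "AAA_1", "AAA_1",
            "AAA_2", "AAA_2", "B", "B", "B_1"),
  id    = c("id1", "id2", "id3", "id4", "id2", "id3", "id4",
            "id2", "id3", "id5", "id1", "id1"),
  time  = c(1, 1, 1, 1, 3, 3, 3, 6, 6, 2, 2, 5),
  Val   = c(10, 10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12),
  stringsAsFactors = FALSE
)

# 'change' copied by hand from the printed result (assumption for self-containment)
change <- data.frame(
  group = c("A_1", "B_1", "A_2"),
  id    = c("id1", "id5", "id4"),
  time  = c(3, 5, 6),
  Val   = c(10, 12, 10),
  stringsAsFactors = FALSE
)

# rows of the original groups: labels without a trailing "_<number>"
originals <- df2[!grepl("_[0-9]+$", df2$group), ]
originals$group <- substr(originals$group, 1, 1)   # one-letter label, as in 'change'

# append the leavers and sort by group, then time (ties keep their original order)
combined <- rbind(originals, change)
combined <- combined[order(combined$group, combined$time), ]
rownames(combined) <- NULL
```

With the example data, `combined` has the same nine rows, in the same order, as the `final2` data frame in the question.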