After grouping by application and user ID, retrieve all rows of the groups that contain a specific text

Time: 2018-11-21 16:40:41

Tags: r dplyr data-manipulation

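Once a user completes the digital signing step, the column is_digitally_signed becomes YES. What I am trying to do: if the digital signing was completed at any step, I want to retrieve all rows for the same application_id and user_id. Please see my desired output below.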

R code to reproduce my dataset

library(data.table)
library(dplyr)

# The three statuses and dates are repeated explicitly for each of the three applications
df <- data.table(application_id = c(1,1,1,2,2,2,3,3,3), 
                 user_id = c(123,123,123,456,456,456,789,789,789), 
                 application_status = rep(c("incomplete", "details_verified", "complete"), times = 3),
                 date = rep(c("01/01/2018", "02/01/2018", "03/01/2018"), times = 3),
                 is_digitally_signed = c("NULL", "NULL", "YES", "NULL", "NULL", "NULL", "NULL", "YES", "NULL")) %>%
  mutate(date = as.Date(date, "%d/%m/%Y"))

which gives the output

df
  application_id user_id application_status       date is_digitally_signed
              1     123         incomplete  2018-01-01                NULL
              1     123   details_verified  2018-01-02                NULL
              1     123           complete  2018-01-03                 YES
              2     456         incomplete  2018-01-01                NULL
              2     456   details_verified  2018-01-02                NULL
              2     456           complete  2018-01-03                NULL
              3     789         incomplete  2018-01-01                NULL
              3     789   details_verified  2018-01-02                 YES
              3     789           complete  2018-01-03                NULL

My (failed) attempt

df %>% group_by(application_id,user_id) %>% filter_all(all.vars(. == "YES"))

Desired result

application_id user_id application_status       date is_digitally_signed
              1     123         incomplete 2018-01-01                NULL
              1     123   details_verified 2018-01-02                NULL
              1     123           complete 2018-01-03                 YES
              3     789         incomplete 2018-01-01                NULL
              3     789   details_verified 2018-01-02                 YES
              3     789           complete 2018-01-03                NULL

2 answers:

Answer 0: (score: 3)

dplyr

We can use filter together with any, which checks whether a given group has at least one record with is_digitally_signed == 'YES':

library(dplyr)

df %>% 
  group_by(application_id, user_id) %>%
  filter(any(is_digitally_signed == "YES"))

Or use the all function to keep only the groups in which not every is_digitally_signed == "NULL":

df %>% 
  group_by(application_id, user_id) %>%
  filter(!all(is_digitally_signed == "NULL"))
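
To see why the grouped filter keeps whole groups, it can help to compute the per-group flag explicitly. The following is only a minimal sketch: it assumes the question's data rebuilt as a plain data.frame (without the date column), and the names flags, kept and has_signed are introduced here purely for illustration.

```r
library(dplyr)

# Assumption: same values as the question's df, rebuilt as a plain data.frame
df <- data.frame(
  application_id = rep(c(1, 2, 3), each = 3),
  user_id = rep(c(123, 456, 789), each = 3),
  application_status = rep(c("incomplete", "details_verified", "complete"), times = 3),
  is_digitally_signed = c("NULL", "NULL", "YES", "NULL", "NULL", "NULL", "NULL", "YES", "NULL"),
  stringsAsFactors = FALSE
)

# One row per group: TRUE when the group contains at least one "YES"
# (has_signed is a hypothetical column name)
flags <- df %>%
  group_by(application_id, user_id) %>%
  summarise(has_signed = any(is_digitally_signed == "YES")) %>%
  ungroup()

# The grouped filter keeps every row of the groups whose flag is TRUE
kept <- df %>%
  group_by(application_id, user_id) %>%
  filter(any(is_digitally_signed == "YES")) %>%
  ungroup()
```

Because any() returns a single TRUE or FALSE per group, filter applies that one value to every row of the group, which is what keeps or drops the group as a whole.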

data.table

Since you have already loaded the data as a data.table, we can also use data.table:

library(data.table)
dt = setDT(df)
dt[dt[,.I[any(is_digitally_signed == "YES")], by=.(application_id, user_id)]$V1,]
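
The .I idiom above may be easier to follow when split into two steps. This is a sketch under the assumption that the table holds the same values as in the question (only the columns needed here are rebuilt); idx and res are illustrative names.

```r
library(data.table)

# Assumption: same grouping and flag values as the question's data
dt <- data.table(
  application_id = rep(c(1, 2, 3), each = 3),
  user_id = rep(c(123, 456, 789), each = 3),
  is_digitally_signed = c("NULL", "NULL", "YES", "NULL", "NULL", "NULL", "NULL", "YES", "NULL")
)

# Step 1: per group, .I holds the row numbers in the original table.
# Indexing it with a single TRUE keeps all of them; a single FALSE keeps none.
idx <- dt[, .I[any(is_digitally_signed == "YES")],
          by = .(application_id, user_id)]$V1

# Step 2: subset the original table by the collected row numbers
res <- dt[idx]
```

Subsetting by row number leaves the original column order untouched, whereas the .SD variant returns the grouping columns first.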

Or using .SD

dt[,.SD[any(is_digitally_signed == "YES")], by=.(application_id, user_id)]

Output:

# A tibble: 6 x 5
# Groups:   application_id, user_id [2]
  application_id user_id application_status date       is_digitally_signed
           <dbl>   <dbl> <fct>              <date>     <fct>              
1              1     123 incomplete         2018-01-01 NULL               
2              1     123 details_verified   2018-01-02 NULL               
3              1     123 complete           2018-01-03 YES                
4              3     789 incomplete         2018-01-01 NULL               
5              3     789 details_verified   2018-01-02 YES                
6              3     789 complete           2018-01-03 NULL

Answer 1: (score: 3)

Since there is only one column to test, we can simply use filter together with any

library(dplyr)
df %>% 
   group_by(application_id,user_id) %>% 
    filter(any(is_digitally_signed  == "YES"))
# A tibble: 6 x 5
# Groups:   application_id, user_id [2]
#  application_id user_id application_status date       is_digitally_signed
#           <dbl>   <dbl> <chr>              <date>     <chr>              
#1              1     123 incomplete         2018-01-01 NULL               
#2              1     123 details_verified   2018-01-02 NULL               
#3              1     123 complete           2018-01-03 YES                
#4              3     789 incomplete         2018-01-01 NULL               
#5              3     789 details_verified   2018-01-02 YES                
#6              3     789 complete           2018-01-03 NULL               

Or another option is to use %in%, which returns a single TRUE/FALSE value per group that is then recycled across the group's rows

df %>% 
   group_by(application_id,user_id) %>% 
   filter("YES" %in% is_digitally_signed)

Or we can use base R

df[with(df, ave(is_digitally_signed == "YES", application_id,user_id, FUN = any)),]
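
The base R one-liner can also be unpacked to show what ave returns. This is a sketch only, assuming the question's values rebuilt as a plain data.frame; keep and res are illustrative names.

```r
# Assumption: same grouping and flag values as the question's data
df <- data.frame(
  application_id = rep(c(1, 2, 3), each = 3),
  user_id = rep(c(123, 456, 789), each = 3),
  is_digitally_signed = c("NULL", "NULL", "YES", "NULL", "NULL", "NULL", "NULL", "YES", "NULL"),
  stringsAsFactors = FALSE
)

# ave() applies any() within each application_id/user_id group and
# returns the group's result once per row, so keep has one value per row of df
keep <- with(df, ave(is_digitally_signed == "YES", application_id, user_id, FUN = any))

# A row-length logical vector can be used directly as a row index
res <- df[keep, ]
```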