Filter individuals with two different conditions?

Time: 2020-11-11 20:37:58

Tags: r filter dplyr subset

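TLDR: I need to filter individuals by two different conditions.

Basically, given the example below, I need to find which individuals have eaten both cheese and bread, and return the rows that match those foods. In the example, these are alibaba, mary and steve.

Usually, multiple filter conditions in dplyr are straightforward, but this one spans different rows, so I am finding it difficult. I did come up with a long solution, but I am sure there is a more efficient way.

I am working with a large dataset, so speed is essential.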


library(dplyr)  # provides %>%, distinct(), filter(), group_by(), summarise()

set.seed(1111)
df = data.frame(ID = sample(c("bob","steve","mary","alibaba"),20,replace = TRUE))
                
set.seed(1311)                
df$food = sample(c("cheese","bread","olives"),20, replace = TRUE)

# finding which individuals have both cheese and bread
index = df %>% distinct(ID,food, .keep_all = TRUE) %>% 
  filter(food == "cheese" | food == "olives") %>% 
  group_by(ID) %>% 
  summarise(freq = n()) %>% 
  filter(freq > 1) %>% {as.vector(.$ID)}

# returning the rows for the individuals that have both cheese and bread
df %>% filter(ID %in% index,food == "cheese" | food == "olives")


1 answer:

Answer 0 (score: 0)

After grouping by 'ID', filter keeps the groups that have both 'cheese' and 'olives' in 'food' (the check is wrapped in all), and a second expression (food %in% c('cheese', 'olives')) subsets the rows to those two foods.

library(dplyr)
df %>%
     group_by(ID) %>%
     filter(all(c('cheese', 'olives') %in% food), food %in% c('cheese', 'olives'))

-output

# A tibble: 13 x 2
# Groups:   ID [3]
#   ID      food
#   <chr>   <chr>
# 1 alibaba olives
# 2 steve   olives
# 3 steve   olives
# 4 steve   olives
# 5 alibaba cheese
# 6 steve   olives
# 7 steve   olives
# 8 mary    cheese
# 9 alibaba olives
#10 mary    olives
#11 steve   cheese
#12 alibaba olives
#13 steve   olives

Or another option, which could be faster, is to do the filter first, then group and keep the groups that have 2 distinct values in 'food'

df %>%
     filter(food %in% c('cheese', 'olives')) %>%
     group_by(ID) %>%
     filter(n_distinct(food) == 2)

Or another option with data.table
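
The data.table code itself is not shown here. As a rough sketch only, and not necessarily what the answerer wrote, the same filter-then-group idea could be expressed in data.table roughly as follows. The toy data and the names dt, wanted and res are made up for illustration; uniqueN() is data.table's counterpart to n_distinct().

```r
library(data.table)

# Hand-made toy data (hypothetical, not the sampled df from the question)
dt <- data.table(
  ID   = c("bob", "bob", "mary", "mary", "mary", "steve", "steve"),
  food = c("cheese", "bread", "cheese", "olives", "bread", "olives", "olives")
)

# The foods every kept individual must have eaten
wanted <- c("cheese", "olives")

# Step 1: keep only rows whose food is one of the wanted foods.
# Step 2: per ID, return the group's rows (.SD) only if the group
#         contains every wanted food; other groups return NULL and drop out.
res <- dt[food %in% wanted][
  , if (uniqueN(food) == length(wanted)) .SD, by = ID
]

print(res)
```

On this toy data only mary has both foods, so res holds her two rows (cheese and olives). Filtering before grouping means the per-group check runs on fewer rows, which is the reason the answer suggests that order may be faster on large data.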