Filtering a dataset by a vector containing repeated elements

Date: 2018-04-06 15:23:50

Tags: r filter subset

Obligatory "sorry for the opaque title" message.

I have a data.frame:

df <- data.frame( l = rep(letters[1:3], each=3) , 
                  n = rep(1:3, 3)
                 )

I want to subset the data by the grouping variable l using a separate vector, for example:

df[df$l %in% c("a","b"),]

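This works, but now imagine I want to subset using the vector c("a","b","a","a","c","c"). When I try this with R's %in% operator, it only returns the rows for the unique elements of the vector: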

df[df$l %in% c("a","b","a","a","c","c"),]

  l n
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
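
For context, a minimal sketch of why %in% behaves this way: it is a membership test that yields one logical value per row of df, so repeats in the right-hand vector have nothing to act on. The names vec and keep are illustrative only.

```r
df <- data.frame(l = rep(letters[1:3], each = 3), n = rep(1:3, 3))
vec <- c("a", "b", "a", "a", "c", "c")

# %in% tests each element of the left-hand side for membership in the
# right-hand side, so the result has one TRUE/FALSE per row of df.
keep <- df$l %in% vec
length(keep)  # 9: one entry per row, regardless of how long vec is

# Duplicates in vec cannot change a membership test.
identical(keep, df$l %in% unique(vec))  # TRUE
```

Because a logical index can select each row at most once, repeating rows requires either an integer index or a join.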

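Is there an alternative to %in% for filtering a data.frame by a grouping variable using a vector with repeated elements?

Edit: to be clear, in the second case above I want to get: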

   l n
1  a 1
2  a 2
3  a 3
4  b 1
5  b 2
6  b 3
7  a 1
8  a 2
9  a 3
10 a 1
11 a 2
12 a 3
13 c 1
14 c 2
15 c 3
16 c 1
17 c 2
18 c 3

3 answers:

Answer 0 (score: 1)

There must be a better way, but I think this produces the correct result.

library(dplyr)  # provides %>% and filter()

do.call(rbind, lapply(c("a","b","a","a","c","c"), function(x) df %>% filter(l == x)))

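This goes through the vector, filters df for each letter in turn, and then binds the resulting list into a single data frame. dplyr is required for %>% and filter.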

#    l n
# 1  a 1
# 2  a 2
# 3  a 3
# 4  b 1
# 5  b 2
# 6  b 3
# 7  a 1
# 8  a 2
# 9  a 3
# 10 a 1
# 11 a 2
# 12 a 3
# 13 c 1
# 14 c 2
# 15 c 3
# 16 c 1
# 17 c 2
# 18 c 3

To make this easier to use, you can define an operator:

"%filter%" <- function(df, search_list){
  do.call(rbind, lapply(search_list, function(x) df %>% filter(l == x)))
}

MyVec <- c("a","b","a","a","c","c")

df %filter% MyVec

#    l n
# 1  a 1
# 2  a 2
# 3  a 3
# 4  b 1
# 5  b 2
# 6  b 3
# 7  a 1
# 8  a 2
# 9  a 3
# 10 a 1
# 11 a 2
# 12 a 3
# 13 c 1
# 14 c 2
# 15 c 3
# 16 c 1
# 17 c 2
# 18 c 3

On second thought, the operator is rather silly, because it only works for a column named l. This function is more general.

MyFilter <- function(df, search_list, column_name){
  do.call(rbind, lapply(search_list, function(x) df %>% filter(get(column_name) == x)))
}

MyFilter(df, MyVec, "l")

#    l n
# 1  a 1
# 2  a 2
# 3  a 3
# 4  b 1
# 5  b 2
# 6  b 3
# 7  a 1
# 8  a 2
# 9  a 3
# 10 a 1
# 11 a 2
# 12 a 3
# 13 c 1
# 14 c 2
# 15 c 3
# 16 c 1
# 17 c 2
# 18 c 3

Answer 1 (score: 0)

df <- data.frame(l = rep(letters[1:3], each=3), n = rep(1:3, 3))

do.call(rbind, lapply(c("a","b","a","a","c","c"), function(x) df[df$l %in% x, ]))

   l n
1  a 1
2  a 2
3  a 3
4  b 1
5  b 2
6  b 3
11 a 1
21 a 2
31 a 3
12 a 1
22 a 2
32 a 3
7  c 1
8  c 2
9  c 3
71 c 1
81 c 2
91 c 3

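Edit: if sequentially numbered rows matter, save the result as df_new and reset its row names:

rownames(df_new) <- NULL

The row names of the newly saved data frame will then run from 1 to 18.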
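
As a possible variation on the same idea, the row positions can be collected first and df indexed once, which avoids calling rbind on many small pieces. This is only a sketch in base R; rows_by_group, idx, and vec are illustrative names, not part of the answer above.

```r
df <- data.frame(l = rep(letters[1:3], each = 3), n = rep(1:3, 3))
vec <- c("a", "b", "a", "a", "c", "c")

# Row positions of df grouped by the value of l:
# list(a = 1:3, b = 4:6, c = 7:9)
rows_by_group <- split(seq_len(nrow(df)), df$l)

# Look up the positions for every element of vec, duplicates included.
idx <- unlist(rows_by_group[vec], use.names = FALSE)

# An integer index may repeat, so each group appears once per occurrence in vec.
df_new <- df[idx, ]
rownames(df_new) <- NULL
```

Values in vec that do not occur in df$l contribute no rows, which matches the behaviour of the lapply version.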

Answer 2 (score: 0)

I guess data.frame(l = c("a","b","a","a","c","c")) %>% inner_join(df, by = 'l') would do the trick.
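
To illustrate why a join repeats rows, here is a self-contained sketch that swaps in base R's merge() for dplyr's inner_join so that it runs without extra packages. Unlike the one-liner above, merge() sorts by the key, so the sketch adds two hypothetical helper columns, pos and row, to restore the order of the vector.

```r
df <- data.frame(l = rep(letters[1:3], each = 3), n = rep(1:3, 3),
                 stringsAsFactors = FALSE)
vec <- c("a", "b", "a", "a", "c", "c")

# Helper columns (assumptions for this sketch): pos is each element's
# position in vec, row is each row's original position in df.
lookup <- data.frame(l = vec, pos = seq_along(vec), stringsAsFactors = FALSE)
df_keyed <- cbind(df, row = seq_len(nrow(df)))

# merge() does an inner join by default: every row of df is paired with
# every element of vec that has the same l, so duplicates multiply rows.
joined <- merge(lookup, df_keyed, by = "l")

# Restore the order of vec, then the original row order within each group.
joined <- joined[order(joined$pos, joined$row), c("l", "n")]
rownames(joined) <- NULL
```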
