我有一个包含组和子组的大型数据集。我想根据"完整性"过滤数据。这些组中的所有级别(a和b)都应该出现
一个小例子
group <- rep(c("A", "B", "C"), each=5)
a <- c(1,1,2,2,3,1,1,1,3,3,1,2,2,3,3)
b <- c("a", "a", "a", "b", "c", "a", "a", "a", "b", "c", "a", "b", "b", "b", "b")
df <- data.frame(group, a, b)
group a b
1 A 1 a
2 A 1 a
3 A 2 a
4 A 2 b
5 A 3 c
6 B 1 a
7 B 1 a
8 B 1 a
9 B 3 b
10 B 3 c
11 C 1 a
12 C 2 b
13 C 2 b
14 C 3 b
15 C 3 b
所以这里只有A被认为是完整的,因为a和b的所有级别都会出现。是否有一种有效(灵活)的方法来过滤这些条件?
答案 0 :(得分:3)
我会做这样的事情:
sapply(split(df, df$group), function(x) all(a %in% x$a) & all(b %in% x$b))
## A B C
## TRUE FALSE FALSE
答案 1 :(得分:2)
我会尝试data.table
,例如
library(data.table)
setDT(df)[, indx := length(unique(a)) + length(unique(b))]
df[, indx2 := length(unique(a)) + length(unique(b)), by = group]
df[indx == indx2]
# group a b indx indx2
# 1: A 1 a 6 6
# 2: A 1 a 6 6
# 3: A 2 a 6 6
# 4: A 2 b 6 6
# 5: A 3 c 6 6
或者对于更通用的解决方案,您可以指定列名称,然后使用.SDcols
,例如
cols <- c("a", "b")
setDT(df)[, indx := Reduce(sum, lapply(.SD, function(x) length(unique(x)))), .SDcols = cols]
df[, indx2 := Reduce(sum, lapply(.SD, function(x) length(unique(x)))), .SDcols = cols, by = group]
df[indx == indx2]
答案 2 :(得分:2)
以下是dplyr
解决方案:
library(dplyr)
df %>%
group_by(group) %>%
mutate(
a_complete = all(unique(df$a) %in% a),
b_complete = all(unique(df$b) %in% b)
) %>%
filter(a_complete, b_complete) %>%
select(- ends_with("complete"))