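I have a data.frame with several columns and want to filter out low-frequency rows based on a combination of variables. As an example, take a sex variable (Male/Female) and a cholesterol variable (High/Low; it is named Age in the code below). My data frame then looks like this: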
set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df
index Sex Age
1 1 Male High
2 2 Female High
3 3 Male High
4 4 Female High
5 5 Female High
6 6 Male High
7 7 Female High
8 8 Female High
9 9 Female Low
10 10 Male Low
11 11 Female High
12 12 Male High
13 13 Female High
14 14 Female High
15 15 Male Low
16 16 Female Low
17 17 Male High
18 18 Male Low
19 19 Male Low
20 20 Female Low
Now I want to keep only the Sex/Age combinations that occur more than 3 times:
table(df[,2:3])
        Age
Sex      High Low
  Female    8   3
  Male      5   4
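In other words, I want to keep the indices for Female-High, Male-Low and Male-High.
Notes: 1) my data frame has several variables (unlike the example above), 2) I do not want to use any third-party R packages, and 3) I want it to be fast.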
Answer 0 (score: 7)
Here is a simple approach in base R:
lvls <- interaction(df$Sex, df$Age)
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]
# index Sex Age
#1 1 Male High
#2 2 Female High
#3 3 Male High
#4 4 Female High
#5 5 Female High
#6 6 Male High
#7 7 Female High
#8 8 Female High
#10 10 Male Low
#11 11 Female High
#12 12 Male High
#13 13 Female High
#14 14 Female High
#15 15 Male Low
#17 17 Male High
#18 18 Male Low
#19 19 Male Low
If you have more variables, you can store their names in a vector:
vars <- c("Age", "Sex") # add more
lvls <- interaction(df[, vars])
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]
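The same steps can be wrapped in a small function so that the variable list and the threshold become arguments. This is only a sketch: the name keep_frequent, its parameters, and the hand-built example data are assumptions for illustration and are not part of the original answer.

```r
# Hypothetical helper: keep rows whose combination of `vars`
# occurs more than `min_count` times.
keep_frequent <- function(data, vars, min_count = 3) {
  # One factor level per combination; drop = TRUE discards unused combinations
  lvls <- interaction(data[vars], drop = TRUE)
  counts <- table(lvls)
  # Keep the rows whose level is among the frequent ones
  data[lvls %in% names(counts)[counts > min_count], , drop = FALSE]
}

# Deterministic example data (made up for this sketch):
# Female-High 5, Female-Low 2, Male-High 4, Male-Low 1
df <- data.frame(
  index = 1:12,
  Sex = rep(c("Female", "Male"), times = c(7, 5)),
  Age = rep(c("High", "Low", "High", "Low"), times = c(5, 2, 4, 1))
)

result <- keep_frequent(df, c("Sex", "Age"), min_count = 3)
result$index
# 1 2 3 4 5 8 9 10 11
```

Only the Female-High and Male-High rows survive here, because those are the only combinations that occur more than 3 times in this made-up data.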
Here is a second base R approach using ave:
subset(df, ave(as.integer(factor(Sex)), Sex, Age, FUN = "length") > 3)
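The ave approach can also take the grouping variables from a character vector, which helps when there are many of them. A minimal sketch, assuming the made-up example data below; the names vars and group_size are illustrative only.

```r
# Deterministic example data (made up for this sketch):
# Female-High 5, Female-Low 2, Male-High 4, Male-Low 1
df <- data.frame(
  index = 1:12,
  Sex = rep(c("Female", "Male"), times = c(7, 5)),
  Age = rep(c("High", "Low", "High", "Low"), times = c(5, 2, 4, 1))
)

vars <- c("Sex", "Age")

# ave() returns, for every row, the result of FUN applied to that row's group;
# with FUN = length this is the size of the row's Sex/Age group.
group_size <- ave(seq_len(nrow(df)), df[vars], FUN = length)

result <- df[group_size > 3, ]
result$index
# 1 2 3 4 5 8 9 10 11
```

Passing df[vars] works because ave hands its grouping arguments to interaction, which accepts a single list of factors.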
Answer 1 (score: 4)
OK, here is a base R option:
set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df
merge(
df
, aggregate(rep(1, nrow(df)), by = df[,c("Sex", "Age")], sum)
, by = c("Sex", "Age")
)
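The aggregate function sums up the 1s for each combination, so the merged result carries the count of every Sex/Age group in a column named x. Note that this only attaches the counts; the rows still have to be filtered on that column.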
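To finish the job, the count column can be used to drop the rare combinations. The following is a sketch under stated assumptions: the example data is made up, the count column is named n here for readability, and the final re-sorting by index is an addition because merge reorders rows.

```r
# Deterministic example data (made up for this sketch):
# Female-High 5, Female-Low 2, Male-High 4, Male-Low 1
df <- data.frame(
  index = 1:12,
  Sex = rep(c("Female", "Male"), times = c(7, 5)),
  Age = rep(c("High", "Low", "High", "Low"), times = c(5, 2, 4, 1))
)

# Count rows per Sex/Age combination by summing a column of 1s
counts <- aggregate(
  data.frame(n = rep(1, nrow(df))),
  by = df[, c("Sex", "Age")],
  FUN = sum
)

# Attach the counts to every row, then keep the frequent combinations
merged <- merge(df, counts, by = c("Sex", "Age"))
result <- merged[merged$n > 3, names(df)]

# merge() sorts by the key columns, so restore the original order
result <- result[order(result$index), ]
result$index
# 1 2 3 4 5 8 9 10 11
```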
Answer 2 (score: 4)
We can do this with data.table, and it should be efficient as well:
library(data.table)
setDT(df)[, .SD[.N > 3], .(Sex, Age)]
An alternative is to subset by the row indices in .I.
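A common way to write the .I variant is sketched below. It is an assumption about what the answer intended, not the answerer's own code, and it runs on made-up example data. The requireNamespace guard only keeps the snippet from failing where data.table is not installed, and as.data.table is used instead of setDT so that df is not modified by reference.

```r
# Deterministic example data (made up for this sketch):
# Female-High 5, Female-Low 2, Male-High 4, Male-Low 1
df <- data.frame(
  index = 1:12,
  Sex = rep(c("Female", "Male"), times = c(7, 5)),
  Age = rep(c("High", "Low", "High", "Low"), times = c(5, 2, 4, 1))
)

if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- as.data.table(df)

  # Within each group, .I holds the row numbers and .N the group size;
  # .I[.N > 3] returns the row numbers only for groups larger than 3.
  keep_rows <- dt[, .I[.N > 3], by = .(Sex, Age)]$V1

  # Subset once by row number, in the original row order
  result <- dt[sort(keep_rows)]
  print(result$index)
}
```

Collecting row numbers and subsetting once avoids building .SD for every group, which is the usual reason given for preferring this form on large tables.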
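Answer 3 (score: 1)
Here is an answer using dplyr. It is not a base R solution, even though the OP asks for one; I thought it might be useful for future users who have no such restriction.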
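The code for this answer is not shown in the text. A minimal sketch of how such a filter is usually written with dplyr follows; it is an assumption rather than the answerer's actual code, and it uses made-up example data. The requireNamespace guard only keeps the snippet from failing where dplyr is not installed.

```r
# Deterministic example data (made up for this sketch):
# Female-High 5, Female-Low 2, Male-High 4, Male-Low 1
df <- data.frame(
  index = 1:12,
  Sex = rep(c("Female", "Male"), times = c(7, 5)),
  Age = rep(c("High", "Low", "High", "Low"), times = c(5, 2, 4, 1))
)

if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)

  result <- df %>%
    group_by(Sex, Age) %>%   # one group per Sex/Age combination
    filter(n() > 3) %>%      # n() is the size of the current group
    ungroup()

  print(result$index)
}
```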
Answer 4 (score: 1)
vars <- c("Sex","Age")
max_freq <- 3
new_df <- merge(df, subset(as.data.frame(table(df[,vars])),Freq>max_freq)[1:2])
new_df
# Sex Age index
# 1 Female High 2
# 2 Female High 7
# 3 Female High 14
# 4 Female High 11
# 5 Female High 5
# 6 Female High 4
# 7 Female High 13
# 8 Female High 8
# 9 Male High 6
# 10 Male High 3
# 11 Male High 1
# 12 Male High 17
# 13 Male High 12
# 14 Male Low 10
# 15 Male Low 15
# 16 Male Low 18
# 17 Male Low 19