I want to subset a data frame by a factor, keeping only the factor levels that occur at or above a certain frequency.
# stringsAsFactors = TRUE makes the column a factor in R >= 4.0 as well
df <- data.frame(factor = c(rep("a",5),rep("b",5),rep("c",2)), variable = rnorm(12), stringsAsFactors = TRUE)
This code creates the data frame:
factor variable
1 a -1.55902013
2 a 0.22355431
3 a -1.52195456
4 a -0.32842689
5 a 0.85650212
6 b 0.00962240
7 b -0.06621508
8 b -1.41347823
9 b 0.08969098
10 b 1.31565582
11 c -1.26141417
12 c -0.33364069
I want to drop the factor levels that occur fewer than 5 times. I wrote a for loop, and it works:
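counts <- table(df$factor)  # compute the level counts once
df.new <- df                # start from the full data and remove rows cumulatively
for (i in seq_along(counts)) {
  if (counts[i] < 5) {
    # drop every row belonging to this low-frequency level
    df.new <- df.new[df.new$factor != names(counts)[i], ]
  }
}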
But is there a faster, more elegant solution?
Answer 0 (score: 11)
library(dplyr)
df %>% group_by(factor) %>% filter(n() >= 5)
#factor variable
#1 a 2.0769363
#2 a 0.6187513
#3 a 0.2426108
#4 a -0.4279296
#5 a 0.2270024
#6 b -0.6839748
#7 b -0.3285610
#8 b 0.2625743
#9 b -0.9532957
#10 b 1.4526317
Answer 1 (score: 6)
library(data.table)
setDT(df)[, variable[.N >= 5], by = factor]
## factor V1
## 1: a -0.8204684
## 2: a 0.4874291
## 3: a 0.7383247
## 4: a 0.5757814
## 5: a -0.3053884
## 6: b 1.5117812
## 7: b 0.3898432
## 8: b -0.6212406
## 9: b -2.2146999
## 10: b 1.1249309
Answer 2 (score: 6)
How about this?
df.new <- df[!(as.numeric(df$factor) %in% which(table(df$factor)<5)),]
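One thing to watch with this kind of subsetting: removing the rows does not remove the level itself, so `c` is still listed in `levels()` with a count of zero. A minimal sketch, assuming the column really is a factor (it is created explicitly here, and `set.seed(1)` is only there to make the example reproducible), of cleaning that up with base R's `droplevels()`:

```r
# Assumption: the grouping column is a factor (made explicit here, since
# R >= 4.0 no longer converts strings to factors by default).
set.seed(1)
df <- data.frame(factor = factor(c(rep("a", 5), rep("b", 5), rep("c", 2))),
                 variable = rnorm(12))

# Same filter as above: drop rows whose level occurs fewer than 5 times.
df.new <- df[!(as.numeric(df$factor) %in% which(table(df$factor) < 5)), ]

levels(df.new$factor)         # "c" is still listed as a level
df.new <- droplevels(df.new)  # remove levels that no longer occur
levels(df.new$factor)
```

Whether the empty level matters depends on what happens next; it mostly shows up in `table()` output, plots, and model matrices.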
Answer 3 (score: 3)
Perhaps join against the counts of the levels that pass the filter:
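library(dplyr)
# %>% replaces the old %.% operator, which has been removed from dplyr
common.factors <- df %>% group_by(factor) %>% tally() %>% filter(n >= 5)
# keep only the rows of df whose level appears in common.factors
df.1 <- semi_join(df, common.factors, by = "factor")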
Answer 4 (score: 0)
Try using base functions...
lvl = as.data.frame(table(df$factor))
colnames(lvl) = c('factor','count')
lvl
factor count
1 a 5
2 b 5
3 c 2
df[df$factor %in% lvl[lvl$count>=5,]$factor,]
factor variable
1 a -0.01619026
2 a 0.94383621
3 a 0.82122120
4 a 0.59390132
5 a 0.91897737
6 b 0.78213630
7 b 0.07456498
8 b -1.98935170
9 b 0.61982575
10 b -0.05612874
Answer 5 (score: 0)
This worked for me:
df = df[df$factor %in% names(table(df$factor)) [table(df$factor) >=5],]
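The same idea can also be written with `ave()`, which gives the group size for every row so that `table()` is not called twice. This is only a sketch: `keep_frequent_levels` is a hypothetical helper name, not a function from any package, and `set.seed(42)` is there just for reproducibility.

```r
# Hypothetical helper (not from any package): keep the rows whose group
# occurs at least `min_n` times in column `col`.
keep_frequent_levels <- function(data, col, min_n) {
  groups <- data[[col]]
  # ave() returns, for every row, the size of the group that row belongs to
  group_size <- ave(seq_along(groups), groups, FUN = length)
  data[group_size >= min_n, , drop = FALSE]
}

set.seed(42)
df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12))
df.new <- keep_frequent_levels(df, "factor", 5)
table(df.new$factor)
```

Because it compares per-row counts rather than level names, this should work whether the column is a factor or a plain character vector.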