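An elegant way to remove rare factor levels from a data frame

Time: 2014-06-17 08:34:37

Tags: r subset

I want to subset a data frame by a factor, keeping only the factor levels that occur at least a certain number of times.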

df <- data.frame(factor = c(rep("a",5),rep("b",5),rep("c",2)), variable = rnorm(12))

This code creates the data frame:

   factor    variable
1       a -1.55902013
2       a  0.22355431
3       a -1.52195456
4       a -0.32842689
5       a  0.85650212
6       b  0.00962240
7       b -0.06621508
8       b -1.41347823
9       b  0.08969098
10      b  1.31565582
11      c -1.26141417
12      c -0.33364069

I want to drop the factor levels that occur fewer than 5 times. I wrote a for loop, and it works:

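counts <- table(df$factor)
df.new <- df  # start from the full data so every rare level is removed, not only the last one
for (i in seq_along(counts)) {
  if (counts[i] < 5) {
    df.new <- df.new[df.new$factor != names(counts)[i], ]
  }
}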

But is there a faster, more elegant solution?

6 answers:

Answer 0 (score: 11)

require(dplyr)

df %>% group_by(factor) %>% filter(n() >= 5)
#factor   variable
#1       a  2.0769363
#2       a  0.6187513
#3       a  0.2426108
#4       a -0.4279296
#5       a  0.2270024
#6       b -0.6839748
#7       b -0.3285610
#8       b  0.2625743
#9       b -0.9532957
#10      b  1.4526317
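
Filtering the rows does not by itself remove the now-empty level `c` from the factor, and the result stays grouped. A minimal sketch of one way to handle both, assuming `factor` is stored as a factor column (the `set.seed` call and the name `df.new` are assumptions added for illustration):

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the example is reproducible
df <- data.frame(
  factor = factor(c(rep("a", 5), rep("b", 5), rep("c", 2))),
  variable = rnorm(12)
)

df.new <- df %>%
  group_by(factor) %>%
  filter(n() >= 5) %>%  # keep only groups with at least 5 rows
  ungroup() %>%         # drop the grouping so later steps see a plain table
  droplevels()          # remove levels that no longer occur in the data

levels(df.new$factor)
```

Without `droplevels()`, `levels(df.new$factor)` would still list `c`, which matters for later calls such as `table()` or plotting.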

Answer 1 (score: 6)

library(data.table)
setDT(df)[, variable[.N >= 5], by = factor]

##    factor         V1
## 1:      a -0.8204684
## 2:      a  0.4874291
## 3:      a  0.7383247
## 4:      a  0.5757814
## 5:      a -0.3053884
## 6:      b  1.5117812
## 7:      b  0.3898432
## 8:      b -0.6212406
## 9:      b -2.2146999
## 10:     b  1.1249309
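
The call above returns the values in a column named `V1` and would need one expression per column to keep. A possible variant, sketched here with the `.SD` idiom so that every non-grouping column is kept under its original name (the names `dt` and `dt.new` are assumptions for illustration):

```r
library(data.table)

dt <- data.table(
  factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
  variable = rnorm(12)
)

# For each group, return all of its rows (.SD) only when the group has
# at least 5 rows; groups that fail the test contribute nothing.
dt.new <- dt[, if (.N >= 5) .SD, by = factor]

names(dt.new)
```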

Answer 2 (score: 6)

How about this?
df.new <- df[!(as.numeric(df$factor) %in% which(table(df$factor)<5)),]
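
This one-liner relies on the integer codes of a factor, so it assumes `df$factor` really is a factor. Since R 4.0.0, `data.frame()` no longer converts strings to factors by default, so the column may be a character vector instead. A minimal sketch of a base R variant that works for both types, using `ave()` to compute each row's group size (the name `group.size` is an assumption for illustration):

```r
df <- data.frame(
  factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
  variable = rnorm(12)
)

# ave() applies FUN within each group and returns a vector as long as the
# input, so every row receives the number of rows in its own group.
group.size <- ave(seq_along(df$factor), df$factor, FUN = length)

df.new <- df[group.size >= 5, ]
```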

Answer 3 (score: 3)

Perhaps count the rows per level, keep the frequent levels, and join back against them:

library(dplyr)
# %.% has been removed from dplyr; %>% is its replacement
common.factors <- df %>% group_by(factor) %>% tally() %>% filter(n >= 5)
df.1 <- semi_join(df, common.factors, by = "factor")

Answer 4 (score: 0)

Trying with base functions...

lvl = as.data.frame(table(df$factor))
colnames(lvl) = c('factor','count')
lvl
  factor count
1      a     5
2      b     5
3      c     2

df[df$factor %in% lvl[lvl$count>=5,]$factor,]
   factor    variable
1       a -0.01619026
2       a  0.94383621
3       a  0.82122120
4       a  0.59390132
5       a  0.91897737
6       b  0.78213630
7       b  0.07456498
8       b -1.98935170
9       b  0.61982575
10      b -0.05612874

Answer 5 (score: 0)

This worked for me:

df = df[df$factor %in% names(table(df$factor)) [table(df$factor) >=5],]
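
If the same filter is needed for several columns or thresholds, the line above can be wrapped in a small function. This is only a sketch: `drop_rare_levels`, `column` and `min.count` are hypothetical names that do not come from any package.

```r
# Hypothetical helper: keep the rows whose value in `column` occurs at
# least `min.count` times, then drop factor levels that became unused.
drop_rare_levels <- function(data, column, min.count) {
  counts <- table(data[[column]])
  keep <- names(counts)[counts >= min.count]
  result <- data[data[[column]] %in% keep, , drop = FALSE]
  droplevels(result)
}

df <- data.frame(
  factor = factor(c(rep("a", 5), rep("b", 5), rep("c", 2))),
  variable = rnorm(12)
)
df.new <- drop_rare_levels(df, "factor", 5)
```

Calling `table()` once and reusing the result also avoids tabulating the column twice.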