How to group low-frequency factor levels into an "Other" level in R

Time: 2016-04-06 12:46:17

Tags: r ggplot2 dataframe dplyr summary

# Generate counts table
library(plyr)
library(ggplot2)  # provides the diamonds dataset
example <- data.frame(count(diamonds,c('color', 'cut')))
example[1:3,]

# Excerpt of table
  color       cut freq
1     D      Fair  163
2     D      Good  662
3     D Very Good 1513

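You can easily filter the table for freq > 1000: example[example$freq > 1000,]. I would like to generate a similar table, except that all rows with a value below some threshold, e.g. 1000, are combined into an (Other) row, similar to what happens when you have too many factor levels and call summary(example, maxsum=3).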

     color         cut          freq     
 D      : 5   Fair   : 7   Min.   : 119  
 E      : 5   Good   : 7   1st Qu.: 592  
 (Other):25   (Other):21   Median :1204  
                           Mean   :1541  
                           3rd Qu.:2334  
                           Max.   :4884 
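
To show the summary() lumping in isolation, here is a minimal sketch on a toy factor. The vector x is made up for illustration and is not part of the diamonds data. With maxsum = 3, the two most frequent levels are kept and the rest are summed into "(Other)".

```r
# Toy factor (hypothetical data): a appears 3 times, b twice, c and d once each
x <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# maxsum = 3 keeps the 2 most frequent levels and sums the rest into "(Other)"
s <- summary(x, maxsum = 3)
print(s)
```

Note that this lumps by how often a level occurs, whereas the goal here is to lump by the value in the freq column.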

Example of ideal output:

Ideally, I would like to take this, example[example$color=='J',]:

 color   cut freq
 J      Fair  119
 J      Good  307
 J Very Good  678
 J   Premium  808
 J     Ideal  896

and produce this:

 color       cut freq
     J Very Good  678
     J   Premium  808
     J     Ideal  896
     J   (Other)  426 

Bonus: It would also be great if a ggplot chart like the one below could be created, but with this kind of filtering applied.

ggplot(example, aes(x=color, y=freq)) + geom_bar(aes(fill=cut), stat = "identity")

[Image: stacked bar chart of freq by color, filled by cut]

2 Answers:

Answer 0 (score: 3)

Here is an alternative that uses dplyr to pipe the correct data directly into the ggplot call.

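library(dplyr)
library(ggplot2)
# as.character(cut) keeps each row's own cut label; levels(cut) would only be
# recycled over the rows and line up by coincidence of the row order
example %>% mutate(cut = ifelse(freq < 500, "Other", as.character(cut))) %>%
  group_by(color, cut) %>%
  summarise(freq = sum(freq)) %>%
  ggplot(aes(color, freq, fill = cut)) +
  geom_bar(stat = "identity")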

[Image: stacked bar chart of freq by color, with low-frequency cuts grouped as Other]

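Be sure to detach plyr (for example with detach("package:plyr")), otherwise the output of the dplyr calls will be incorrect: if plyr was attached after dplyr, its summarise() masks dplyr's and ignores the grouping.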
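
If juggling plyr and dplyr is a concern, the same lumping can probably be done in base R alone. The following is only a sketch: the data frame example_small is copied by hand from the color J rows shown in the question, and the threshold of 500 is an assumption chosen to match the cutoff used above.

```r
# Hand-copied rows for color J from the question (hypothetical stand-in for `example`)
example_small <- data.frame(
  color = c("J", "J", "J", "J", "J"),
  cut   = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
  freq  = c(119, 307, 678, 808, 896),
  stringsAsFactors = FALSE
)

threshold <- 500  # assumed cutoff

# Relabel rows below the threshold, then sum freq within each (color, cut) pair
lumped <- transform(example_small,
                    cut = ifelse(freq < threshold, "(Other)", cut))
result <- aggregate(freq ~ color + cut, data = lumped, FUN = sum)
print(result)
```

The Fair and Good rows (119 + 307) collapse into a single (Other) row of 426, as in the ideal output in the question. With the real example table, where cut is a factor, as.character(cut) would be needed inside ifelse().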

Answer 1 (score: 1)

Try this:

library(plyr)
library(ggplot2)
example <- data.frame(count(diamonds,c('color', 'cut')))


# Flag the rows where frequency is lower than some threshold (logical vector)
idx <- example$freq < 1000

# Create a helper function that adds the level "Other" to a vector
add_other_level <- function(x){
  levels(x) <- c(levels(x), "Other")
  x
}

# Change the factor levels for the rows below the threshold
example <- within(example, 
       {
         color <- add_other_level(color)
         color[idx] <- "Other"
         cut <- add_other_level(cut)
         cut[idx]    <- "Other"
       }
)

# Create a plot
ggplot(example, aes(x = color, y = freq, fill = cut)) + 
  geom_bar(stat = "identity")

[Image: the resulting stacked bar chart]
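
As a small illustration of why the helper adds the level first, consider this sketch on a toy factor (the vector f is made up for illustration). Assigning a value that is not among a factor's levels produces NA with a warning, so "Other" has to be registered as a level before the assignment.

```r
# Same helper as above: append "Other" to the levels of a factor
add_other_level <- function(x){
  levels(x) <- c(levels(x), "Other")
  x
}

f <- factor(c("Fair", "Good", "Ideal"))  # toy factor, hypothetical data

# Without the extra level, the assignment yields NA (and a warning)
g <- f
suppressWarnings(g[1] <- "Other")

# With the level added first, the assignment works as intended
h <- add_other_level(f)
h[1] <- "Other"

print(g)
print(h)
```

Note that the code in this answer relabels both color and cut on the low-frequency rows, so those rows end up in a separate "Other" bar rather than inside their original color.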