Creating an "Other" field

Time: 2014-05-19 05:09:15

Tags: r dplyr

Right now, I have the following data.frame, created by original.df %.% group_by(Category) %.% tally() %.% arrange(desc(n)).

DF <- structure(list(Category = c("E", "K", "M", "L", "I", "A", 
"S", "G", "N", "Q"), n = c(163051, 127133, 106680, 64868, 49701, 
47387, 47096, 45601, 40056, 36882)), .Names = c("Category", 
"n"), row.names = c(NA, 10L), class = c("tbl_df", "tbl", "data.frame"
))

         Category      n
1               E 163051
2               K 127133
3               M 106680
4               L  64868
5               I  49701
6               A  47387
7               S  47096
8               G  45601
9               N  40056
10              Q  36882

I would like to create an "Other" field from the lowest-ranked categories, i.e.:

        Category      n
1              E 163051
2              K 127133
3              M 106680
4              L  64868
5              I  49701
6          Other 217022

Right now, I am doing

rbind(filter(DF, rank(rev(n)) <= 5), 
  summarise(filter(DF, rank(rev(n)) > 5), Category = "Other", n = sum(n)))

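which collapses all categories not in the top 5 into an Other category.

But I'm curious whether dplyr or some other existing package offers a better way. By "better" I mean more succinct/readable. I'm also interested in cleverer or more flexible ways of choosing which categories go into Other.

3 Answers:

Answer 0: (score: 8)

Here is another approach, assuming each category (at least each of the top 5) appears only once: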

DF %.%
  arrange(desc(n)) %.%       #you could skip this step since you arranged the input df already according to your question
  mutate(Category = ifelse(1:n() > 5, "Other", Category)) %.%
  group_by(Category) %.%
  summarize(n = sum(n))

#  Category      n
#1        E 163051
#2        I  49701
#3        K 127133
#4        L  64868
#5        M 106680
#6    Other 217022

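Edit:

I just noticed that my output is no longer ordered by decreasing n. After running the code again, I found that the order is kept until after group_by(Category), but once I run summarize afterwards, the order is gone (or rather, the result seems to be ordered by Category). Is it supposed to be like that?

Here are three more ways: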
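A minimal sketch of one way to get the descending order back while keeping "Other" as the last row. This is an illustration in current dplyr syntax (`%>%` instead of the older `%.%` used in this thread), not necessarily one of the variants the answer had in mind; `top_m` and `result` are hypothetical names introduced here.

```r
library(dplyr)

# Same counts as in the question, rebuilt as a plain data.frame
DF <- data.frame(
  Category = c("E", "K", "M", "L", "I", "A", "S", "G", "N", "Q"),
  n = c(163051, 127133, 106680, 64868, 49701,
        47387, 47096, 45601, 40056, 36882),
  stringsAsFactors = FALSE
)

top_m <- 5  # hypothetical: number of categories to keep as-is

result <- DF %>%
  arrange(desc(n)) %>%
  # relabel every row below the top_m largest counts
  mutate(Category = ifelse(row_number() > top_m, "Other", Category)) %>%
  group_by(Category) %>%
  summarise(n = sum(n)) %>%
  # summarise() returns groups sorted by Category, so sort again:
  # non-"Other" rows first (FALSE sorts before TRUE), then by decreasing n
  arrange(Category == "Other", desc(n))

result
```

Sorting on the logical `Category == "Other"` first keeps the collapsed row at the bottom even though its total is larger than any single category.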

Answer 1: (score: 5)

A different package / a different syntax:

library(data.table)

dt = as.data.table(DF)

dt[order(-n), # your data is already sorted, so this does nothing for it
   if (.BY[[1]]) .SD else list("Other", sum(n)),
   by = 1:nrow(dt) <= 5][, !"nrow", with = FALSE]
#   Category      n
#1:        E 163051
#2:        K 127133
#3:        M 106680
#4:        L  64868
#5:        I  49701
#6:    Other 217022

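Answer 2: (score: 1)

This function modifies a column, replacing infrequent entries with Other, either by specifying a minimum frequency or by specifying the desired number of categories.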

#' @title Group infrequent entries into 'Other category'
#' @description Useful when you want to constrain the number of unique values in a column.
#' @param .data Data containing variable.
#' @param var Variable containing infrequent entries, to be collapsed into "Other". 
#' @param n Threshold for total number of categories above "Other".
#' @param count Threshold for total count of observations before "Other".
#' @param by Extra variables to group by when calculating \code{n} or \code{count}.
#' @param copy Should \code{.data} be copied? Currently only \code{TRUE} is supported.
#' @param other.category Value that infrequent entries are to be collapsed into. Defaults to \code{"Other"}.
#' @return \code{.data} but with \code{var} changed to be grouped into smaller categories.
#' @export 
mutate_other <- function(.data, var, n = 5, count, by = NULL, copy = TRUE, other.category = "Other"){
  stopifnot(is.data.table(.data), 
            is.character(other.category), 
            identical(length(other.category), 1L))

  had.key <- haskey(.data)

  if (!isTRUE(copy)){
    stop("copy must be TRUE")
  }

  out <- copy(.data)

  if (had.key){
    orig_key <- key(out)
  } else {
    orig_key <- "_order"
    out[, "_order" := 1:.N]
    setkeyv(out, "_order")
  }

  if (is.character(.data[[var]])){
    stopifnot(!("nvar" %in% names(.data)),
              var %in% names(.data))

    N <- .rank <- NULL
    n_by_var <-
      out %>%
      .[, .N, keyby = c(var, by)] %>%
      .[, .rank := rank(-N)]

    out <- merge(out, n_by_var, by = c(var, by))

    if (missing(count)){
      out[, (var) := dplyr::if_else(.rank <= n, out[[var]], other.category)]
    } else {
      out[, (var) := dplyr::if_else(N >= count, out[[var]], other.category)]
    }
    out <- 
      out %>%
      .[, N := NULL] %>%
      .[, .rank := NULL] 

    setkeyv(out, orig_key)

    if (!had.key){
      out[, (orig_key) := NULL]
      setkey(out, NULL)
    }
    out

  } else {
    warning("Attempted to use by = on a non-character vector. Aborting.")
    return(.data)
  }
}

https://github.com/HughParsonage/hutils/blob/master/R/mutate_other.R
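A short usage sketch for the function above, assuming the installed hutils package exports `mutate_other()` with the arguments listed in the roxygen block. The toy table `dt` and the names `top2` and `min2` are made up for illustration.

```r
library(data.table)
library(hutils)  # assumption: provides mutate_other() as listed above

# Toy data (hypothetical): one row per observation, not pre-tallied counts
dt <- data.table(Category = rep(c("a", "b", "c", "d"), times = c(5, 3, 2, 1)))

# Keep the 2 most frequent categories; everything else becomes "Other"
top2 <- mutate_other(dt, "Category", n = 2)
top2[, .N, by = Category]

# Alternatively, keep every category that has at least 2 rows
min2 <- mutate_other(dt, "Category", count = 2)
min2[, .N, by = Category]
```

Because the function counts rows with `.N`, it fits data that still has one row per observation (like original.df in the question) rather than the already tallied DF.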