Right now, I have the following data.frame, which was created by
original.df %.% group_by(Category) %.% tally() %.% arrange(desc(n)).
DF <- structure(list(Category = c("E", "K", "M", "L", "I", "A",
"S", "G", "N", "Q"), n = c(163051, 127133, 106680, 64868, 49701,
47387, 47096, 45601, 40056, 36882)), .Names = c("Category",
"n"), row.names = c(NA, 10L), class = c("tbl_df", "tbl", "data.frame"
))
Category n
1 E 163051
2 K 127133
3 M 106680
4 L 64868
5 I 49701
6 A 47387
7 S 47096
8 G 45601
9 N 40056
10 Q 36882
I would like to create an "Other" row from the lowest-ranked categories, i.e.
Category n
1 E 163051
2 K 127133
3 M 106680
4 L 64868
5 I 49701
6 Other 217022
Right now, I am doing
rbind(filter(DF, rank(rev(n)) <= 5),
summarise(filter(DF, rank(rev(n)) > 5), Category = "Other", n = sum(n)))
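which collapses all categories outside the top 5 into an Other category.
But I am curious whether there is a better way in dplyr or some other existing package. By "better" I mean more succinct/readable. I am also interested in cleverer or more flexible ways of choosing which categories go into Other.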
Answer 0: (score: 8)
Here is another approach, assuming each category (at least among the top 5) occurs only once:
DF %.%
  arrange(desc(n)) %.%  # you could skip this step, since the input in your question is already sorted
  mutate(Category = ifelse(1:n() > 5, "Other", Category)) %.%
  group_by(Category) %.%
  summarize(n = sum(n))
# Category n
#1 E 163051
#2 I 49701
#3 K 127133
#4 L 64868
#5 M 106680
#6 Other 217022
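Edit:
I just noticed that my output is no longer ordered by decreasing n. After running the code again, I found that the order is kept until after group_by(Category), but when I run summarize afterwards, the order is gone (or rather, the result seems to be ordered by Category). Is it supposed to be like that?
Here are three other ways: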
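Returning to the ordering question raised in the edit: as far as I know, summarise() on a grouped data frame returns one row per group, sorted by the grouping variable, so a final arrange() is needed to get the decreasing order back. The following is only a sketch under some assumptions of mine, not part of the answer above: it uses the newer %>% pipe instead of %.%, rebuilds DF as a plain data frame, and introduces the hypothetical name top_k for the number of categories to keep.

```r
library(dplyr)

# Same counts as in the question, rebuilt as a plain data frame.
DF <- data.frame(
  Category = c("E", "K", "M", "L", "I", "A", "S", "G", "N", "Q"),
  n = c(163051, 127133, 106680, 64868, 49701,
        47387, 47096, 45601, 40056, 36882),
  stringsAsFactors = FALSE
)

top_k <- 5  # hypothetical name: how many categories to keep

result <- DF %>%
  arrange(desc(n)) %>%
  mutate(Category = ifelse(row_number() > top_k, "Other", Category)) %>%
  group_by(Category) %>%
  summarise(n = sum(n)) %>%
  # FALSE sorts before TRUE, so "Other" goes last;
  # the remaining rows are sorted by decreasing n.
  arrange(Category == "Other", desc(n))

print(result)
```

Sorting on Category == "Other" first keeps the collapsed row at the bottom even when its total is larger than some of the top categories, which is the layout asked for in the question.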
Answer 1: (score: 5)
A different package / different syntax version:
library(data.table)
dt = as.data.table(DF)
dt[order(-n), # your data is already sorted, so this does nothing for it
if (.BY[[1]]) .SD else list("Other", sum(n)),
by = 1:nrow(dt) <= 5][, !"nrow", with = F]
# Category n
#1: E 163051
#2: K 127133
#3: M 106680
#4: L 64868
#5: I 49701
#6: Other 217022
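If the single grouped expression above is hard to read, the same result can probably be reached in two explicit steps. This is a minimal sketch of my own, not the answerer's code: it sorts the table, keeps the first top_k rows, sums the rest into one "Other" row, and binds the two pieces. The name top_k is hypothetical.

```r
library(data.table)

# Same counts as in the question.
dt <- data.table(
  Category = c("E", "K", "M", "L", "I", "A", "S", "G", "N", "Q"),
  n = c(163051, 127133, 106680, 64868, 49701,
        47387, 47096, 45601, 40056, 36882)
)

top_k <- 5  # hypothetical name: how many categories to keep

setorder(dt, -n)  # sort by decreasing n, by reference

res <- rbind(
  dt[seq_len(top_k)],                                      # the top rows, unchanged
  dt[-seq_len(top_k), .(Category = "Other", n = sum(n))]   # everything else, summed
)

print(res)
```

One caveat of this sketch: when the table has top_k rows or fewer, the second piece still produces an "Other" row with n equal to 0, so that case would need a separate check.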
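Answer 2: (score: 1)
This function modifies a column, replacing infrequent entries with Other, either by specifying a minimum frequency or by specifying the desired number of categories.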
#' @title Group infrequent entries into 'Other category'
#' @description Useful when you want to constrain the number of unique values in a column.
#' @param .data Data containing variable.
#' @param var Variable containing infrequent entries, to be collapsed into "Other".
#' @param n Threshold for total number of categories above "Other".
#' @param count Threshold for total count of observations before "Other".
#' @param by Extra variables to group by when calculating \code{n} or \code{count}.
#' @param copy Should \code{.data} be copied? Currently only \code{TRUE} is supported.
#' @param other.category Value that infrequent entries are to be collapsed into. Defaults to \code{"Other"}.
#' @return \code{.data} but with \code{var} changed to be grouped into smaller categories.
#' @export
mutate_other <- function(.data, var, n = 5, count, by = NULL, copy = TRUE, other.category = "Other"){
  stopifnot(is.data.table(.data),
            is.character(other.category),
            identical(length(other.category), 1L))
  had.key <- haskey(.data)
  if (!isTRUE(copy)){
    stop("copy must be TRUE")
  }
  out <- copy(.data)
  if (had.key){
    orig_key <- key(out)
  } else {
    orig_key <- "_order"
    out[, "_order" := 1:.N]
    setkeyv(out, "_order")
  }
  if (is.character(.data[[var]])){
    stopifnot(!("nvar" %in% names(.data)),
              var %in% names(.data))
    N <- .rank <- NULL
    n_by_var <-
      out %>%
      .[, .N, keyby = c(var, by)] %>%
      .[, .rank := rank(-N)]
    out <- merge(out, n_by_var, by = c(var, by))
    if (missing(count)){
      out[, (var) := dplyr::if_else(.rank <= n, out[[var]], other.category)]
    } else {
      out[, (var) := dplyr::if_else(N >= count, out[[var]], other.category)]
    }
    out <-
      out %>%
      .[, N := NULL] %>%
      .[, .rank := NULL]
    setkeyv(out, orig_key)
    if (!had.key){
      out[, (orig_key) := NULL]
      setkey(out, NULL)
    }
    out
  } else {
    warning("Attempted to use by = on a non-character vector. Aborting.")
    return(.data)
  }
}
https://github.com/HughParsonage/hutils/blob/master/R/mutate_other.R
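A hedged usage sketch, assuming the hutils package from CRAN is installed and that its exported mutate_other() still accepts the arguments shown above. Judging from the code, the function counts how often each value occurs, so it expects one row per observation rather than the pre-tallied counts in the question (where every category occurs exactly once and all ranks would tie). The data below is made up for illustration.

```r
library(data.table)
library(hutils)  # assumption: installed from CRAN

# One row per observation (made-up data), not pre-tallied counts:
# "A" occurs 5 times, "B" 3 times, "C" and "D" once each.
obs <- data.table(Category = rep(c("A", "B", "C", "D"), times = c(5, 3, 1, 1)))

# Keep the 2 most frequent categories; collapse the rest into "Other".
lumped <- mutate_other(obs, "Category", n = 2)

# Tally the result to see the collapsed categories.
print(lumped[, .N, by = Category])
```

Passing count instead of n switches to the minimum-frequency rule, for example count = 3 would keep only values that occur at least three times.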