Question

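I want to aggregate a data frame in base R while also adding a new column (N) that counts the number of rows for each value of the grouping variable.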

This is simple in dplyr:

library(dplyr)
data(iris)

combined_summary <- iris %>% group_by(Species) %>% group_by(N=n(), add=TRUE) %>% summarize_all(mean)

> combined_summary
# A tibble: 3 x 6
# Groups:   Species [3]
  Species        N Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>      <int>        <dbl>       <dbl>        <dbl>       <dbl>
1 setosa        50         5.01        3.43         1.46       0.246
2 versicolor    50         5.94        2.77         4.26       1.33 
3 virginica     50         6.59        2.97         5.55       2.03

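Unfortunately, I have to write this code in an environment where packages are not allowed (don't ask; it's not my decision). So I need a way to do this in base R.

I can do it in base R as follows: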

# First create the aggregated tables separately
summary_means <- aggregate(. ~ Species, data=iris, FUN=mean)
summary_count <- aggregate(Sepal.Length ~ Species, data=iris[, c("Species", "Sepal.Length")], FUN=length)

> summary_means
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

> summary_count
     Species Sepal.Length
1     setosa           50
2 versicolor           50
3  virginica           50

# Then rename the count column
colnames(summary_count)[2] <- "N"

> summary_count
     Species  N
1     setosa 50
2 versicolor 50
3  virginica 50

# Finally merge the two dataframes
combined_summary_baseR <- merge(x=summary_count, y=summary_means, by="Species", all.x=TRUE)

> combined_summary_baseR
     Species  N Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa 50        5.006       3.428        1.462       0.246
2 versicolor 50        5.936       2.770        4.260       1.326
3  virginica 50        6.588       2.974        5.552       2.026

Is there a more efficient way to do this in base R?

Answer 1

Here is a base R option that does the aggregation with a single by call:

do.call(rbind, by(
    iris[-ncol(iris)], iris[ncol(iris)], function(x) c(N = nrow(x), colMeans(x))))
#            N Sepal.Length Sepal.Width Petal.Length Petal.Width
#setosa     50        5.006       3.428        1.462       0.246
#versicolor 50        5.936       2.770        4.260       1.326
#virginica  50        6.588       2.974        5.552       2.026

Using colMeans ensures that the column names are carried over, which avoids an extra setNames call.
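To make the naming point concrete, here is a minimal sketch of what the anonymous function returns for one group, and what rbind then builds from those pieces. The variable names (setosa_rows, one_row, res) are only illustrative and not part of the original answer.

```r
# Rows of a single group, with the grouping column (the last one) dropped
setosa_rows <- iris[iris$Species == "setosa", -ncol(iris)]

# colMeans() returns a named numeric vector, so c() keeps those names
# and only the count needs an explicit name
one_row <- c(N = nrow(setosa_rows), colMeans(setosa_rows))
names(one_row)

# by() produces one such named vector per group; rbind stacks them into
# a numeric matrix whose row names are the group labels
res <- do.call(rbind, by(
    iris[-ncol(iris)], iris[ncol(iris)], function(x) c(N = nrow(x), colMeans(x))))
is.matrix(res)
rownames(res)
```

Note that the result is a matrix, not a data frame, so the group labels live only in the row names.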

Update

In response to your comment: getting the row names as a separate column requires an extra step.

d <- do.call(rbind, by(
    iris[-ncol(iris)], iris[ncol(iris)], function(x) c(N = nrow(x), colMeans(x))))
cbind(Species = rownames(d), as.data.frame(d))

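This is not as concise as the initial by call. I think we have a clash of philosophies here. In dplyr (and the tidyverse), row names are generally avoided, in keeping with the principles of "tidy data". In base R, row names are common and are (more or less) consistently carried through data operations. So in a way you are asking for a mix of dplyr (tidy) and base R data structure concepts, which may not be the best or most robust approach.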
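If the grouping variable should be an ordinary column from the start, one possible variant is to build a one-row data frame per group and stack those, so row names never carry the labels. This is only a sketch, not part of the original answer: it assumes the grouping column is the last one and that all remaining columns are numeric, and the names groups, rows and result are made up for illustration.

```r
# Split the numeric columns by the grouping variable (assumed to be the last column)
groups <- split(iris[-ncol(iris)], iris[[ncol(iris)]])

# Build a one-row data frame per group: label, row count, then the column means.
# t() turns the named vector from colMeans() into a 1 x k matrix, which
# data.frame() expands into k named columns.
rows <- lapply(names(groups), function(g) {
    data.frame(Species = g, N = nrow(groups[[g]]), t(colMeans(groups[[g]])))
})

# Stack the per-group rows into a single data frame
result <- do.call(rbind, rows)
result
```

Here Species ends up as a character column rather than a factor (under the default stringsAsFactors = FALSE in R 4.0 and later); wrap it in factor() if the original type matters.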
