Using summarize_at in dplyr with additional statistics

Time: 2017-04-24 18:00:32

Tags: r dplyr

Is there a way to add additional statistics to a summarize_at call? For example,

iris %>% group_by(Species) %>% summarise_at(vars(), funs(mean, sd))

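computes the mean and standard deviation of the 4 columns (8 columns in total). Suppose I also want to know how many rows are in each group, i.e., something like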

# Below is not valid syntax 
iris %>% 
  group_by(Species) %>% 
  summarise_at(vars(), funs(mean, sd)) + summarise(n())

Since the above does not work, the kludge

iris %>% group_by(Species) %>% summarise_at(vars(), funs(mean, sd, length))

actually produces 4 copies of the count column.

Perhaps this is beyond what summarize_at and friends can conveniently handle?

3 answers:

Answer 0 (score: 10)

How about this:

iris %>% 
    group_by(Species) %>% 
    mutate(Count = n()) %>%
    group_by(Species, Count) %>%
    summarize_at(vars(), funs(mean, sd))
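The trick above works because Count is constant within each Species group, so adding it to the grouping keys leaves the groups unchanged while carrying the count through the summary. As a hedged alternative that is not part of the original answer: in later dplyr releases funs() is deprecated, and across() can be combined with n() in a single summarise() call. A minimal sketch, assuming dplyr >= 1.0.0; the name iris_stats is only illustrative:

```r
library(dplyr)

# Assumes dplyr >= 1.0.0, where across() and where() are available.
iris_stats <- iris %>%
  group_by(Species) %>%
  summarise(
    # mean and sd of every numeric column, named <column>_<function>
    across(where(is.numeric), list(mean = mean, sd = sd)),
    # row count per group; computed last so that where(is.numeric)
    # does not also pick up the newly created count column
    n = n(),
    .groups = "drop"
  )

iris_stats
```

Placing n() after across() matters here: a count created earlier in the same summarise() call is itself numeric and would be summarised along with the other columns.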

Answer 1 (score: 2)

We can use data.table to do this in a more flexible way

library(data.table)
as.data.table(iris)[, c(n = .N, unlist(lapply(.SD, function(x) 
    list(Mean=mean(x), SD=sd(x))), recursive = FALSE)), .(Species)]
# Species  n Sepal.Length.Mean Sepal.Length.SD Sepal.Width.Mean Sepal.Width.SD Petal.Length.Mean Petal.Length.SD Petal.Width.Mean
#1:     setosa 50             5.006       0.3524897            3.428      0.3790644             1.462       0.1736640            0.246
#2: versicolor 50             5.936       0.5161711            2.770      0.3137983             4.260       0.4699110            1.326
#3:  virginica 50             6.588       0.6358796            2.974      0.3224966             5.552       0.5518947            2.026
#   Petal.Width.SD
#1:      0.1053856
#2:      0.1977527
#3:      0.2746501

Or with dplyr, we may need to do a join

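library(dplyr)

# Per-group mean and sd of every non-grouping column
iris1 <- iris %>%
             group_by(Species) %>% 
             summarise_all(funs(mean, sd))

# Per-group row count, joined to the statistics on the grouping column
iris %>% 
     group_by(Species) %>% 
     summarise(n = n()) %>%
     full_join(iris1, by = "Species")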

Or use bind_cols

iris %>%
 group_by(Species) %>% 
 summarise_all(funs(mean, sd)) %>% bind_cols(., iris %>% count(Species) %>% select(-Species))
# A tibble: 3 × 10
#     Species Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean Sepal.Length_sd Sepal.Width_sd Petal.Length_sd Petal.Width_sd     n
#      <fctr>             <dbl>            <dbl>             <dbl>            <dbl>           <dbl>          <dbl>           <dbl>          <dbl> <int>
#1     setosa             5.006            3.428             1.462            0.246       0.3524897      0.3790644       0.1736640      0.1053856    50
#2 versicolor             5.936            2.770             4.260            1.326       0.5161711      0.3137983       0.4699110      0.1977527    50
#3  virginica             6.588            2.974             5.552            2.026       0.6358796      0.3224966       0.5518947      0.2746501    50

Answer 2 (score: 1)

Specify the columns to apply the statistics to:

iris %>% 
    group_by(Species) %>% 
    mutate(Count = n()) %>% 
    group_by(Species, Count) %>% 
    summarize_at(vars(Sepal.Length), funs(mean, sd)) -> dt_stat
dt_stat

The same pattern can be applied to all columns whose names start with "Sepal".
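A minimal sketch of the "starts with" variant, offered as an assumption about what this answer intended rather than its original code. It assumes the tidyselect helper starts_with() inside vars(), and a dplyr version (0.8.0 or later) in which summarise_at() accepts a named list of functions in place of the deprecated funs(). The name sepal_stat is hypothetical:

```r
library(dplyr)

# Assumption: starts_with("Sepal") selects Sepal.Length and Sepal.Width.
sepal_stat <- iris %>%
  group_by(Species) %>%
  mutate(Count = n()) %>%        # group size, repeated on every row
  group_by(Species, Count) %>%   # keep the count as a grouping key
  summarise_at(vars(starts_with("Sepal")), list(mean = mean, sd = sd))

sepal_stat
```

Because Count is part of the grouping, it survives the summary as an ordinary column next to the four Sepal statistics.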