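Is there a way to add an extra statistic to a summarise_at call? For example,
iris %>% group_by(Species) %>% summarise_at(vars(Sepal.Length:Petal.Width), funs(mean, sd))
computes the mean and standard deviation of the 4 measurement columns (8 columns in total). Suppose I also want to know how many rows are in each group, i.e. something like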
# Below is not valid syntax
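iris %>%
  group_by(Species) %>%
  summarise_at(vars(Sepal.Length:Petal.Width), funs(mean, sd)) + summarise(n())
Given that the above does not work, the kludge
iris %>% group_by(Species) %>% summarise_at(vars(Sepal.Length:Petal.Width), funs(mean, sd, length))
actually produces 4 copies of the count column.
Perhaps this is beyond what summarise_at and friends can conveniently handle?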
Answer 0 (score: 10)
How about this:
iris %>%
  group_by(Species) %>%
  mutate(Count = n()) %>%
  group_by(Species, Count) %>%
  summarize_at(vars(Sepal.Length:Petal.Width), funs(mean, sd))
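This works because Count takes a single value within each Species, so grouping by both columns leaves the groups unchanged and simply carries the count through as a grouping column. A quick check, as a sketch (the name with_count is only illustrative):

```r
library(dplyr)

# Attach the per-group row count to every row of its group
with_count <- iris %>%
  group_by(Species) %>%
  mutate(Count = n())

# Count is constant within each Species, so adding it to the
# grouping does not split any group: both calls report 3 groups.
n_groups(group_by(with_count, Species))
n_groups(group_by(with_count, Species, Count))
```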
Answer 1 (score: 2)
We can use data.table to do this in a more flexible way:
library(data.table)
as.data.table(iris)[, c(n = .N, unlist(lapply(.SD, function(x)
list(Mean=mean(x), SD=sd(x))), recursive = FALSE)), .(Species)]
# Species n Sepal.Length.Mean Sepal.Length.SD Sepal.Width.Mean Sepal.Width.SD Petal.Length.Mean Petal.Length.SD Petal.Width.Mean
#1: setosa 50 5.006 0.3524897 3.428 0.3790644 1.462 0.1736640 0.246
#2: versicolor 50 5.936 0.5161711 2.770 0.3137983 4.260 0.4699110 1.326
#3: virginica 50 6.588 0.6358796 2.974 0.3224966 5.552 0.5518947 2.026
# Petal.Width.SD
#1: 0.1053856
#2: 0.1977527
#3: 0.2746501
Or, with dplyr, we may need to do a join:
library(dplyr)
iris1 <- iris %>%
  group_by(Species) %>%
  summarise_all(funs(mean, sd))
iris %>%
  group_by(Species) %>%
  summarise(n = n()) %>%
  full_join(iris1, by = "Species")
Or bind_cols:
iris %>%
  group_by(Species) %>%
  summarise_all(funs(mean, sd)) %>%
  bind_cols(iris %>% count(Species) %>% select(-Species))
# A tibble: 3 × 10
# Species Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean Sepal.Length_sd Sepal.Width_sd Petal.Length_sd Petal.Width_sd n
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#1 setosa 5.006 3.428 1.462 0.246 0.3524897 0.3790644 0.1736640 0.1053856 50
#2 versicolor 5.936 2.770 4.260 1.326 0.5161711 0.3137983 0.4699110 0.1977527 50
#3 virginica 6.588 2.974 5.552 2.026 0.6358796 0.3224966 0.5518947 0.2746501 50
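With dplyr 1.0.0 or later (an assumption about the installed version; the answers here predate it), the count and the per-column statistics can probably be computed in a single summarise() call via across(). A minimal sketch, where the name res is only illustrative:

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise(
    # mean and sd for each of the four measurement columns
    across(Sepal.Length:Petal.Width, list(mean = mean, sd = sd)),
    # a single count column per group
    n = n()
  )
res
```

The result should have one row per species, with columns named like Sepal.Length_mean and Sepal.Length_sd plus a single n column. Selecting the columns by name rather than by type keeps n itself from being summarised.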
Answer 2 (score: 1)
Specify the column to apply the statistics to:
iris %>%
  group_by(Species) %>%
  mutate(Count = n()) %>%
  group_by(Species, Count) %>%
  summarize_at(vars(Sepal.Length), funs(mean, sd)) -> dt_stat
dt_stat
Or apply them to all columns whose names start with "Sepal".