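I would like to know whether there is a way to use summarise (dplyr 0.1.2) with functions that return multiple values (for example, the describe function from the psych package).
If not, is that only because it has not been implemented yet, or is there a reason why it would not be a good idea?
Example: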
require(psych)
require(ggplot2)
require(dplyr)
dgrp <- group_by(diamonds, cut)
describe(dgrp$price)
summarise(dgrp, describe(price))
This produces: Error: expecting a single value
Answer 0 (score: 39)
With dplyr >= 0.2, we can use the do function:
library(ggplot2)
library(psych)
library(dplyr)
diamonds %>%
  group_by(cut) %>%
  do(describe(.$price)) %>%
  select(-vars)
#> Source: local data frame [5 x 13]
#> Groups: cut [5]
#>
#> cut n mean sd median trimmed mad min max range skew kurtosis se
#> (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
#> 1 Fair 1610 4358.758 3560.387 3282.0 3695.648 2183.128 337 18574 18237 1.780213 3.067175 88.73281
#> 2 Good 4906 3928.864 3681.590 3050.5 3251.506 2853.264 327 18788 18461 1.721943 3.042550 52.56197
#> 3 Very Good 12082 3981.760 3935.862 2648.0 3243.217 2855.488 336 18818 18482 1.595341 2.235873 35.80721
#> 4 Premium 13791 4584.258 4349.205 3185.0 3822.231 3371.432 326 18823 18497 1.333358 1.072295 37.03497
#> 5 Ideal 21551 3457.542 3808.401 1810.0 2656.136 1630.860 326 18806 18480 1.835587 2.977425 25.94233
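As a minimal sketch of the same pattern without psych: do() accepts any expression that returns a data frame, and . stands for the rows of the current group. The built-in mtcars data and the column names n and mean_mpg are illustrative choices only, not part of the original answer.

```r
library(dplyr)

# For each cyl group, build a one-row data frame by hand.
# Inside do(), `.` refers to the data of the current group.
res_do <- mtcars %>%
  group_by(cyl) %>%
  do(data.frame(n = nrow(.), mean_mpg = mean(.$mpg)))

res_do
```

The result keeps the grouping column and adds one column per field of the returned data frame, just as describe() does above.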
A solution based on the purrr package (slice_rows() and by_slice() have since been moved to the purrrlyr package):
library(ggplot2)
library(psych)
library(purrr)
diamonds %>%
  slice_rows("cut") %>%
  by_slice(~ describe(.x$price), .collate = "rows")
#> Source: local data frame [5 x 14]
#>
#> cut vars n mean sd median trimmed mad min max range skew kurtosis se
#> (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
#> 1 Fair 1 1610 4358.758 3560.387 3282.0 3695.648 2183.128 337 18574 18237 1.780213 3.067175 88.73281
#> 2 Good 1 4906 3928.864 3681.590 3050.5 3251.506 2853.264 327 18788 18461 1.721943 3.042550 52.56197
#> 3 Very Good 1 12082 3981.760 3935.862 2648.0 3243.217 2855.488 336 18818 18482 1.595341 2.235873 35.80721
#> 4 Premium 1 13791 4584.258 4349.205 3185.0 3822.231 3371.432 326 18823 18497 1.333358 1.072295 37.03497
#> 5 Ideal 1 21551 3457.542 3808.401 1810.0 2656.136 1630.860 326 18806 18480 1.835587 2.977425 25.94233
Or, more simply, with data.table:
library(data.table)
as.data.table(diamonds)[, describe(price), by = cut]
#> cut vars n mean sd median trimmed mad min max range skew kurtosis se
#> 1: Ideal 1 21551 3457.542 3808.401 1810.0 2656.136 1630.860 326 18806 18480 1.835587 2.977425 25.94233
#> 2: Premium 1 13791 4584.258 4349.205 3185.0 3822.231 3371.432 326 18823 18497 1.333358 1.072295 37.03497
#> 3: Good 1 4906 3928.864 3681.590 3050.5 3251.506 2853.264 327 18788 18461 1.721943 3.042550 52.56197
#> 4: Very Good 1 12082 3981.760 3935.862 2648.0 3243.217 2855.488 336 18818 18482 1.595341 2.235873 35.80721
#> 5: Fair 1 1610 4358.758 3560.387 3282.0 3695.648 2183.128 337 18574 18237 1.780213 3.067175 88.73281
We can also write our own summary function that returns a list:
fun <- function(x) {
  list(n = length(x),
       min = min(x),
       median = as.numeric(median(x)),
       mean = mean(x),
       sd = sd(x),
       max = max(x))
}
as.data.table(diamonds)[, fun(price), by = cut]
#> cut n min median mean sd max
#> 1: Ideal 21551 326 1810.0 3457.542 3808.401 18806
#> 2: Premium 13791 326 3185.0 4584.258 4349.205 18823
#> 3: Good 4906 327 3050.5 3928.864 3681.590 18788
#> 4: Very Good 12082 336 2648.0 3981.760 3935.862 18818
#> 5: Fair 1610 337 3282.0 4358.758 3560.387 18574
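The same list-returning function can also be reused from dplyr. The sketch below assumes dplyr 1.0.0 or later, where summarise() spreads an unnamed data frame result into separate columns. It uses the built-in mtcars data instead of diamonds so that it runs without ggplot2, and the name res_list is only illustrative.

```r
library(dplyr)

# Same summary function as above: returns a named list of scalars.
fun <- function(x) {
  list(n = length(x),
       min = min(x),
       median = as.numeric(median(x)),
       mean = mean(x),
       sd = sd(x),
       max = max(x))
}

# as.data.frame() turns the list into a one-row data frame.
# Because the result is not given a name inside summarise(),
# its columns are spliced directly into the output.
res_list <- mtcars %>%
  group_by(cyl) %>%
  summarise(as.data.frame(fun(mpg)))

res_list
```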
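Answer 1 (score: 1)
In recent versions of the tidyverse (dplyr 1.0.0 and later), this is possible.
First, in the example you provided, the function returns a one-row data frame. If we use such a function inside summarize(), it produces a data frame column, which we can turn into separate columns with unpack().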
library(tidyverse)
library(psych)
describe(diamonds$price)
#> vars n mean sd median trimmed mad min max range skew
#> X1 1 53940 3932.8 3989.44 2401 3158.99 2475.94 326 18823 18497 1.62
#> kurtosis se
#> X1 2.18 17.18
diamonds %>%
  group_by(cut) %>%
  summarize(descr = describe(price)) %>%
  unpack(cols = descr)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 5 x 14
#> cut vars n mean sd median trimmed mad min max range skew
#> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Fair 1 1610 4359. 3560. 3282 3696. 2183. 337 18574 18237 1.78
#> 2 Good 1 4906 3929. 3682. 3050. 3252. 2853. 327 18788 18461 1.72
#> 3 Very… 1 12082 3982. 3936. 2648 3243. 2855. 336 18818 18482 1.60
#> 4 Prem… 1 13791 4584. 4349. 3185 3822. 3371. 326 18823 18497 1.33
#> 5 Ideal 1 21551 3458. 3808. 1810 2656. 1631. 326 18806 18480 1.84
#> # … with 2 more variables: kurtosis <dbl>, se <dbl>
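To see the data frame column more directly, here is a small sketch that does not depend on psych. The helper my_summary() is hypothetical (it is not part of any package) and simply returns a one-row data frame; mtcars stands in for diamonds. It assumes dplyr 1.0.0 or later and tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: returns a one-row data frame of statistics.
my_summary <- function(x) {
  data.frame(n = length(x), mean = mean(x), sd = sd(x))
}

# Step 1: `stats` is a single column that itself holds a data frame.
packed <- mtcars %>%
  group_by(cyl) %>%
  summarize(stats = my_summary(mpg))

# Step 2: unpack() promotes the inner columns to top-level columns.
unpacked <- unpack(packed, cols = stats)

names(packed)
names(unpacked)
```

Before unpacking there are only two columns (cyl and stats); afterwards the inner columns n, mean and sd sit next to cyl.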
Second, in some cases a function simply returns a vector. In that case, summarize() produces one new row for each value returned.
set.seed(1234)
dsmall <- diamonds[sample(nrow(diamonds), 25), ]
unique(dsmall$clarity)
#> [1] I1 SI2 VVS2 VS1 VVS1 VS2 SI1 IF
#> Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF
dsmall %>%
  group_by(cut) %>%
  summarize(clarity = unique(clarity))
#> `summarise()` regrouping output by 'cut' (override with `.groups` argument)
#> # A tibble: 17 x 2
#> # Groups: cut [4]
#> cut clarity
#> <ord> <ord>
#> 1 Good I1
#> 2 Good SI2
#> 3 Good VS1
#> 4 Good SI1
#> 5 Very Good VVS2
#> 6 Very Good SI2
#> 7 Very Good VS1
#> 8 Very Good IF
#> 9 Premium SI2
#> 10 Premium SI1
#> 11 Ideal VS1
#> 12 Ideal VVS1
#> 13 Ideal VS2
#> 14 Ideal VVS2
#> 15 Ideal SI1
#> 16 Ideal SI2
#> 17 Ideal IF
Created on 2020-07-14 by the reprex package (v0.3.0)
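One caveat for the vector case: from dplyr 1.1.0 onward, returning more than one row per group from summarize() is deprecated, and reframe() is the function intended for it. The following is only a sketch under that assumption (dplyr >= 1.1.0 installed), using mtcars and illustrative column names:

```r
library(dplyr)

# range() returns two values per group, so each group contributes
# two rows; `stat` labels which row is the minimum and which the maximum.
rng <- mtcars %>%
  group_by(cyl) %>%
  reframe(stat = c("min", "max"), mpg = range(mpg))

rng
```

Unlike summarize(), reframe() always returns an ungrouped data frame.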