Using dplyr window functions to calculate percentiles

Time: 2015-05-27 16:38:38

Tags: r dplyr tidyr

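I have a working solution, but I am looking for a cleaner, more readable one that perhaps takes advantage of some of the newer dplyr window functions.

Using the mtcars dataset, if I want to look at the 25th, 50th, and 75th percentiles, plus the mean and count, of miles per gallon ("mpg") by number of cylinders ("cyl"), I use the following code: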

library(dplyr)
library(tidyr)

# load data
data("mtcars")

# Percentiles used in calculation
p <- c(.25,.5,.75)

# old dplyr solution 
mtcars %>% group_by(cyl) %>% 
  do(data.frame(p=p, stats=quantile(.$mpg, probs=p), 
                n = length(.$mpg), avg = mean(.$mpg))) %>%
  spread(p, stats) %>%
  select(1, 4:6, 3, 2)

# note: the select and spread statements are just to get the data into
#       the format in which I'd like to see it, but are not critical

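Is there a way to do this more cleanly with dplyr using some of the summary functions (n_tiles, percent_rank, etc.)? By cleanly, I mean without the "do" statement.

Thanks

10 answers:

Answer 0: (score: 55)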

Update 2: One more update that turns the previous version's summarise() call into a one-liner using enframe:

library(tidyverse)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(mpg = list(enframe(quantile(mpg, probs=c(0.25,0.5,0.75)),
                               name="quantiles", value="mpg"))) %>% 
  unnest(mpg)  # name the list-column explicitly; newer tidyr warns otherwise

    cyl quantiles   mpg
1     4       25% 22.80
2     4       50% 26.00
3     4       75% 30.40
4     6       25% 18.65
5     6       50% 19.70
6     6       75% 21.00
7     8       25% 14.40
8     8       50% 15.20
9     8       75% 16.25

Using tidyeval, this can be turned into a more general function:

q_by_group = function(data, value.col, ..., probs=seq(0,1,0.25)) {

  value.col=enquo(value.col)
  groups=enquos(...)

  data %>% 
    group_by(!!!groups) %>% 
    summarise(mpg = list(enframe(quantile(!!value.col, probs=probs)))) %>% 
    unnest
}

q_by_group(mtcars, mpg)
q_by_group(mtcars, mpg, cyl)
q_by_group(mtcars, mpg, cyl, vs, probs=c(0.5,0.75))
q_by_group(iris, Petal.Width, Species)

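Update: Here is a variation on @JuliaSilge's answer that uses nesting to get the quantiles but does not use map. It does, however, need an extra line of code to add a column listing the quantile levels, because I am not sure how (or whether it is possible) to capture the names of the quantiles in a separate column directly from the call to quantile.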

p = c(0.25,0.5,0.75)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(quantiles = list(sprintf("%1.0f%%", p*100)),
            mpg = list(quantile(mpg, p))) %>% 
  unnest

Original answer

Here is a dplyr approach that avoids do but requires a separate call to quantile for each quantile value:

mtcars %>% group_by(cyl) %>%
  summarise(`25%`=quantile(mpg, probs=0.25),
            `50%`=quantile(mpg, probs=0.5),
            `75%`=quantile(mpg, probs=0.75),
            avg=mean(mpg),
            n=n())

  cyl   25%  50%   75%      avg  n
1   4 22.80 26.0 30.40 26.66364 11
2   6 18.65 19.7 21.00 19.74286  7
3   8 14.40 15.2 16.25 15.10000 14

It would be better if summarise could return multiple values from a single call to quantile, but this appears to be an open issue in dplyr development.

Answer 1: (score: 29)
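For readers on newer dplyr, here is a minimal sketch, assuming dplyr 1.1.0 or newer (where reframe() is available) and the tibble package for enframe(). Under that assumption a single quantile() call can return several rows per group. The column names quantile and mpg are chosen here for illustration only.

```r
library(dplyr)   # reframe() assumes dplyr >= 1.1.0
library(tibble)  # enframe()

p <- c(0.25, 0.5, 0.75)

# One quantile() call per group; the unnamed data frame returned by
# enframe() is unpacked into the columns `quantile` and `mpg`.
q_long <- mtcars %>%
  group_by(cyl) %>%
  reframe(enframe(quantile(mpg, probs = p), name = "quantile", value = "mpg"))

q_long
```

The result is ungrouped and has one row per cyl and quantile level, so it can be reshaped to the wide layout afterwards if needed.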

If you are up for using purrr::map, you can do it like this!

library(tidyverse)

mtcars %>%
  tbl_df() %>%
  nest(-cyl) %>%
  mutate(Quantiles = map(data, ~ quantile(.$mpg)),
         Quantiles = map(Quantiles, ~ bind_rows(.) %>% gather())) %>% 
  unnest(Quantiles)

#> # A tibble: 15 x 3
#>      cyl key   value
#>    <dbl> <chr> <dbl>
#>  1     6 0%     17.8
#>  2     6 25%    18.6
#>  3     6 50%    19.7
#>  4     6 75%    21  
#>  5     6 100%   21.4
#>  6     4 0%     21.4
#>  7     4 25%    22.8
#>  8     4 50%    26  
#>  9     4 75%    30.4
#> 10     4 100%   33.9
#> 11     8 0%     10.4
#> 12     8 25%    14.4
#> 13     8 50%    15.2
#> 14     8 75%    16.2
#> 15     8 100%   19.2

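Created on 2018-11-10 by the reprex package (v0.2.1)

One nice thing about this approach is that the output is tidy, with one observation per row.

Answer 2: (score: 15)

Here is a dplyr approach that uses the tidy() function from the broom package. Unfortunately it still requires do(), but it is much simpler.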

library(dplyr)
library(broom)

mtcars %>%
    group_by(cyl) %>%
    do( tidy(t(quantile(.$mpg))) )

Which gives:

    cyl   X0.  X25.  X50.  X75. X100.
  (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1     4  21.4 22.80  26.0 30.40  33.9
2     6  17.8 18.65  19.7 21.00  21.4
3     8  10.4 14.40  15.2 16.25  19.2

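Note the use of t(), since the broom package does not have a method for named numeric vectors.

This is based on my earlier answer for summary().

Answer 3: (score: 10)

I am not sure how to avoid do() in dplyr, but you can do this with c() and as.list() in data.table in a pretty straightforward manner: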

require(data.table) 
as.data.table(mtcars)[, c(as.list(quantile(mpg, probs=p)), 
                        avg=mean(mpg), n=.N), by=cyl]
#    cyl   25%  50%   75%      avg  n
# 1:   6 18.65 19.7 21.00 19.74286  7
# 2:   4 22.80 26.0 30.40 26.66364 11
# 3:   8 14.40 15.2 16.25 15.10000 14

Replace by with keyby if you want the result to be ordered by the cyl column.

Answer 4: (score: 3)

This solution uses only dplyr and tidyr, lets you specify your quantiles within the dplyr chain, and takes advantage of tidyr::crossing() to "stack" multiple copies of the dataset before grouping and summarising.
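As a short sketch of that substitution (same call as above, only by swapped for keyby; p is redefined here so the snippet is self-contained and res is just an illustrative name):

```r
library(data.table)

p <- c(0.25, 0.5, 0.75)

# keyby groups like `by`, but also sorts the result by the grouping column
res <- as.data.table(mtcars)[, c(as.list(quantile(mpg, probs = p)),
                                 avg = mean(mpg), n = .N), keyby = cyl]
res
```

The rows now come back in cyl order (4, 6, 8) instead of in order of first appearance.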

library(dplyr)    # provides %>%
library(ggplot2)  # provides the diamonds dataset

diamonds %>%  # Initial data
  tidyr::crossing(pctile = 0:4/4) %>%  # Specify quantiles; crossing() is like expand.grid()
  dplyr::group_by(cut, pctile) %>%  # Indicate your grouping var, plus your quantile var
  dplyr::summarise(quantile_value = quantile(price, unique(pctile))) %>%  # unique() is needed
  dplyr::mutate(pctile = sprintf("%1.0f%%", pctile*100))  # Optional prettification

Result:

# A tibble: 25 x 3
# Groups:   cut [5]
         cut pctile quantile_value
       <ord>  <chr>          <dbl>
 1      Fair     0%         337.00
 2      Fair    25%        2050.25
 3      Fair    50%        3282.00
 4      Fair    75%        5205.50
 5      Fair   100%       18574.00
 6      Good     0%         327.00
 7      Good    25%        1145.00
 8      Good    50%        3050.50
 9      Good    75%        5028.00
10      Good   100%       18788.00
11 Very Good     0%         336.00
12 Very Good    25%         912.00
13 Very Good    50%        2648.00
14 Very Good    75%        5372.75
15 Very Good   100%       18818.00
16   Premium     0%         326.00
17   Premium    25%        1046.00
18   Premium    50%        3185.00
19   Premium    75%        6296.00
20   Premium   100%       18823.00
21     Ideal     0%         326.00
22     Ideal    25%         878.00
23     Ideal    50%        1810.00
24     Ideal    75%        4678.50
25     Ideal   100%       18806.00

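The unique() is necessary to let dplyr::summarise() know that you only want one value per group.

Answer 5: (score: 1)

Here is a solution that uses a combination of dplyr, purrr, and rlang: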

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)

# load data
data("mtcars")

# Percentiles used in calculation
p <- c(.25,.5,.75)

p_names <- paste0(p*100, "%")
p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>% 
  set_names(nm = p_names)

# dplyr/purrr/rlang solution 
mtcars %>% 
  group_by(cyl) %>% 
  summarize_at(vars(mpg), funs(!!!p_funs))
#> # A tibble: 3 x 4
#>     cyl `25%` `50%` `75%`
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     4  22.8  26    30.4
#> 2     6  18.6  19.7  21  
#> 3     8  14.4  15.2  16.2


#Especially useful if you want to summarize more variables
mtcars %>% 
  group_by(cyl) %>% 
  summarize_at(vars(mpg, drat), funs(!!!p_funs))
#> # A tibble: 3 x 7
#>     cyl `mpg_25%` `drat_25%` `mpg_50%` `drat_50%` `mpg_75%` `drat_75%`
#>   <dbl>     <dbl>      <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
#> 1     4      22.8       3.81      26         4.08      30.4       4.16
#> 2     6      18.6       3.35      19.7       3.9       21         3.91
#> 3     8      14.4       3.07      15.2       3.12      16.2       3.22

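Created on 2018-10-01 by the reprex package (v0.2.0).

Edit (2019-04-17):

As of dplyr 0.8.0, the funs function has been deprecated in favor of using list to pass the desired functions to the scoped dplyr functions. As a result, the implementation above becomes slightly more straightforward: we no longer need to worry about unquoting the functions with !!!. See the reprex below: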

library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)

# load data
data("mtcars")

# Percentiles used in calculation
p <- c(.25,.5,.75)

p_names <- paste0(p*100, "%")
p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>% 
  set_names(nm = p_names)

# dplyr/purrr/rlang solution 
mtcars %>% 
  group_by(cyl) %>% 
  summarize_at(vars(mpg), p_funs)
#> # A tibble: 3 x 4
#>     cyl `25%` `50%` `75%`
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     4  22.8  26    30.4
#> 2     6  18.6  19.7  21  
#> 3     8  14.4  15.2  16.2


#Especially useful if you want to summarize more variables
mtcars %>% 
  group_by(cyl) %>% 
  summarize_at(vars(mpg, drat), p_funs)
#> # A tibble: 3 x 7
#>     cyl `mpg_25%` `drat_25%` `mpg_50%` `drat_50%` `mpg_75%` `drat_75%`
#>   <dbl>     <dbl>      <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
#> 1     4      22.8       3.81      26         4.08      30.4       4.16
#> 2     6      18.6       3.35      19.7       3.9       21         3.91
#> 3     8      14.4       3.07      15.2       3.12      16.2       3.22

Created on 2019-04-17 by the reprex package (v0.2.0).

Answer 6: (score: 0)

Here is a fairly readable solution that uses dplyr and purrr to return quantiles in a tidy format:

Code
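In dplyr 1.0.0 and newer, scoped verbs such as summarize_at() are superseded by across(). The following is a sketch of the same idea under that assumption, not part of the original answer. It uses plain closures instead of purrr::partial(), and make_q is a hypothetical helper name introduced only for this example.

```r
library(dplyr)  # across() assumes dplyr >= 1.0.0

p <- c(0.25, 0.5, 0.75)

# make_q() is a hypothetical helper: it returns a function that computes
# a single quantile of its argument.
make_q <- function(prob) {
  force(prob)  # fix the probability now rather than at first call
  function(x) quantile(x, probs = prob, na.rm = TRUE, names = FALSE)
}

p_funs <- lapply(p, make_q)
names(p_funs) <- paste0(p * 100, "%")

res <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, drat), p_funs))

res
```

With the default naming scheme, the output columns are called variable_function, for example mpg_25% and drat_75%.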

library(dplyr)
library(purrr)

mtcars %>% 
    group_by(cyl) %>% 
    do({x <- .$mpg
        map_dfr(.x = c(.25, .5, .75),
                .f = ~ data_frame(Quantile = .x,
                                  Value = quantile(x, probs = .x)))
       })

Result

# A tibble: 9 x 3
# Groups:   cyl [3]
    cyl Quantile Value
  <dbl>    <dbl> <dbl>
1     4     0.25 22.80
2     4     0.50 26.00
3     4     0.75 30.40
4     6     0.25 18.65
5     6     0.50 19.70
6     6     0.75 21.00
7     8     0.25 14.40
8     8     0.50 15.20
9     8     0.75 16.25

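Answer 7: (score: 0)

do() is in fact the right idiom, as it is designed for group-wise transformations. Think of it as an lapply() that maps over the groups of a data frame. (For such a specialized function, a generic name like "do" is not ideal, but it is probably too late to change it.)

Conceptually, within each cyl group, you want to apply quantile() to the mpg column: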

library(dplyr)

p <- c(.2, .5, .75)

mtcars %>% 
  group_by(cyl) %>%
  do(quantile(.$mpg, p))

#> Error: Results 1, 2, 3 must be data frames, not numeric

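This does not work because quantile() does not return a data frame; you must convert its output explicitly. Since this change amounts to wrapping quantile() with a data frame, you can use the gestalt function composition operator %>>>%: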

library(gestalt)
library(tibble)

quantile_tbl <- quantile %>>>% enframe("quantile")

mtcars %>% 
  group_by(cyl) %>%
  do(quantile_tbl(.$mpg, p))

#> # A tibble: 9 x 3
#> # Groups:   cyl [3]
#>     cyl quantile value
#>   <dbl> <chr>    <dbl>
#> 1     4 20%       22.8
#> 2     4 50%       26  
#> 3     4 75%       30.4
#> 4     6 20%       18.3
#> 5     6 50%       19.7
#> 6     6 75%       21  
#> 7     8 20%       13.9
#> 8     8 50%       15.2
#> 9     8 75%       16.2
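
The lapply() analogy from this answer can be sketched in base R alone. This is only an illustration of the idea, not how do() is implemented; the names by_cyl, pieces, and res are made up for the example, and the probabilities are the ones from the question.

```r
p <- c(0.25, 0.5, 0.75)

# Split the data frame into one piece per cyl value
by_cyl <- split(mtcars, mtcars$cyl)

# Map over the groups, returning a small data frame for each one
pieces <- lapply(names(by_cyl), function(g) {
  q <- quantile(by_cyl[[g]]$mpg, probs = p)
  data.frame(cyl = as.numeric(g), quantile = names(q), value = unname(q))
})

# Stack the per-group results back into a single data frame
res <- do.call(rbind, pieces)
res
```

As with do(), each group has to yield a data frame, which is why the quantile() output is wrapped explicitly.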

Answer 8: (score: 0)

This has been answered in many different ways. For what I wanted to do, dplyr's distinct() made the difference.

mtcars %>%
   select(cyl, mpg) %>%
   group_by(cyl) %>%
   mutate( qnt_0   = quantile(mpg, probs= 0),
           qnt_25  = quantile(mpg, probs= 0.25),
           qnt_50  = quantile(mpg, probs= 0.5),
           qnt_75  = quantile(mpg, probs= 0.75),
           qnt_100 = quantile(mpg, probs= 1),
              mean = mean(mpg),
                sd = sd(mpg)
          ) %>%
   distinct(qnt_0 ,qnt_25 ,qnt_50 ,qnt_75 ,qnt_100 ,mean ,sd)

Which gives

# A tibble: 3 x 8
# Groups:   cyl [3]
  qnt_0 qnt_25 qnt_50 qnt_75 qnt_100  mean    sd   cyl
  <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl>
1  17.8   18.6   19.7   21      21.4  19.7  1.45     6
2  21.4   22.8   26     30.4    33.9  26.7  4.51     4
3  10.4   14.4   15.2   16.2    19.2  15.1  2.56     8

Answer 9: (score: 0)

Yet another way to achieve this, using unnest_wider / unnest_longer:

    mtcars %>%
       group_by(cyl) %>%
       summarise(quants = list(quantile(mpg, probs = c(.01, .1, .25, .5, .75, .90,.99)))) %>%
       unnest_wider(quants)

If you want to do this for multiple variables, you can gather before grouping:

mtcars %>%
   gather(key = 'metric', value = 'value', -cyl) %>%
   group_by(cyl, metric) %>%
   summarise(quants = list(quantile(value, probs = c(.01, .1, .25, .5, .75, .90,.99)))) %>%
  unnest_wider(quants)
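
The unnest_longer counterpart is not shown above. Here is a minimal sketch of it, assuming tidyr 1.0.0 or newer. The probabilities are reduced to three for brevity, and the column name quantile passed to indices_to is an illustrative choice.

```r
library(dplyr)
library(tidyr)  # unnest_longer() assumes tidyr >= 1.0.0

res <- mtcars %>%
  group_by(cyl) %>%
  summarise(quants = list(quantile(mpg, probs = c(.25, .5, .75)))) %>%
  # one row per quantile; the vector names ("25%", ...) go to `quantile`
  unnest_longer(quants, indices_to = "quantile")

res
```

This gives the long layout (one row per group and quantile level), whereas unnest_wider spreads the levels across columns.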