Using t.test in a dplyr summarise after grouping

Time: 2018-10-01 09:59:58

Tags: r dplyr

library(dplyr)
library(ggplot2)
library(magrittr)

diamonds %>% 
  group_by(cut) %>% 
  summarise(price_avg = t.test(
    . %>% filter(color == "E") %$% price,
    . %>% filter(color == "I") %$% price )$p.value)

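I am trying to get the result of a t.test by group. In this example, I want to find out whether price differs significantly between colors E and I within the same cut. The result I get is: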

Error in summarise_impl(.data, dots) : 
Evaluation error: is.atomic(x) is not TRUE.

2 answers:

Answer 0: (score: 2)

library(tidyverse)
library(magrittr)

diamonds %>% 
  group_by(cut) %>% 
  summarise(price_avg = t.test(price[color=="E"], price[color=="I"])$p.value)

# # A tibble: 5 x 2
#   cut       price_avg
#   <ord>         <dbl>
# 1 Fair       3.90e- 3
# 2 Good       1.46e-12
# 3 Very Good  2.44e-39
# 4 Premium    7.27e-52
# 5 Ideal      7.63e-62

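The problem with your solution is that . does not receive the subset of the data set for the current group, but the whole data set. You can check this by running: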

diamonds %>% 
  group_by(cut) %>% 
  summarise(d = list(.))

# # A tibble: 5 x 2
#     cut       d                     
#     <ord>     <list>                
#   1 Fair      <tibble [53,940 x 10]>
#   2 Good      <tibble [53,940 x 10]>
#   3 Very Good <tibble [53,940 x 10]>
#   4 Premium   <tibble [53,940 x 10]>
#   5 Ideal     <tibble [53,940 x 10]>

An alternative solution is this:

diamonds %>% 
  nest(-cut) %>%
  mutate(price_avg = map_dbl(data, ~t.test(
                                      .x %>% filter(color == "E") %$% price,
                                      .x %>% filter(color == "I") %$% price )$p.value))

# # A tibble: 5 x 3
#   cut       data                  price_avg
#   <ord>     <list>                    <dbl>
# 1 Ideal     <tibble [21,551 x 9]>  7.63e-62
# 2 Premium   <tibble [13,791 x 9]>  7.27e-52
# 3 Good      <tibble [4,906 x 9]>   1.46e-12
# 4 Very Good <tibble [12,082 x 9]>  2.44e-39
# 5 Fair      <tibble [1,610 x 9]>   3.90e- 3

This approach works with filter because each time you pass the appropriate subset of the data (i.e. the column data) to filter.
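For comparison, the same split-then-test idea can be sketched in base R without any tidyverse functions. This is an illustration only and not part of the original answer: it assumes the built-in ToothGrowth data set as a stand-in for diamonds (so that it runs without ggplot2), splits it by dose, and compares len between the two supp levels within each piece.

```r
# Illustrative sketch (assumption: built-in ToothGrowth stands in for diamonds).
# split() hands each group's rows to the function, so the subsetting inside
# t.test() only ever sees one group at a time.
by_dose <- split(ToothGrowth, ToothGrowth$dose)

p_values <- vapply(
  by_dose,
  function(d) t.test(d$len[d$supp == "OJ"], d$len[d$supp == "VC"])$p.value,
  numeric(1)
)

p_values  # named numeric vector, one p-value per dose level
```

As with nest() plus map_dbl(), the point is that the function is applied to one group's rows at a time, rather than relying on . inside summarise.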

Answer 1: (score: 2)

There must be a better way. I would probably use Antonios' approach, but I was tempted not to use filter and instead spread the prices of the different colours out into list columns. Unfortunately, the best code I could come up with as a result is even longer:

diamonds %>%
    group_by(cut, color) %>%
    summarize(price = list(price)) %>%
    spread(color, price) %>%
    nest() %>%
    mutate(price_avg = map_dbl(data, ~ t.test(.x$E[[1L]], .x$I[[1L]])$p.value))

The idea here is to get two list columns, E and I, holding the prices of the diamonds of the respective colour. We can now run the t-test on these two columns (but unfortunately we need to unlist them for that to work).

I am mostly putting this here as a conversation starter. This is clearly not code you would ever want to write, but I believe there should be a short, logical way of expressing this logic (either it is already possible and I am overlooking it, or the tidy data API needs to be enhanced).

Alternatively, we can use the formula API for t.test:

diamonds %>%
    filter(color %in% c('E', 'I')) %>%
    nest(-cut) %>%
    mutate(price_avg = map_dbl(data, ~ t.test(price ~ color, .x)$p.value))

For completeness, here is the same using broom::tidy (which returns more columns than just the p-value):

diamonds %>%
    filter(color %in% c('E', 'I')) %>%
    nest(-cut) %>%
    mutate(test = map(data, ~ broom::tidy(t.test(price ~ color, .x)))) %>%
    unnest(test)

The result is a table with one row per cut and the columns returned by tidy.