library(dplyr)
library(ggplot2)
library(magrittr)
diamonds %>%
group_by(cut) %>%
summarise(price_avg = t.test(
. %>% filter(color == "E") %$% price,
. %>% filter(color == "I") %$% price )$p.value)
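I am trying to get the result of a t.test by group. In this example, I want to find out whether price differs significantly between colors E and I within the same cut. The result I get is: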
Error in summarise_impl(.data, dots) :
Evaluation error: is.atomic(x) is not TRUE.
Answer 0 (score: 2)
library(tidyverse)
library(magrittr)
diamonds %>%
group_by(cut) %>%
summarise(price_avg = t.test(price[color=="E"], price[color=="I"])$p.value)
# # A tibble: 5 x 2
# cut price_avg
# <ord> <dbl>
# 1 Fair 3.90e- 3
# 2 Good 1.46e-12
# 3 Very Good 2.44e-39
# 4 Premium 7.27e-52
# 5 Ideal 7.63e-62
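The problem with your solution is that the dot (.) does not receive the subset of the dataset defined by the grouping, but the whole dataset. You can check this with: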
diamonds %>%
group_by(cut) %>%
summarise(d = list(.))
# # A tibble: 5 x 2
# cut d
# <ord> <list>
# 1 Fair <tibble [53,940 x 10]>
# 2 Good <tibble [53,940 x 10]>
# 3 Very Good <tibble [53,940 x 10]>
# 4 Premium <tibble [53,940 x 10]>
# 5 Ideal <tibble [53,940 x 10]>
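To make the contrast explicit, the following is a minimal sketch (not part of the original answer) that puts a bare column name and the dot side by side within the same summarise call. The names group_check, n_group and n_dot are made up for illustration. It assumes dplyr and ggplot2 are installed, the latter only for the diamonds dataset.

```r
library(dplyr)
library(ggplot2)  # only needed for the diamonds dataset

# Inside summarise(), a bare column name such as `price` is evaluated
# per group, whereas `.` is the whole data frame that was piped in.
# `group_check`, `n_group` and `n_dot` are illustrative names.
group_check <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n_group = length(price),  # number of rows in the current group only
    n_dot   = nrow(.)         # number of rows in the full dataset, same for every group
  )

print(group_check)
```

The n_group column should differ by cut, while n_dot repeats the size of the full dataset in every row. This is why indexing the column directly, as in price[color == "E"], respects the grouping.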
An alternative solution is this:
diamonds %>%
nest(-cut) %>%
mutate(price_avg = map_dbl(data, ~t.test(
.x %>% filter(color == "E") %$% price,
.x %>% filter(color == "I") %$% price )$p.value))
# # A tibble: 5 x 3
# cut data price_avg
# <ord> <list> <dbl>
# 1 Ideal <tibble [21,551 x 9]> 7.63e-62
# 2 Premium <tibble [13,791 x 9]> 7.27e-52
# 3 Good <tibble [4,906 x 9]> 1.46e-12
# 4 Very Good <tibble [12,082 x 9]> 2.44e-39
# 5 Fair <tibble [1,610 x 9]> 3.90e- 3
This approach works with filter because each time you pass the appropriate subset of the data (i.e. the data column) to filter.
Answer 1 (score: 2)
There must be a better way. I would probably go with Antonios' approach, but I was tempted to avoid filter and instead spread the prices for the different colors into list columns. Unfortunately, the best code I could come up with is even longer:
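diamonds %>%
group_by(cut, color) %>%
summarize(price = list(price)) %>%
spread(color, price) %>%
nest() %>%
mutate(price_avg = map_dbl(data, ~ t.test(.x$E[[1L]], .x$I[[1L]])$p.value))
The idea here is to get two list columns, E and I, holding the prices of the diamonds of the respective color. We can now run a t-test on these two columns (though unfortunately we need to unlist them first for this to work).
I am mostly posting this as a conversation starter. Obviously this is not code you would ever want to write, but I believe there should be a short, logical way of expressing this logic (either it is already possible and I am overlooking it, or the tidy data API needs to be enhanced).
Alternatively, we can use the formula API of t.test:
diamonds %>%
filter(color %in% c('E', 'I')) %>%
nest(-cut) %>%
mutate(price_avg = map_dbl(data, ~ t.test(price ~ color, .x)$p.value))
For completeness, here is the same using broom::tidy (which returns more columns than just the p-value):
diamonds %>%
filter(color %in% c('E', 'I')) %>%
nest(-cut) %>%
mutate(test = map(data, ~ broom::tidy(t.test(price ~ color, .x)))) %>%
unnest(test)
The result is a table with one row per cut, containing the other test statistics alongside the p-value.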