我试图找到一种比较t检验的优雅方法,比较6组数据的均值,最好使用dplyr / tidyverse。我的数据看起来类似于:
分组变量数字变量
A 5.6
A 2.3
A 4.8
B 7.3
B 6.9
B 5.8
C 1.4
C 6.4
我知道我可以做类似的事情:
df_a <- df %>% filter(grouping_variable == 'A')
df_b <- df %>% filter(grouping_variable == 'B')
a_b <- t.test(df_a,df_b)$p.value
,然后对每个变量组合重复该操作。分组变量只有6个,因此上面的内容不是不可能的,但是必须有一种更简单的方法:
df %>% group_by(grouping_variable)%>%
t.test(of each on each)
也许有些整洁?
我的最终结果是沿着
A B C D E F
.34 .4 .235 ...
B .03 .34 .454 ...
答案 0 :(得分:1)
可以使用purrr中的cross
和map
函数来干净地完成此操作。
样本数据:
df <- tibble(group_var = rep(c("A", "B", "C"), times = 5),
num_var = rnorm(15))
df
# A tibble: 15 x 2
group_var num_var
<chr> <dbl>
1 A 1.66
2 B -0.694
3 C -0.680
4 A 1.96
5 B -0.380
6 C -0.941
7 A 1.02
8 B 0.0476
9 C 0.770
10 A 1.41
11 B 0.137
12 C -0.816
13 A -0.478
14 B 0.374
15 C -0.619
使用cross
创建具有所有变量组合的数据框:
test_results <- cross_df(list(var1 = c("A", "B", "C"), var2 = c("A", "B", "C")))
添加带有ttest结果的列:
test_results <- test_results %>%
mutate(ttest = map2_dbl(var1, var2,
~ t.test(df %>% filter(group_var == .x) %>% .$num_var,
df %>% filter(group_var == .y) %>% .$num_var)$p.value))
test_results %>%
spread(var2, ttest)
var1 A B C
<chr> <dbl> <dbl> <dbl>
1 A 1 0.0436 0.0197
2 B 0.0436 1 0.367
3 C 0.0197 0.367 1
如果将t.test包裹在函数中,这会更容易阅读:
ttester <- function(v1, v2) {
t <- t.test(df %>% filter(group_var == v1) %>% .$num_var,
df %>% filter(group_var == v2) %>% .$num_var)
t$p.value
}
cross_df(list(var1 = c("A", "B", "C"), var2 = c("A", "B", "C"))) %>%
mutate(ttest = map2_dbl(var1, var2, ~ttester(.x, .y))) %>%
spread(var2, ttest)
var1 A B C
<chr> <dbl> <dbl> <dbl>
1 A 1 0.0436 0.0197
2 B 0.0436 1 0.367
3 C 0.0197 0.367 1
答案 1 :(得分:0)
检查此解决方案:
library(tidyverse)
library(magrittr)
df %$%
crossing(
gr1 = grouping_variable %>% unique(),
gr2 = grouping_variable %>% unique()
) %>%
filter(gr1 != gr2) %>%
left_join(
df %>%
group_by(grouping_variable) %>%
nest() %>%
rename_all(~c('gr1', 'data1'))
) %>%
left_join(
df %>%
group_by(grouping_variable) %>%
nest() %>%
rename_all(~c('gr2', 'data2'))
) %>%
mutate(p_val = map2_dbl(
data1, data2,
~t.test(
.x$numerical_variable,
.y$numerical_variable
)$p.value
)
)
答案 2 :(得分:0)
首先,一些数据:
df <-
data_frame(
Group = rep(LETTERS[1:8], each = 10)
, Value = rnorm(80)
)
由此,我将获得唯一的组级别:
my_groups <-
sort(unique(df$Group))
然后,我喜欢使用lapply
遍历感兴趣的指标。基本上,对于每对组,我都运行t检验并将感兴趣的指标(组均值,差异,p值)记录为data_frame
,然后将行绑定在一起。请注意,我使用%$%
中的magrittr
运算符作为从t.test
结果中获取指标的捷径。
t_tests_out <-
lapply(my_groups, function(group_a){
lapply(my_groups, function(group_b){
# Skip case where a and b are the same
if(group_a == group_b){
return(NULL)
}
df %>%
filter(Group %in% c(group_a, group_b)) %>%
mutate(temp_group = ifelse(Group == group_a, "A", "B")) %>%
t.test(Value ~ temp_group, data = .) %$%
data_frame(
group_a = group_a
, group_b = group_b
, mean_a = estimate[1]
, mean_b = estimate[2]
, diff = mean_a - mean_b
, pval = p.value
)
}) %>%
bind_rows()
}) %>%
bind_rows()
这看起来像这样:
# A tibble: 56 x 6
group_a group_b mean_a mean_b diff pval
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 A B -0.275 0.0851 -0.360 0.384
2 A C -0.275 -0.651 0.376 0.406
3 A D -0.275 -0.440 0.165 0.737
4 A E -0.275 0.336 -0.611 0.245
5 A F -0.275 -0.277 0.00233 0.996
6 A G -0.275 -0.115 -0.160 0.754
7 A H -0.275 -0.406 0.131 0.821
8 B A 0.0851 -0.275 0.360 0.384
9 B C 0.0851 -0.651 0.736 0.0748
10 B D 0.0851 -0.440 0.525 0.245
# ... with 46 more rows
虽然长格式对于某些事情确实非常有用,例如绘制结果:
t_tests_out %>%
ggplot(aes(x = group_a
, y = group_b
, fill = pval)) +
geom_tile(col = "white") +
scale_fill_distiller(palette = "YlOrRd"
, limits = c(0,1)) +
theme_minimal()
您还可以分散结果以创建所需的表:
t_tests_out %>%
select(group_a, group_b, pval) %>%
spread(group_b, pval)
返回
# A tibble: 8 x 9
group_a A B C D E F G H
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A NA 0.384 0.406 0.737 0.245 0.996 0.754 0.821
2 B 0.384 NA 0.0748 0.245 0.595 0.439 0.668 0.371
3 C 0.406 0.0748 NA 0.659 0.0632 0.456 0.291 0.668
4 D 0.737 0.245 0.659 NA 0.163 0.762 0.547 0.955
5 E 0.245 0.595 0.0632 0.163 NA 0.280 0.425 0.243
6 F 0.996 0.439 0.456 0.762 0.280 NA 0.770 0.835
7 G 0.754 0.668 0.291 0.547 0.425 0.770 NA 0.640
8 H 0.821 0.371 0.668 0.955 0.243 0.835 0.640 NA
答案 3 :(得分:0)
您正在寻找pairwise.t.test。它允许您提及p值调整方法以及替代假设。有关详细信息,请参见 R 文档。
用法:
pairwise.t.test(x, g, p.adjust.method = p.adjust.methods,
pool.sd = !paired, paired = FALSE,
alternative = c("two.sided", "less", "greater"),
...)
对于您的情况,您可以执行以下操作:
pairwise.ttest <- pairwise.t.test(x = df$num_var, g = df$group_var)
pairwise.ttest$p.value