I have two data frames, as shown below:
a <- structure(list(Bacteria_A = c(12, 23, 45, 32, 34, 0), Bacteria_B = c(23,
12, 33, 44, 55, 3), Bacteria_C = c(25, 10, 50, 38, 3, 34), Group = structure(c(1L,
1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "soil")), class = "data.frame", row.names = c("Sample_1",
"Sample_2", "Sample_3", "Sample_4", "Sample_5", "Sample_6"))
b <- structure(list(Bacteria_A = c(14, 10, 40, 40, 37, 3), Bacteria_B = c(25,
14, 32, 23, 45, 35), Bacteria_C = c(12, 34, 45, 22, 7, 23), Group = structure(c(1L,
1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "water")), class = "data.frame", row.names = c("Sample_1",
"Sample_2", "Sample_3", "Sample_4", "Sample_5", "Sample_6"))
> a
Bacteria_A Bacteria_B Bacteria_C Group
Sample_1 12 23 25 soil
Sample_2 23 12 10 soil
Sample_3 45 33 50 soil
Sample_4 32 44 38 soil
Sample_5 34 55 3 soil
Sample_6 0 3 34 soil
> b
Bacteria_A Bacteria_B Bacteria_C Group
Sample_1 14 25 12 water
Sample_2 10 14 34 water
Sample_3 40 32 45 water
Sample_4 40 23 22 water
Sample_5 37 45 7 water
Sample_6 3 35 23 water
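I want to compare the samples between soil and water.
For example, for Bacteria_A I want to know whether there is a difference between soil and water, and likewise for Bacteria_B and Bacteria_C (I have 900 bacteria). I thought of running a t-test, but I am not sure how to do that with two data frames.
I forgot to mention that not all bacteria are present in both data frames, so a bacterium may be absent from one environment. If a bacterium is found in both environments, its name is exactly the same in both data frames.
The original data frames have 160 samples for 500 bacteria each, and the data are not normally distributed.
Thanks for your help.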
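Answer 0 (score: 1)
First, I would like to mention that there are statistical methods for this comparison that are more appropriate than a t-test. They take the distribution of the counts into account (typically negative binomial). For example, you could look at our DESeq2 package. As for your technical question, I would do it like this: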
for (bac in setdiff(intersect(colnames(a), colnames(b)), "Group")){
  print(t.test(a[,bac], b[,bac]))
}
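With several hundred bacteria, printing every test result is hard to read, so it may help to collect the p-values in a named vector instead. The following is only a sketch of that idea, not part of the original answer: the names `common`, `pvals`, `only_soil` and `only_water` are made up for illustration, and the data are copied from the question with the row names omitted.

```r
# Data copied from the question (row names omitted for brevity)
a <- data.frame(Bacteria_A = c(12, 23, 45, 32, 34, 0),
                Bacteria_B = c(23, 12, 33, 44, 55, 3),
                Bacteria_C = c(25, 10, 50, 38, 3, 34),
                Group = "soil")
b <- data.frame(Bacteria_A = c(14, 10, 40, 40, 37, 3),
                Bacteria_B = c(25, 14, 32, 23, 45, 35),
                Bacteria_C = c(12, 34, 45, 22, 7, 23),
                Group = "water")

# Bacteria present in both data frames, excluding the grouping column
common <- setdiff(intersect(colnames(a), colnames(b)), "Group")

# One Welch t-test p-value per shared bacterium, named by bacterium
pvals <- vapply(common,
                function(bac) t.test(a[[bac]], b[[bac]])$p.value,
                numeric(1))

# Bacteria found in only one environment cannot be compared this way
only_soil  <- setdiff(colnames(a), colnames(b))
only_water <- setdiff(colnames(b), colnames(a))

pvals
```

Swapping `t.test` for another two-sample test only changes the function called inside `vapply`; the column matching stays the same.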
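Answer 1 (score: 0)
This finds the bacteria names that are present in both data frames, then runs t.test on the columns with matching names, giving a list L whose names are the bacteria names. The last line uses tidy to convert L into a data frame. If you prefer a non-parametric test, you can replace t.test with wilcox.test. (Of course, this does not address the issue of performing multiple hypothesis tests; it only covers the computation.)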
Name <- intersect(names(Filter(is.numeric, a)), names(Filter(is.numeric, b)))
L <- Map(t.test, a[Name], b[Name])
library(broom)
cbind(Name, do.call("rbind", lapply(L, tidy)))
The last line gives the following data frame:
Name estimate estimate1 estimate2 statistic p.value
Bacteria_A Bacteria_A 0.3333333 24.33333 24.00000 0.03485781 0.9728799
Bacteria_B Bacteria_B -0.6666667 28.33333 29.00000 -0.07312724 0.9435532
Bacteria_C Bacteria_C 2.8333333 26.66667 23.83333 0.30754940 0.7650662
parameter conf.low conf.high method alternative
Bacteria_A 9.988603 -20.97689 21.64356 Welch Two Sample t-test two.sided
Bacteria_B 7.765869 -21.80026 20.46692 Welch Two Sample t-test two.sided
Bacteria_C 9.492873 -17.84326 23.50993 Welch Two Sample t-test two.sided
LinesA <- "Bacteria_A Bacteria_B Bacteria_C Group
Sample_1 12 23 25 soil
Sample_2 23 12 10 soil
Sample_3 45 33 50 soil
Sample_4 32 44 38 soil
Sample_5 34 55 3 soil
Sample_6 0 3 34 soil"
LinesB <- "Bacteria_A Bacteria_B Bacteria_C Group
Sample_1 14 25 12 water
Sample_2 10 14 34 water
Sample_3 40 32 45 water
Sample_4 40 23 22 water
Sample_5 37 45 7 water
Sample_6 3 35 23 water"
a <- read.table(text = LinesA, as.is = TRUE)
b <- read.table(text = LinesB, as.is = TRUE)
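The multiple-testing caveat mentioned above could be handled by adjusting the p-values afterwards. The sketch below is one possible way, assuming a Benjamini-Hochberg correction via base R's `p.adjust` is acceptable for the analysis; that choice, the use of `wilcox.test`, and the names `pvals` and `res` are assumptions for illustration rather than part of the original answer.

```r
# Data copied from the question (row names omitted for brevity)
a <- data.frame(Bacteria_A = c(12, 23, 45, 32, 34, 0),
                Bacteria_B = c(23, 12, 33, 44, 55, 3),
                Bacteria_C = c(25, 10, 50, 38, 3, 34),
                Group = "soil")
b <- data.frame(Bacteria_A = c(14, 10, 40, 40, 37, 3),
                Bacteria_B = c(25, 14, 32, 23, 45, 35),
                Bacteria_C = c(12, 34, 45, 22, 7, 23),
                Group = "water")

# Numeric columns present in both data frames
Name <- intersect(names(Filter(is.numeric, a)), names(Filter(is.numeric, b)))

# Non-parametric test per bacterium; exact = FALSE requests the normal
# approximation, since ties in the data rule out an exact p-value
pvals <- mapply(function(x, y) wilcox.test(x, y, exact = FALSE)$p.value,
                a[Name], b[Name])

# Raw and Benjamini-Hochberg adjusted p-values side by side
res <- data.frame(Name = Name,
                  p.value = pvals,
                  p.adjusted = p.adjust(pvals, method = "BH"))
res[order(res$p.adjusted), ]
```

Other values of `method` in `p.adjust` (for example `"bonferroni"` or `"holm"`) give more conservative corrections.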
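Answer 2 (score: 0)
Your values do not seem to be normally distributed, or even close to it, so you should stay away from the t-test. If you are not sure what distribution you are dealing with, you can use wilcox.test.
You can easily bind the two data frames together and then convert them to long format before running the appropriate test: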
library(tidyr)
library(dplyr)
bind_rows(a,b) %>%
  pivot_longer(c(Bacteria_A, Bacteria_B, Bacteria_C)) %>%
  group_by(name) %>%
  summarise(mean_soil = mean(value[Group == "soil"]),
            mean_water = mean(value[Group == "water"]),
            pvalue = wilcox.test(value ~ Group)$p.value)
Which gives you:
#> # A tibble: 3 x 4
#> name mean_soil mean_water pvalue
#> <chr> <dbl> <dbl> <dbl>
#> 1 Bacteria_A 24.3 24 0.936
#> 2 Bacteria_B 28.3 29 0.873
#> 3 Bacteria_C 26.7 23.8 0.748
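With hundreds of bacteria, listing every column by name in pivot_longer is impractical, and bacteria missing from one data frame would leave a group with only NA values after binding. A possible variation, offered only as a sketch: keep the columns shared by both data frames and pivot everything except Group. The names `common`, `res` and `padj`, the `exact = FALSE` argument and the Benjamini-Hochberg adjustment are assumptions added for illustration.

```r
library(dplyr)
library(tidyr)

# Data copied from the question (row names omitted for brevity)
a <- data.frame(Bacteria_A = c(12, 23, 45, 32, 34, 0),
                Bacteria_B = c(23, 12, 33, 44, 55, 3),
                Bacteria_C = c(25, 10, 50, 38, 3, 34),
                Group = "soil")
b <- data.frame(Bacteria_A = c(14, 10, 40, 40, 37, 3),
                Bacteria_B = c(25, 14, 32, 23, 45, 35),
                Bacteria_C = c(12, 34, 45, 22, 7, 23),
                Group = "water")

# Columns present in both data frames (this includes "Group")
common <- intersect(names(a), names(b))

res <- bind_rows(a[common], b[common]) %>%
  pivot_longer(-Group) %>%            # every shared bacterium, no hard-coded names
  group_by(name) %>%
  summarise(mean_soil  = mean(value[Group == "soil"]),
            mean_water = mean(value[Group == "water"]),
            pvalue     = wilcox.test(value ~ Group, exact = FALSE)$p.value) %>%
  mutate(padj = p.adjust(pvalue, method = "BH"))  # adjust for multiple tests

res
```

Bacteria that occur in only one environment are dropped by the `intersect` step, so they would need to be reported separately if they matter for the comparison.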