Comparing the effect of environment between two data frames

Time: 2020-01-27 15:03:01

Tags: r

I have two data frames, as shown below:

a <- structure(list(Bacteria_A = c(12, 23, 45, 32, 34, 0), Bacteria_B = c(23, 
12, 33, 44, 55, 3), Bacteria_C = c(25, 10, 50, 38, 3, 34), Group = structure(c(1L, 
1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "soil")), class = "data.frame", row.names = c("Sample_1", 
"Sample_2", "Sample_3", "Sample_4", "Sample_5", "Sample_6"))

b <- structure(list(Bacteria_A = c(14, 10, 40, 40, 37, 3), Bacteria_B = c(25, 
14, 32, 23, 45, 35), Bacteria_C = c(12, 34, 45, 22, 7, 23), Group = structure(c(1L, 
1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "water")), class = "data.frame", row.names = c("Sample_1", 
"Sample_2", "Sample_3", "Sample_4", "Sample_5", "Sample_6"))

> a
         Bacteria_A Bacteria_B Bacteria_C Group
Sample_1         12         23         25  soil
Sample_2         23         12         10  soil
Sample_3         45         33         50  soil
Sample_4         32         44         38  soil
Sample_5         34         55          3  soil
Sample_6          0          3         34  soil
> b
         Bacteria_A Bacteria_B Bacteria_C Group
Sample_1         14         25         12 water
Sample_2         10         14         34 water
Sample_3         40         32         45 water
Sample_4         40         23         22 water
Sample_5         37         45          7 water
Sample_6          3         35         23 water

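I want to compare the samples between soil and water.

For example, for Bacteria_A I want to know whether there is a difference between soil and water. The same goes for Bacteria_B and Bacteria_C (I have 900 bacteria). I thought about doing a t-test, but I'm not sure how to do it with two data frames.

I forgot to mention that not all bacteria are present in both data frames, so a bacterium may be absent from one environment. If a bacterium is found in both environments, its name is exactly the same in both.

The original data frames are 160 samples by 500 bacteria each, and the data are not normally distributed.

Thanks for your help.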

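3 answers:

Answer 0 (score: 1)

First, I'd like to mention that there are statistical methods for this kind of comparison that are more appropriate than a t-test. They take the distribution of the counts into account (usually negative binomial). For example, you could check out the DESeq2 package. As for your technical question, I would do this: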

for (bac in setdiff(intersect(colnames(a), colnames(b)), "Group")){
  print(t.test(a[,bac], b[,bac]))
}
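The loop above only prints each test. As a minimal sketch (not part of the original answer), the same idea can collect the p-values into a named vector and also report which bacteria occur in only one data frame. The names `soil_df`, `water_df`, `shared`, `only_soil`, and `pvals` are made up for illustration, and `Bacteria_C` is deliberately dropped from the water data to mimic the missing-bacteria case described in the question.

```r
# Toy data (hypothetical names): Bacteria_C occurs only in soil
soil_df <- data.frame(Bacteria_A = c(12, 23, 45, 32, 34, 0),
                      Bacteria_B = c(23, 12, 33, 44, 55, 3),
                      Bacteria_C = c(25, 10, 50, 38, 3, 34),
                      Group = "soil")
water_df <- data.frame(Bacteria_A = c(14, 10, 40, 40, 37, 3),
                       Bacteria_B = c(25, 14, 32, 23, 45, 35),
                       Group = "water")

# Bacteria present in both data frames (excluding the grouping column)
shared <- setdiff(intersect(colnames(soil_df), colnames(water_df)), "Group")

# Bacteria that cannot be compared because they occur in soil only
only_soil <- setdiff(colnames(soil_df), colnames(water_df))

# One Welch t-test per shared bacterium; keep only the p-value
pvals <- vapply(shared,
                function(bac) t.test(soil_df[, bac], water_df[, bac])$p.value,
                numeric(1))
pvals
```

Swapping `t.test` for `wilcox.test` in the anonymous function gives the non-parametric version of the same loop.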

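Answer 1 (score: 0)

This finds the bacteria names that are present in both data frames and then runs t.test on the columns with the same name, giving a list L whose names are the bacteria names. The last line uses tidy to convert L to a data frame. If you prefer a non-parametric test, you can replace t.test with wilcox.test. (Of course, this does not address the problem of performing multiple hypothesis tests; it only covers the calculation.)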

Name <- intersect(names(Filter(is.numeric, a)), names(Filter(is.numeric, b)))
L <- Map(t.test, a[Name], b[Name])

library(broom)
cbind(Name, do.call("rbind", lapply(L, tidy)))

The last line gives the following data frame:

                 Name   estimate estimate1 estimate2   statistic   p.value
Bacteria_A Bacteria_A  0.3333333  24.33333  24.00000  0.03485781 0.9728799
Bacteria_B Bacteria_B -0.6666667  28.33333  29.00000 -0.07312724 0.9435532
Bacteria_C Bacteria_C  2.8333333  26.66667  23.83333  0.30754940 0.7650662
           parameter  conf.low conf.high                  method alternative
Bacteria_A  9.988603 -20.97689  21.64356 Welch Two Sample t-test   two.sided
Bacteria_B  7.765869 -21.80026  20.46692 Welch Two Sample t-test   two.sided
Bacteria_C  9.492873 -17.84326  23.50993 Welch Two Sample t-test   two.sided

Note: the input data used above, in reproducible form:

LinesA <- "Bacteria_A Bacteria_B Bacteria_C Group
Sample_1         12         23         25  soil
Sample_2         23         12         10  soil
Sample_3         45         33         50  soil
Sample_4         32         44         38  soil
Sample_5         34         55          3  soil
Sample_6          0          3         34  soil"

LinesB <- "Bacteria_A Bacteria_B Bacteria_C Group
Sample_1         14         25         12 water
Sample_2         10         14         34 water
Sample_3         40         32         45 water
Sample_4         40         23         22 water
Sample_5         37         45          7 water
Sample_6          3         35         23 water"

a <- read.table(text = LinesA, as.is = TRUE)
b <- read.table(text = LinesB, as.is = TRUE)
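This answer notes that it does not deal with multiple hypothesis testing. One common way to handle that, sketched here as an assumption rather than as the answerer's method, is to adjust the p-values with base R's `p.adjust`. The vector `p_raw` below is copied by hand from the p.value column of the t-test table in this answer; `p_raw`, `p_bh`, and `p_bonf` are illustrative names.

```r
# Raw p-values copied from the t-test table in this answer
p_raw <- c(Bacteria_A = 0.9728799,
           Bacteria_B = 0.9435532,
           Bacteria_C = 0.7650662)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
p_bh <- p.adjust(p_raw, method = "BH")

# Bonferroni adjustment (controls the family-wise error rate, more conservative)
p_bonf <- p.adjust(p_raw, method = "bonferroni")

data.frame(p_raw, p_bh, p_bonf)
```

With several hundred bacteria the adjustment matters much more than in this three-row example, since unadjusted tests at the 0.05 level would be expected to flag some bacteria by chance alone.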

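Answer 2 (score: 0)

Your values don't appear to be normally distributed, or close to it, so you should stay away from the t-test. If you're not sure what distribution you're dealing with, you can use wilcox.test.

You can easily bind the two data frames together and convert them to long format before running the appropriate test: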

library(tidyr)
library(dplyr)

bind_rows(a, b) %>%
  pivot_longer(c(Bacteria_A, Bacteria_B, Bacteria_C)) %>%
  group_by(name) %>%
  summarise(mean_soil = mean(value[Group == "soil"]),
            mean_water = mean(value[Group == "water"]),
            pvalue = wilcox.test(value ~ Group)$p.value)

Which gives you

#> # A tibble: 3 x 4
#>   name       mean_soil mean_water pvalue
#>   <chr>          <dbl>      <dbl>  <dbl>
#> 1 Bacteria_A      24.3       24    0.936
#> 2 Bacteria_B      28.3       29    0.873
#> 3 Bacteria_C      26.7       23.8  0.748
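The pipeline above names the three bacteria columns explicitly, which does not scale to hundreds of columns, and a bacterium that is missing from one data frame would leave only one group for its test. A possible variant, offered as a sketch under those assumptions, selects the shared columns programmatically with `all_of()`. The objects `soil`, `water`, `shared`, and `res` are illustrative, `Bacteria_C` is removed from the water data on purpose, and `exact = FALSE` is added only to request the normal approximation, since these toy values contain ties.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical names): Bacteria_C is missing from the water samples
soil <- data.frame(Bacteria_A = c(12, 23, 45, 32, 34, 0),
                   Bacteria_B = c(23, 12, 33, 44, 55, 3),
                   Bacteria_C = c(25, 10, 50, 38, 3, 34),
                   Group = "soil")
water <- data.frame(Bacteria_A = c(14, 10, 40, 40, 37, 3),
                    Bacteria_B = c(25, 14, 32, 23, 45, 35),
                    Group = "water")

# Bacteria measured in both environments
shared <- setdiff(intersect(names(soil), names(water)), "Group")

res <- bind_rows(soil, water) %>%
  select(all_of(c(shared, "Group"))) %>%   # drop bacteria seen in one group only
  pivot_longer(all_of(shared)) %>%         # long format: one row per sample and bacterium
  group_by(name) %>%
  summarise(mean_soil = mean(value[Group == "soil"]),
            mean_water = mean(value[Group == "water"]),
            pvalue = wilcox.test(value ~ Group, exact = FALSE)$p.value)
res
```

Bacteria found in only one environment are left out of `res`; whether to treat them as zero counts in the other environment instead is a separate modelling decision that this sketch does not make.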