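Using R, apply multiple chi-squared contingency table tests to a grouped data frame and add a new column containing the p-values of the tests

Time: 2018-04-04 19:39:09

Tags: r tidyverse chi-squared

I have a data frame similar to the example below (this is a small extract of my actual data frame).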

frequencies <- data.frame(sex=c("female", "female", "male", "male", "female", "female", "male", "male", "female", "female", "male", "male", "female", "female", "male", "male"),
                      ecotype=c("Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave"),
                      contig_ID=c("Contig100169_2367", "Contig100169_2367", "Contig100169_2367", "Contig100169_2367", "Contig100169_2367", "Contig100169_2367", "Contig100169_2367", "Contig100169_2367", 
                                  "Contig100169_2481", "Contig100169_2481", "Contig100169_2481", "Contig100169_2481", "Contig100169_2481", "Contig100169_2481", "Contig100169_2481", "Contig100169_2481"),
                      allele=c("p", "p", "p", "p", "q", "q", "q", "q", "p", "p", "p", "p", "q", "q", "q", "q"),
                      frequency=c(157, 98, 140, 65, 29, 8, 26, 9, 182, 108, 147, 80, 46, 4, 49, 4))

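I would like to run a separate chi-squared test for each combination of 'contig_ID' and 'ecotype', testing for an association between 'sex' and 'allele'. I would then like to summarise the results in a table that includes the p-value for each combination of 'contig_ID' and 'ecotype'. For the example table given, I would expect a results table with 4 p-values, like the example below.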

results <- data.frame(ecotype=c("Crab", "Wave", "Crab", "Wave"),
                  contig_ID=c("Contig100169_2367", "Contig100169_2367", "Contig100169_2481", "Contig100169_2481"),
                  pvalue=c("pval", "pval", "pval", "pval"))

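Alternatively, simply adding a p-value column to the original table would also work, with the p-value for each combination repeated across all relevant rows.

I have been trying to use functions such as lapply() and summarise() in combination with chisq.test() to achieve this, but have had no luck so far. I also tried an approach similar to the one in "R chi squared test (3x2 contingency table) for each row in a table", but could not get that to work either.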

1 Answer:

Answer 0: (score: 1)

We can group by the contig_ID and ecotype columns, create a nested data frame, and convert the data for each group into a matrix as follows.

library(tidyverse)

frequencies2 <- frequencies %>%
  group_by(contig_ID, ecotype) %>%
  nest() %>%
  mutate(M = map(data, function(dat){
    # One column per sex, one row per allele
    dat2 <- dat %>% spread(sex, frequency)
    # Drop the allele column and keep it as row names instead
    M <- as.matrix(dat2[, -1])
    row.names(M) <- dat2$allele
    return(M)
  }))

If we look at the first element of the M column, we see that the data for each group have been converted to a matrix.

frequencies2$M[[1]]
#   female male
# p    157  140
# q     29   26

From here, we can apply chisq.test to each matrix and pull out the p-value. frequencies3 is the final output.
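
The code that builds frequencies3 is not included here. What follows is a minimal sketch of that last step, not necessarily the answerer's exact code. It assumes the nested frequencies2 structure described in the answer, uses purrr's map_dbl() to extract one p-value per matrix, and names the new column pvalue after the results table in the question. The name frequencies_with_p is hypothetical and only illustrates the alternative of repeating each p-value on the original rows.

```r
library(tidyverse)

# Example data from the question
frequencies <- data.frame(
  sex       = rep(c("female", "female", "male", "male"), times = 4),
  ecotype   = rep(c("Crab", "Wave"), times = 8),
  contig_ID = rep(c("Contig100169_2367", "Contig100169_2481"), each = 8),
  allele    = rep(rep(c("p", "q"), each = 4), times = 2),
  frequency = c(157, 98, 140, 65, 29, 8, 26, 9,
                182, 108, 147, 80, 46, 4, 49, 4)
)

# Nest by group and build one allele-by-sex matrix per group
frequencies2 <- frequencies %>%
  group_by(contig_ID, ecotype) %>%
  nest() %>%
  mutate(M = map(data, function(dat) {
    dat2 <- dat %>% spread(sex, frequency)
    M <- as.matrix(dat2[, -1])
    row.names(M) <- dat2$allele
    M
  }))

# Assumed final step: one chi-squared test per matrix, keeping only the p-value
frequencies3 <- frequencies2 %>%
  mutate(pvalue = map_dbl(M, ~ chisq.test(.x)$p.value)) %>%
  ungroup() %>%
  select(contig_ID, ecotype, pvalue)

# Alternative from the question: repeat each p-value on the original rows
# (frequencies_with_p is a hypothetical name)
frequencies_with_p <- frequencies %>%
  left_join(frequencies3, by = c("contig_ID", "ecotype"))
```

Note that chisq.test() applies Yates' continuity correction to 2x2 tables by default, and that it warns that the chi-squared approximation may be incorrect when expected counts are small, as happens for the Wave group of Contig100169_2481. For groups like that, fisher.test() may be worth considering.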