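I have a large data frame with species as rows and counts from 2 years as columns. For each row I want to create a contingency table to test whether there was a significant change (a decrease) from the first year to the second. Here is some similar made-up data: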
Species 2016 2017
cat 14 8
dog 16 12
bird 10 5
Then for each row I want a table like this:
cat      2016 2017
present    14    8
absent      0    6
dog      2016 2017
present    16   12
absent      0    4
bird     2016 2017
present    10    5
absent      0    5
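I will then run Fisher's exact test on each table to test whether the decrease is significant.
I think this could be done with dplyr, or with a loop over the rows similar to the question linked below, but I am not sure how to build the correct list of tables in the first place. How to convert data frame to contingency table in R?
So far I have only managed it one row at a time: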
A <- df[1,1:3]
A[2,] <- 0
A[2,3] <- (A[1,2] - A[1,3])
fisher.test(A[2:3])
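Any suggestions on how to apply this approach to a large number of rows would be greatly appreciated! My brain is really struggling with the coding.
Answer 0 (score: 1)
Here is a solution using base R. You could probably use some of the ideas in this answer to write something more concise. Let me know if this works for you!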
# Create the example data frame
df <- data.frame(Species = c("cat", "dog", "bird"),
                 year_2016 = c(14, 16, 10),
                 year_2017 = c(8, 12, 5),
                 stringsAsFactors = FALSE)
# Create the columns that become the second row of each 2x2 matrix.
# Despite the names, both hold "absent" counts: `absent` is the first year (0)
# and `present` is the second year (the decline between the years).
df$absent <- 0
df$present <- df$year_2016 - df$year_2017
# Transpose the data frame so that each species becomes a column
df_t <- t(df)
colnames(df_t) <- as.vector(df_t[1,])
df_t <- df_t[-1,]
class(df_t) <- "numeric"
# Use lapply to create one matrix per species
matrix_list <- lapply(seq_len(ncol(df_t)), function(x) matrix(as.vector(df_t[,x]), 2, 2, byrow = TRUE))
names(matrix_list) <- colnames(df_t)
matrix_list
$cat
     [,1] [,2]
[1,]   14    8
[2,]    0    6

$dog
     [,1] [,2]
[1,]   16   12
[2,]    0    4

$bird
     [,1] [,2]
[1,]   10    5
[2,]    0    5
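The matrices above are unlabeled, while the question asks for tables with present/absent rows and year columns. A minimal sketch of how labels could be attached through `dimnames` follows; `make_table` and `labeled_list` are hypothetical names introduced here for illustration, not part of the original answer.

```r
# Same example data as in the answer
df <- data.frame(Species = c("cat", "dog", "bird"),
                 year_2016 = c(14, 16, 10),
                 year_2017 = c(8, 12, 5),
                 stringsAsFactors = FALSE)

# Hypothetical helper: build one labeled 2x2 table from two yearly counts
make_table <- function(first, second) {
  matrix(c(first, second,
           0, first - second),
         nrow = 2, byrow = TRUE,
         dimnames = list(status = c("present", "absent"),
                         year = c("2016", "2017")))
}

# One table per row of df, named by species
labeled_list <- Map(make_table, df$year_2016, df$year_2017)
names(labeled_list) <- df$Species
labeled_list$cat
```

fisher.test ignores the labels, so `lapply(labeled_list, fisher.test)` should behave the same as with the unlabeled matrices.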
# Lots of fisher.tests (stored in `tests` so the p-values can be extracted later)
tests <- lapply(matrix_list, fisher.test)
tests
$cat

    Fisher's Exact Test for Count Data

data:  X[[i]]
p-value = 0.01594
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.516139      Inf
sample estimates:
odds ratio
       Inf

$dog

    Fisher's Exact Test for Count Data

data:  X[[i]]
p-value = 0.1012
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.7200866       Inf
sample estimates:
odds ratio
       Inf

$bird

    Fisher's Exact Test for Count Data

data:  X[[i]]
p-value = 0.03251
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.195396      Inf
sample estimates:
odds ratio
       Inf
Then, if you want the p-values, you can use sapply to get them as a vector:
sapply(tests, "[[", "p.value")
cat dog bird
0.01594203 0.10122358 0.03250774
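The p-values above are two-sided, whereas the question asks specifically about a decrease. As a sketch only (this is not part of the original answer), fisher.test accepts `alternative = "greater"`, which with this table layout tests whether the odds of being present were higher in the first year than in the second. The name `one_sided` is an assumption introduced here.

```r
# Same example data as in the answer
df <- data.frame(Species = c("cat", "dog", "bird"),
                 year_2016 = c(14, 16, 10),
                 year_2017 = c(8, 12, 5),
                 stringsAsFactors = FALSE)

# One 2x2 matrix per species: row 1 = present, row 2 = absent
matrix_list <- lapply(seq_len(nrow(df)), function(i) {
  matrix(c(df$year_2016[i], df$year_2017[i],
           0, df$year_2016[i] - df$year_2017[i]),
         nrow = 2, byrow = TRUE)
})
names(matrix_list) <- df$Species

# One-sided test: alternative hypothesis is that the odds ratio exceeds 1
one_sided <- sapply(matrix_list, function(m) {
  fisher.test(m, alternative = "greater")$p.value
})
one_sided
```

Whether a one-sided test is appropriate depends on whether the direction was fixed before looking at the data.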
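Edit: this may be a slight improvement, as it is more concise. If you are worried about performance (or you have a lot of tests to run), I can check how it scales with microbenchmark later today. Also, remember to adjust these p-values for multiple testing, given how many tests you are running ;). Finally, if you prefer the tidyverse over base, @tmfmnk posted a great tidyverse solution.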
# Create the columns that become the second row of each 2x2 matrix
# (first-year absent count is 0; second-year absent count is the decline)
df$absent <- 0
df$present <- df$year_2016 - df$year_2017
df_t <- t(df[-1]) # transpose the data frame, excluding the Species column
# Use lapply to create the list of matrices
matrix_list <- lapply(seq_len(ncol(df_t)), function(x) matrix(as.vector(df_t[,x]), 2, 2, byrow = TRUE))
names(matrix_list) <- df$Species
# Running the fisher's test on every matrix
# in the list and extracting the p-values
tests <- lapply(matrix_list, fisher.test)
sapply(tests, "[[", "p.value")
cat dog bird
0.01594203 0.10122358 0.03250774
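The reminder above about adjusting p-values for multiple testing can be sketched with base R's `p.adjust`. The choice of the Holm and Benjamini-Hochberg methods here is an assumption for illustration; the answer does not say which correction to use.

```r
# Unadjusted p-values copied from the output above
p_values <- c(cat = 0.01594203, dog = 0.10122358, bird = 0.03250774)

# Holm controls the family-wise error rate
p_holm <- p.adjust(p_values, method = "holm")

# Benjamini-Hochberg controls the false discovery rate
p_bh <- p.adjust(p_values, method = "BH")

# Side-by-side comparison, one row per species
cbind(raw = p_values, holm = p_holm, BH = p_bh)
```

With a large number of species the adjustment can change which declines count as significant, so it is worth deciding on the method before running the tests.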
Final edit. I was able to run both solutions through microbenchmark and wanted to post the results for anyone who comes across this post later:
Unit: milliseconds
          expr    min     lq   mean median     uq    max neval
 tidyverse_sol 12.506 13.497 15.130 14.560 15.827 26.205   100
      base_sol  1.120  1.162  1.339  1.225  1.296  5.712   100
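The timings above are for the three-row example. As a rough sketch of how the base approach might be checked on more rows, the following uses a hypothetical helper `fisher_p_by_row` and simulated counts (both are assumptions, not from the post), and times it with base `system.time` instead of microbenchmark so that no extra package is needed.

```r
# Hypothetical helper: one p-value per row, without storing the matrices
fisher_p_by_row <- function(first, second) {
  vapply(seq_along(first), function(i) {
    m <- matrix(c(first[i], second[i],
                  0, first[i] - second[i]),
                nrow = 2, byrow = TRUE)
    fisher.test(m)$p.value
  }, numeric(1))
}

# Sanity check against the example data from the answer
example_p <- fisher_p_by_row(c(14, 16, 10), c(8, 12, 5))

# Simulated data (an assumption): 1000 species, each declining by 1 to 5
set.seed(1)
n <- 1000
first <- sample(10:50, n, replace = TRUE)
second <- first - sample(1:5, n, replace = TRUE)

# Time the whole run; the elapsed time depends on the machine
timing <- system.time(p <- fisher_p_by_row(first, second))
timing["elapsed"]
```

Because `vapply` returns a plain numeric vector, the result can be assigned straight back as a column, for example `df$p_value <- fisher_p_by_row(df$year_2016, df$year_2017)`.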
Answer 1 (score: 1)
One tidyverse possibility could be:
library(tidyverse)
library(broom)
df %>%
  rowid_to_column() %>%
  gather(var, present, -c(Species, rowid)) %>%
  arrange(rowid, var) %>%
  group_by(rowid) %>%
  mutate(absent = lag(present, default = first(present)) - present) %>%
  ungroup() %>%
  select(-rowid, -var) %>%
  nest(data = c(present, absent)) %>%  # tidyr >= 1.0.0 syntax for nest()
  mutate(p_value = data %>%
           map(~fisher.test(.)) %>%
           map(tidy) %>%
           map_dbl(pluck, "p.value")) %>%
  select(-data)
  Species p_value
  <chr>     <dbl>
1 cat      0.0159
2 dog      0.101
3 bird     0.0325
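First, it reshapes the data from wide to long, leaving out the Species column and the row ID column. Second, it arranges the data by row ID and by the original column names (which refer to the years), and groups by row ID. Third, it computes the difference between the years. Finally, it nests the present and absent variables for each species, runs fisher.test on each nested table, and returns the p-value for each species.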
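As a shorter variation on the steps described above (a sketch, not the answerer's code), the reshaping and nesting can be skipped by building each 2x2 matrix directly from the two count columns with `purrr::map2_dbl`. It assumes the column names `year_2016` and `year_2017` from the other answer; `result` is a name introduced here.

```r
library(dplyr)
library(purrr)

# Same example data as in the other answer
df <- data.frame(Species = c("cat", "dog", "bird"),
                 year_2016 = c(14, 16, 10),
                 year_2017 = c(8, 12, 5),
                 stringsAsFactors = FALSE)

result <- df %>%
  mutate(p_value = map2_dbl(year_2016, year_2017, function(first, second) {
    # Row 1 = present counts, row 2 = absent counts (0, then the decline)
    m <- matrix(c(first, second,
                  0, first - second),
                nrow = 2, byrow = TRUE)
    fisher.test(m)$p.value
  })) %>%
  select(Species, p_value)
result
```

This keeps one row per species throughout, which may be easier to follow when there are only two year columns; the reshaping approach generalizes more naturally if further years are added.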