Question

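I have a large data frame with species as rows and counts from two years as columns. For each row, I want to create a contingency table to test whether there was a significant change (a decrease) from the first year to the second. Here is some similar mock data: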

Species   2016    2017
cat        14      8
dog        16      12
bird       10      5

Then, for each row, I want a table like this:

cat       2016 2017
present   14    8
absent     0    6

dog       2016  2017
present   16    12
absent     0    4

bird      2016  2017
present    10    5
absent      0    5

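I will then run a Fisher's exact test on each table to check whether the decrease is significant.

I think this could be done with dplyr, or with a loop over the rows similar to the one in the linked question "How to convert data frame to contingency table in R?", but I am not sure how to build the correct list of tables in the first place.

So far I have only done it one row at a time: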

A <- df[1,1:3]
A[2,] <- 0
A[2,3] <- (A[1,2] - A[1,3])
fisher.test(A[2:3])

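Any suggestions on how to apply this approach to a large number of rows would be greatly appreciated! My brain really struggles with coding.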

Answer 1

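Here is a solution using base R. You can probably use some of the ideas in this answer to come up with something more concise. Let me know if this works for you!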

# Create the data frame
df <- data.frame(Species = c("cat", "dog", "bird"),
                 year_2016 = c(14, 16, 10),
                 year_2017 = c(8, 12, 5),
                 stringsAsFactors = FALSE)

# Create columns to later convert to a matrix.
# Together they form the second ("absent") row of each 2x2 table:
# `absent` is the 2016 cell (0), `present` is the 2017 cell (the drop)
df$absent <- 0
df$present <- df$year_2016 - df$year_2017

# Transpose the data frame so that each species becomes a column
df_t <- t(df)
colnames(df_t) <- as.vector(df_t[1,])
df_t <- df_t[-1,]
class(df_t) <- "numeric"  # t() returned a character matrix; convert it back

# Use lapply to create one 2x2 matrix per species
matrix_list <- lapply(seq_len(ncol(df_t)), function(x) matrix(as.vector(df_t[,x]), 2, 2, byrow = TRUE))
names(matrix_list) <- colnames(df_t)
matrix_list
$cat
     [,1] [,2]
[1,]   14    8
[2,]    0    6

$dog
     [,1] [,2]
[1,]   16   12
[2,]    0    4

$bird
     [,1] [,2]
[1,]   10    5
[2,]    0    5

# Lots of fisher.tests (stored in `tests` so the p-values can be extracted later)
tests <- lapply(matrix_list, fisher.test)
tests
$cat

    Fisher's Exact Test for Count Data

data:  X[[i]]
p-value = 0.01594
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.516139      Inf
sample estimates:
odds ratio 
       Inf 


$dog

    Fisher's Exact Test for Count Data

data:  X[[i]]
p-value = 0.1012
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.7200866       Inf
sample estimates:
odds ratio 
       Inf 


$bird

    Fisher's Exact Test for Count Data

data:  X[[i]]
p-value = 0.03251
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.195396      Inf
sample estimates:
odds ratio 
       Inf

Then, if you want the p-values, you can use sapply to get them as a vector:

sapply(tests, "[[", "p.value")
       cat        dog       bird 
0.01594203 0.10122358 0.03250774
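
The tests above are two-sided, while the question asks specifically about a decrease. A one-sided test may be closer to that goal. The following is only a sketch: the helper name `decrease_p_value` and the data frame name `counts` are made up for illustration, and it assumes the second-year count never exceeds the first-year count (otherwise the "absent" cell would be negative and fisher.test would fail).

```r
# Hypothetical helper (not part of the original answer):
# one-sided Fisher test for a drop in counts between two years
decrease_p_value <- function(first_year, second_year) {
  # Rows: present / absent; columns: first year / second year
  tab <- matrix(c(first_year, second_year,
                  0, first_year - second_year),
                nrow = 2, byrow = TRUE)
  # alternative = "greater" tests whether the odds ratio exceeds 1,
  # which for this layout corresponds to a decrease in the second year
  fisher.test(tab, alternative = "greater")$p.value
}

counts <- data.frame(Species = c("cat", "dog", "bird"),
                     year_2016 = c(14, 16, 10),
                     year_2017 = c(8, 12, 5))

one_sided <- mapply(decrease_p_value, counts$year_2016, counts$year_2017)
names(one_sided) <- counts$Species
one_sided
```

Whether a one-sided test is appropriate depends on whether the direction of the change was decided before looking at the data.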

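Edit: This may be a slight improvement, as it is more concise. If you are worried about performance (or you have a large number of tests to run), I can check how it scales with microbenchmark later today. Also, remember to correct those p-values for multiple testing, given how many tests you are running ;). And if you prefer the tidyverse to base R, @tmfmnk posted a great tidyverse solution.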

# Create columns to later convert to a matrix
df$absent <- 0
df$present <- df$year_2016 - df$year_2017
df_t <- t(df[-1]) # transpose the data frame, excluding the Species column

# Use lapply to create the list of matrices
matrix_list <- lapply(seq_len(ncol(df_t)), function(x) matrix(as.vector(df_t[,x]), 2, 2, byrow = TRUE))
names(matrix_list) <- df$Species

# Run Fisher's test on every matrix
# in the list and extract the p-values
tests <- lapply(matrix_list, fisher.test)
sapply(tests, "[[", "p.value")
       cat        dog       bird 
0.01594203 0.10122358 0.03250774
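
On the multiple-testing point mentioned in the edit, one possible way to adjust the p-values is base R's p.adjust. This is a minimal sketch using the p-values printed above; the choice between Holm and Benjamini-Hochberg is an assumption here and depends on the analysis.

```r
# Raw p-values as reported in the output above
p_raw <- c(cat = 0.01594203, dog = 0.10122358, bird = 0.03250774)

# "holm" controls the family-wise error rate;
# "BH" (Benjamini-Hochberg) controls the false discovery rate instead
p_holm <- p.adjust(p_raw, method = "holm")
p_bh   <- p.adjust(p_raw, method = "BH")

data.frame(p_raw, p_holm, p_bh)
```

With thousands of species, the adjustment can change which rows count as significant, so it is worth applying before interpreting individual results.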

Final edit. I was able to run both solutions through microbenchmark and wanted to post the results for anyone who comes across this post later:

Unit: milliseconds

expr           min    lq     mean   median uq     max     neval
tidyverse_sol  12.506 13.497 15.130 14.560 15.827 26.205  100
base_sol       1.120  1.162  1.339  1.225  1.296  5.712   100

Answer 2

One tidyverse possibility could be:

library(tidyverse)
library(broom)

df %>%
 rowid_to_column() %>%
 # wide to long: one row per species and year
 gather(var, present, -c(Species, rowid)) %>%
 arrange(rowid, var) %>%
 group_by(rowid) %>%
 # drop relative to the previous year (0 for the first year)
 mutate(absent = lag(present, default = first(present)) - present) %>%
 ungroup() %>%
 select(-rowid, -var) %>%
 # tidyr >= 1.0 syntax; older versions used nest(present, absent)
 nest(data = c(present, absent)) %>%
 mutate(p_value = data %>%
         map(~fisher.test(.)) %>%
         map(tidy) %>%
         map_dbl(pluck, "p.value")) %>%
 select(-data)

  Species p_value
  <chr>     <dbl>
1 cat      0.0159
2 dog      0.101 
3 bird     0.0325

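First, it transforms the data from wide to long format, leaving out the Species column and the column holding the row ID. Second, it arranges the data by row ID and by the original column names (the years), and groups by row ID. Third, it computes the difference between the years. Finally, it nests the present and absent variables for each species, runs fisher.test on each nested table, and returns the p-value for each species.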
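
The same steps can also be sketched with the newer tidyr verb pivot_longer. This is an illustrative variant rather than the answer's own code: the data frame name `counts` is made up, and it assumes exactly two year columns listed in chronological order.

```r
library(dplyr)
library(tidyr)

counts <- data.frame(Species = c("cat", "dog", "bird"),
                     year_2016 = c(14, 16, 10),
                     year_2017 = c(8, 12, 5))

result <- counts %>%
  # Wide to long: one row per species and year
  pivot_longer(c(year_2016, year_2017),
               names_to = "year", values_to = "present") %>%
  group_by(Species) %>%
  # Absent = drop relative to the first year (0 for the first year itself)
  mutate(absent = first(present) - present) %>%
  # cbind() builds the 2x2 table for the current species
  summarise(p_value = fisher.test(cbind(present, absent))$p.value)

result
```

Grouping by Species directly assumes that species names are unique; the row ID used in the answer above avoids that assumption.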

How to create a contingency table for each row of a data frame
