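How to create a contingency table for each row of a data frame

Asked: 2019-03-30 01:58:36

Tags: r dplyr apply contingency

I have a large data frame with species as rows and counts from two years as columns. I want to create a contingency table for each row, to test whether there was a significant change (a decline) from the first year to the second. Here is some similar mock data: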

Species   2016    2017
cat        14      8
dog        16      12
bird       10      5

Then for each row I want a table like this:

cat       2016  2017
present   14    8
absent     0    6

dog       2016  2017
present   16    12
absent     0    4

bird      2016  2017
present   10    5
absent     0    5

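I will then run Fisher's exact test on each table to test whether the decline is significant.

I think this can be done with dplyr, or with a loop over the rows similar to the linked question below, but I am not sure how to build the correct list of tables in the first place. How to convert data frame to contingency table in R?

I can do it one row at a time: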

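# Take the first row (Species plus the two year columns)
A <- df[1, 1:3]
# Add a second row of zeros for the "absent" counts
A[2, ] <- 0
# Absent in the second year = first-year count minus second-year count
A[2, 3] <- (A[1, 2] - A[1, 3])
# Test the 2x2 table formed by the two year columns
fisher.test(A[2:3])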

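Any advice on how to apply this approach to a large number of rows would be much appreciated! My brain is really struggling with the coding.

2 Answers:

Answer 0: (score: 1)

Here is a solution using base R. You can probably use some of the ideas in this answer to write something more concise. Let me know if this works for you!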

# Create dataframe
df <- data.frame(Species = c("cat", "dog", "bird"),
                 year_2016 = c(14, 16, 10),
                 year_2017 = c(8, 12, 5),
                 stringsAsFactors = FALSE)

# Add the "absent" counts for each year; these become the second row of each matrix
df$absent_2016 <- 0
df$absent_2017 <- df$year_2016 - df$year_2017

# Transpose the dataframe so that each species becomes a column
df_t <- t(df)
colnames(df_t) <- as.vector(df_t[1, ])
df_t <- df_t[-1, ]
class(df_t) <- "numeric"  # t() produced a character matrix; convert back to numbers

# Use lapply to create one 2x2 matrix per species (row 1 = present, row 2 = absent)
matrix_list <- lapply(seq_len(ncol(df_t)), function(x) matrix(as.vector(df_t[, x]), 2, 2, byrow = TRUE))
names(matrix_list) <- colnames(df_t)
matrix_list
$cat
     [,1] [,2]
[1,]   14    8
[2,]    0    6

$dog
     [,1] [,2]
[1,]   16   12
[2,]    0    4

$bird
     [,1] [,2]
[1,]   10    5
[2,]    0    5

# Run Fisher's exact test on every matrix and keep the results
tests <- lapply(matrix_list, fisher.test)
tests
$cat

    Fisher's Exact Test for Count Data

data:  X[[i]]
p-value = 0.01594
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.516139      Inf
sample estimates:
odds ratio 
       Inf 


$dog

    Fisher's Exact Test for Count Data

data:  X[[i]]
p-value = 0.1012
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.7200866       Inf
sample estimates:
odds ratio 
       Inf 


$bird

    Fisher's Exact Test for Count Data

data:  X[[i]]
p-value = 0.03251
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.195396      Inf
sample estimates:
odds ratio 
       Inf 

Then, if you want the p-values, you can use sapply to get them as a vector:

sapply(tests, "[[", "p.value")
       cat        dog       bird 
0.01594203 0.10122358 0.03250774 
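
The tests above are two-sided (the default in fisher.test), while the question asks specifically about a decline. A minimal sketch of a one-sided version follows, assuming the matrices are laid out as above (row 1 = present, row 2 = absent, column 1 = first year) and that no species increased, since a negative "absent" count would make fisher.test fail. With that layout a decline corresponds to an odds ratio above 1, hence alternative = "greater". The names p_two_sided and p_one_sided are only illustrative.

```r
# Same example data as in the answer
df <- data.frame(Species = c("cat", "dog", "bird"),
                 year_2016 = c(14, 16, 10),
                 year_2017 = c(8, 12, 5),
                 stringsAsFactors = FALSE)

# Build one 2x2 matrix per species: row 1 = present, row 2 = absent
matrix_list <- Map(function(y1, y2) {
  matrix(c(y1, y2,
           0,  y1 - y2),
         nrow = 2, byrow = TRUE,
         dimnames = list(c("present", "absent"), c("year_1", "year_2")))
}, df$year_2016, df$year_2017)
names(matrix_list) <- df$Species

# Two-sided p-values (the default) next to one-sided ones for a decline
p_two_sided <- sapply(matrix_list, function(m) fisher.test(m)$p.value)
p_one_sided <- sapply(matrix_list, function(m)
  fisher.test(m, alternative = "greater")$p.value)

cbind(p_two_sided, p_one_sided)
```

Whether a one-sided test is appropriate depends on whether the direction of the change was decided before looking at the data.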

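Edit: This may be a slight improvement; it is more concise. If you are concerned about performance (or have a large number of tests to run), I can check later today how it scales using microbenchmark. Also, remember to correct those p-values for multiple testing with all these tests ;). And if you prefer tidyverse over base, @tmfmnk posted a great tidyverse solution.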

# Add the "absent" counts for each year (second row of each matrix)
df$absent_2016 <- 0
df$absent_2017 <- df$year_2016 - df$year_2017
# Transpose only the four count columns (the species column is left out)
df_t <- t(df[c("year_2016", "year_2017", "absent_2016", "absent_2017")])

# Use lapply to create the list of matrices
matrix_list <- lapply(seq_len(ncol(df_t)), function(x) matrix(as.vector(df_t[, x]), 2, 2, byrow = TRUE))
names(matrix_list) <- df$Species

# Running the fisher's test on every matrix 
# in the list and extracting the p-values
tests <- lapply(matrix_list, fisher.test)
sapply(tests, "[[", "p.value")
       cat        dog       bird 
0.01594203 0.10122358 0.03250774 
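
The reminder about penalizing the p-values can be sketched with base R's p.adjust. This is only an illustration: the choice between Holm (family-wise error rate) and Benjamini-Hochberg (false discovery rate) is an assumption here, not something the answer specifies, and the p-values are copied from the output above.

```r
# p-values reported above for the three species
p_values <- c(cat = 0.01594203, dog = 0.10122358, bird = 0.03250774)

# Holm controls the family-wise error rate; BH controls the false discovery rate
p_holm <- p.adjust(p_values, method = "holm")
p_bh   <- p.adjust(p_values, method = "BH")

# Raw and adjusted p-values side by side
data.frame(raw = p_values, holm = p_holm, BH = p_bh)
```

In practice you would pass the vector returned by sapply(tests, "[[", "p.value") directly to p.adjust.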

Final edit. I was able to run both solutions through microbenchmark and wanted to post the results for anyone who comes across this post later:

Unit: milliseconds

expr           min    lq     mean   median uq     max     neval
tidyverse_sol  12.506 13.497 15.130 14.560 15.827 26.205  100
base_sol       1.120  1.162  1.339  1.225  1.296  5.712   100

Answer 1: (score: 1)

One tidyverse possibility could be:

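library(tidyverse)
library(broom)

# df is the original three-column data frame (Species, year_2016, year_2017)
df %>%
 rowid_to_column() %>%
 # Wide to long: one row per species and year
 gather(var, present, -c(Species, rowid)) %>%
 arrange(rowid, var) %>%
 group_by(rowid) %>%
 # Absent = decline from the first year (0 in the first year itself)
 mutate(absent = lag(present, default = first(present)) - present) %>%
 ungroup() %>%
 select(-rowid, -var) %>%
 # One 2x2 table per species (tidyr >= 1.0 syntax)
 nest(data = c(present, absent)) %>%
 mutate(p_value = data %>%
         map(~fisher.test(.)) %>%
         map(tidy) %>%
         map_dbl(pluck, "p.value")) %>%
 select(-data)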

  Species p_value
  <chr>     <dbl>
1 cat      0.0159
2 dog      0.101 
3 bird     0.0325

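First, it reshapes the data from wide to long, leaving out the Species column and the row ID column. Second, it arranges the data by row ID and by the original column names (the years), and groups by row ID. Third, it computes the difference between the years. Finally, it nests the present and absent variables for each species, runs fisher.test on each nested table, and returns the p-value for each species.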
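
To make the first three steps easier to follow, here is a sketch that stops the pipeline before the nesting step so the intermediate long-format table can be inspected. It assumes the original three-column df from the question; the name long is only illustrative.

```r
library(tidyverse)

# Original example data (Species plus the two year columns)
df <- data.frame(Species = c("cat", "dog", "bird"),
                 year_2016 = c(14, 16, 10),
                 year_2017 = c(8, 12, 5),
                 stringsAsFactors = FALSE)

long <- df %>%
  rowid_to_column() %>%
  # Step 1: wide to long, keeping Species and rowid as identifiers
  gather(var, present, -c(Species, rowid)) %>%
  # Step 2: order by species (rowid) and year, then group by species
  arrange(rowid, var) %>%
  group_by(rowid) %>%
  # Step 3: absent = first-year count minus current count within each species
  mutate(absent = lag(present, default = first(present)) - present) %>%
  ungroup()

# Two rows per species, one per year, with present and absent counts
long
```

Each pair of rows in long holds the present and absent counts that the pipeline then nests and passes to fisher.test.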