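Running pairwise Fisher tests on all column combinations between two data.frames

Asked: 2015-02-11 04:28:37

Tags: r dataframe permutation

I have two data.frames: editCounts and nonEditCounts. They have the same dimensions and the same column and row names, but the actual data differ. Here is the head of each: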

> head(editCounts)
                        Samp0         Samp1       Samp2
chr10_101992307             0             4           3
chr10_101992684             4             0           1
chr10_127480585             0             3           0
chr10_16479385              3             3           3
chr10_73979859              0             3           2
chr10_73979940              0             3           8
> head(nonEditCounts)
                        Samp0         Samp1       Samp2
chr10_101992307             0             4           3
chr10_101992684            15             0           4
chr10_127480585             0             6           0
chr10_16479385              7             7           4
chr10_73979859              0            13           7
chr10_73979940              0            21          10

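The end goal is to run pairwise Fisher's exact tests (using fisher.test()) for every pair of columns, row by row, combining the counts from the two data.frames. As output, I want a table holding the resulting p-value of each pairwise comparison for each row name, for example: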

               Samp0_vs_Samp1     Samp0_vs_Samp2     Samp1_vs_Samp2 
chr10_101992307          pval               pval               pval 
chr10_101992684          pval               pval               pval 
chr10_127480585          pval               pval               pval 
chr10_16479385           pval               pval               pval 
chr10_73979859           pval               pval               pval 
...                       ...                ...                ...

So, taking Samp0 and Samp1 as an example, the first Fisher test would use a matrix like this:

    > tempMat=matrix(c(editCounts$Samp0[1], nonEditCounts$Samp0[1],
    +                  editCounts$Samp1[1], nonEditCounts$Samp1[1]), 2, 2)
    > tempMat
         [,1] [,2]
    [1,]    0    4
    [2,]    0    4

These values correspond to the first row (chr10_101992307). In this case, the Fisher test gives a p-value of 1.
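To make that single test concrete, here is a minimal sketch that rebuilds the 2x2 table for chr10_101992307 from the counts shown above and passes it to fisher.test(). The variable names edit_row, nonedit_row, temp_mat and p_value are only illustrative; the counts are typed in by hand rather than read from the data.frames.

```r
# Counts for chr10_101992307 in Samp0 and Samp1, copied from the tables above.
edit_row    <- c(Samp0 = 0, Samp1 = 4)
nonedit_row <- c(Samp0 = 0, Samp1 = 4)

# Row 1 holds the edited counts, row 2 the non-edited counts;
# the two columns are the two samples being compared.
temp_mat <- rbind(edit = edit_row, nonEdit = nonedit_row)

# Two-sided Fisher's exact test on the 2x2 contingency table.
p_value <- fisher.test(temp_mat, alternative = "two.sided")$p.value
print(p_value)
```

Because the Samp0 column is all zeros, only one table is possible given the margins, which is why the test cannot reject anything for this row.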

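I know I can use combn() to compute every pairwise combination of columns, but I am not sure how to loop over each pair, build a contingency table from the four values, and run the Fisher test. The code I have written so far is shown below; however, it throws an error when it tries to create tempMat.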

editCounts    <- read.table("editCountMatrix.txt", sep="\t", header=TRUE, row.names=1)
nonEditCounts <- read.table("nonEditCountMatrix.txt", sep="\t", header=TRUE, row.names=1)

pairwiseComb <- combn(names(editCounts),2)

for (j in seq(1,length(pairwiseComb),2)){
  tempCol1 = pairwiseComb[[j]]
  tempCol2 = pairwiseComb[[j+1]]
  cat("Processing: ",tempCol1," vs. ",tempCol2, "\n", sep="") # Prints correctly
  for (i in 1:nrow(editCounts)){
    tempMat=matrix(c(editCounts$tempCol1[i], nonEditCounts$tempCol1[i],
                 editCounts$tempCol2[i], nonEditCounts$tempCol2[i]), 2, 2)
    tempFisher=fisher.test(tempMat, alternative="two.sided")
    pval=tempFisher$p.value
    pvalAdj=p.adjust(pval,method="fdr")
  }
}

The resulting error is:

Error in matrix(c(editCounts$tempCol1[i], nonEditCounts$tempCol1[i], editCounts$tempCol2[i],  : 
  'data' must be of a vector type, was 'NULL'

Any help is greatly appreciated.

Thanks!

1 Answer:

Answer 0 (score: 0)

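Here is a suggested solution. I have corrected a few small indexing problems in your code, and I suggest using a pre-allocated matrix to store the Fisher's exact test results.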
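The error in the question most likely comes from how `$` resolves names: `editCounts$tempCol1` looks for a column literally called "tempCol1" instead of using the string stored in the variable, so it returns NULL. A small sketch of the difference, using a made-up two-row data frame (the values and the names df and col_name are assumptions for illustration only):

```r
# Small stand-in data frame (hypothetical values, only for illustration).
df <- data.frame(Samp0 = c(0, 4), Samp1 = c(4, 0))
col_name <- "Samp0"

# `$` takes the name literally: it looks for a column called "col_name".
by_dollar <- df$col_name
print(is.null(by_dollar))

# `[[` evaluates its argument, so it uses the string stored in col_name.
by_bracket <- df[[col_name]]
print(by_bracket)

# Indexing NULL gives NULL, and c() of NULLs is still NULL,
# which is the kind of input matrix() refuses in the question's error.
combined <- c(df$col_name[1], df$col_name[1])
print(is.null(combined))
```

Indexing with `[[name]]` or `[row, name]`, as the code below does, avoids the problem.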

# Create data.frames using your sample data.
editCounts <- read.table(header=TRUE,
text="                        Samp0         Samp1       Samp2
chr10_101992307             0             4           3
chr10_101992684             4             0           1
chr10_127480585             0             3           0
chr10_16479385              3             3           3
chr10_73979859              0             3           2
chr10_73979940              0             3           8")

nonEditCounts <- read.table(header=TRUE,
text="                        Samp0         Samp1       Samp2
chr10_101992307             0             4           3
chr10_101992684            15             0           4
chr10_127480585             0             6           0
chr10_16479385              7             7           4
chr10_73979859              0            13           7
chr10_73979940              0            21          10")

pairwiseComb <- combn(names(editCounts), 2)

# Create a matrix to hold results.
results <- matrix(NA, ncol=ncol(pairwiseComb), nrow=nrow(editCounts))

# Create row and column names to use for indexing/assignment of results.
rownames(results) <- rownames(editCounts)
colnames(results) <- apply(pairwiseComb, 2, 
                           function(x) {paste(x[1], "_vs_", x[2], sep="")})

# Loop over number of column pairs.
for (j in seq(ncol(pairwiseComb))) {
    tempCol1 <- pairwiseComb[1, j]
    tempCol2 <- pairwiseComb[2, j]
    resultsCol <- paste(tempCol1, "_vs_", tempCol2, sep="")
    cols <- c(tempCol1, tempCol2)
    # Loop over rownames.
    for (row in rownames(results)) {
        tempMat <- rbind(   editCounts[row, cols], # Grab values using row and
                         nonEditCounts[row, cols]) # column names. Use rbind to
                                                   # create two-row matrix.

        tempFisher <- fisher.test(tempMat, alternative="two.sided")
        results[row, resultsCol] <- tempFisher$p.value # Use row and column name
                                                       # indexing to assign
                                                       # p-value to results.
    }
}

# Compute adjusted p-values using all of the computed p-values, outside of loop.
padj <- results                           # First make copy of results matrix.  
padj[] <- p.adjust(results, method="fdr") # Trick to retain shape and attributes.

results
#                 Samp0_vs_Samp1 Samp0_vs_Samp2 Samp1_vs_Samp2
# chr10_101992307              1      1.0000000     1.00000000
# chr10_101992684              1      1.0000000     1.00000000
# chr10_127480585              1      1.0000000     1.00000000
# chr10_16479385               1      0.6436652     0.64366516
# chr10_73979859               1      1.0000000     1.00000000
# chr10_73979940               1      1.0000000     0.03290832

padj
#                 Samp0_vs_Samp1 Samp0_vs_Samp2 Samp1_vs_Samp2
# chr10_101992307              1              1      1.0000000
# chr10_101992684              1              1      1.0000000
# chr10_127480585              1              1      1.0000000
# chr10_16479385               1              1      1.0000000
# chr10_73979859               1              1      1.0000000
# chr10_73979940               1              1      0.5923497
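
As an alternative sketch, the two nested loops can also be written with apply() over the columns of the combn() result and mapply() over the rows. This is only one possible arrangement, not a replacement for the code above; the helper name pair_pvalues and the names results2 and padj2 are made up for this example, and the data frames are rebuilt inline so the block runs on its own.

```r
# Sample data from the question, rebuilt inline so the sketch is self-contained.
editCounts <- data.frame(
  Samp0 = c(0, 4, 0, 3, 0, 0),
  Samp1 = c(4, 0, 3, 3, 3, 3),
  Samp2 = c(3, 1, 0, 3, 2, 8),
  row.names = c("chr10_101992307", "chr10_101992684", "chr10_127480585",
                "chr10_16479385", "chr10_73979859", "chr10_73979940")
)
nonEditCounts <- data.frame(
  Samp0 = c(0, 15, 0, 7, 0, 0),
  Samp1 = c(4, 0, 6, 7, 13, 21),
  Samp2 = c(3, 4, 0, 4, 7, 10),
  row.names = rownames(editCounts)
)

# Hypothetical helper: Fisher p-values for one pair of columns, one per row.
pair_pvalues <- function(pair) {
  mapply(
    function(e1, n1, e2, n2) {
      # Columns are the two samples; rows are edited / non-edited counts.
      fisher.test(matrix(c(e1, n1, e2, n2), nrow = 2))$p.value
    },
    editCounts[[pair[1]]], nonEditCounts[[pair[1]]],
    editCounts[[pair[2]]], nonEditCounts[[pair[2]]]
  )
}

pairs <- combn(names(editCounts), 2)

# One column of p-values per column pair, one row per site.
results2 <- apply(pairs, 2, pair_pvalues)
dimnames(results2) <- list(rownames(editCounts),
                           paste(pairs[1, ], pairs[2, ], sep = "_vs_"))

# Adjust across all p-values at once, keeping the matrix shape.
padj2 <- results2
padj2[] <- p.adjust(results2, method = "fdr")

print(results2)
print(padj2)
```

On the sample data this should give the same p-values as the loop version; for large inputs the cost is still dominated by the individual fisher.test() calls.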