Vectorizing multiple Wilcoxon tests in R

Time: 2014-04-13 12:37:32

Tags: r statistics

Apologies for the inelegance of this question and its example. I am a physician who is out of his depth coding in R, but I want to get better.

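I need to perform multiple Wilcoxon tests on a dataset in R. (I am aware of the dangers of multiple comparisons; in fact, this is a follow-up to a set of LME analyses, done so that I can use the Hodges-Lehmann estimator.)

My data consist of multiple variables, measured at multiple timepoints in multiple subjects. I would like a way to compare the different timepoints, creating a new 'htest' object for each comparison.

Here is an MWE approximation of my data frame structure: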

example.data <- data.frame(
                matrix(data=c(
                'A',0,0,24,0,
                'A',1,1,20,-1,
                'A',2,2,18,-1.4,
                'A',3,0.5,21,-0.6,
                'B',0,0,22,0,
                'B',1,1.2,19,-2.2,
                'B',2,1.8,20,-3,
                'B',3,0.3,21,-1,
                'C',0,0,24,0,
                'C',1,0.8,22,0.1,
                'C',2,2.2,16,-0.6,
                'C',3,1,23,-0.2,
                'D',0,0,33,0,
                'D',1,6,31,-0.4,
                'D',2,6.3,27,-0.3,
                'D',3,2.2,31,-0.1),
                nrow=16,byrow=T))
colnames(example.data) <- c('Subject','Timepoint','Variable1','Variable2','Variable3')
example.data$Timepoint = factor(example.data$Timepoint,levels=c(0,1,2,3))
example.data[,3:5] = sapply(example.data[,3:5],as.numeric)

The best I could come up with is a very ugly for loop that looks like this:

## Step 2 - Multiple Wilcoxons

variablenames <- names(example.data)[-c(1,2)]

for (obj in variablenames){
    # build the name of the result object for each comparison
    obj.wilcoxon.Timepoint1 <- paste(obj,'.wilcoxon.Timepoint1',sep='')
    obj.wilcoxon.Timepoint2 <- paste(obj,'.wilcoxon.Timepoint2',sep='')
    obj.wilcoxon.Timepoint3 <- paste(obj,'.wilcoxon.Timepoint3',sep='')
    # paired test of baseline (Timepoint 0) against each later timepoint; columns are selected by name
    assign(eval(obj.wilcoxon.Timepoint1),wilcox.test(example.data[example.data$Timepoint==0,obj],example.data[example.data$Timepoint==1,obj],conf.int=TRUE,paired=TRUE))

    assign(eval(obj.wilcoxon.Timepoint2),wilcox.test(example.data[example.data$Timepoint==0,obj],example.data[example.data$Timepoint==2,obj],conf.int=TRUE,paired=TRUE))

    assign(eval(obj.wilcoxon.Timepoint3),wilcox.test(example.data[example.data$Timepoint==0,obj],example.data[example.data$Timepoint==3,obj],conf.int=TRUE,paired=TRUE))
}

I am sure there is an elegant, vectorized way to do this, but how should I go about it?

2 answers:

Answer 0: (score: 1)

First:

example.data[,3:5] = sapply(example.data[,3:5],as.numeric)

should be

example.data[,3:5] = apply(example.data[,3:5],2,as.numeric)
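
The reason for this change, as far as I can tell, is factor handling: before R 4.0, data.frame() converted strings to factors by default, and as.numeric() on a factor returns the internal level codes rather than the printed numbers. A minimal sketch of the difference, where stringsAsFactors = TRUE is my assumption to mimic that older default and the values are subject A's Variable2 readings:

```r
# Assumption: mimic a pre-4.0 session, where data.frame() turned strings into factors.
df <- data.frame(Variable2 = c("24", "20", "18", "21"), stringsAsFactors = TRUE)

# sapply() hands each factor column to as.numeric(), which returns the level codes.
codes <- sapply(df, as.numeric)
as.vector(codes)    # 4 2 1 3

# apply() first coerces the data frame to a character matrix, so the text itself is parsed.
values <- apply(df, 2, as.numeric)
as.vector(values)   # 24 20 18 21
```

On R 4.0 or later the columns are plain character vectors by default, so both forms should give the same numbers.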

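The following should give you a more compact solution.

First, load these two libraries. As Roland suggested, reshape2 is used to convert the data to long format, and dplyr is a faster version of plyr.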

library(reshape2)
library(dplyr)

Convert the data into the required format:

baseline = melt(example.data %.% filter(Timepoint==0) %.% select(-Timepoint), 
        "Subject", value.name = "base")
comparison = melt(example.data %.% filter(Timepoint!=0), c("Subject", "Timepoint"))
join.data = left_join(comparison, baseline)

You can see what join.data looks like:

> join.data
   Subject Timepoint  variable value base
1        A         1 Variable1   1.0    0
2        A         2 Variable1   2.0    0
3        A         3 Variable1   0.5    0
4        B         1 Variable1   1.2    0
5        B         2 Variable1   1.8    0
6        B         3 Variable1   0.3    0
7        C         1 Variable1   0.8    0
8        C         2 Variable1   2.2    0
9        C         3 Variable1   1.0    0
10       D         1 Variable1   6.0    0
11       D         2 Variable1   6.3    0
12       D         3 Variable1   2.2    0
13       A         1 Variable2  20.0   24
14       A         2 Variable2  18.0   24
15       A         3 Variable2  21.0   24
16       B         1 Variable2  19.0   22
17       B         2 Variable2  20.0   22
18       B         3 Variable2  21.0   22
19       C         1 Variable2  22.0   24
20       C         2 Variable2  16.0   24
21       C         3 Variable2  23.0   24
22       D         1 Variable2  31.0   33
23       D         2 Variable2  27.0   33
24       D         3 Variable2  31.0   33
25       A         1 Variable3  -1.0    0
26       A         2 Variable3  -1.4    0
27       A         3 Variable3  -0.6    0
28       B         1 Variable3  -2.2    0
29       B         2 Variable3  -3.0    0
30       B         3 Variable3  -1.0    0
31       C         1 Variable3   0.1    0
32       C         2 Variable3  -0.6    0
33       C         3 Variable3  -0.2    0
34       D         1 Variable3  -0.4    0
35       D         2 Variable3  -0.3    0
36       D         3 Variable3  -0.1    0

Finally, the main course:

res = join.data %.% group_by(variable) %.% do(
        function(df) {
                df %.% group_by(Timepoint) %.% do (
                    function(d) wilcox.test(d$base, d$value, conf.int=TRUE, paired=TRUE)
                    )
        })

res is a list of lists: res[[i]][[t]] is the result for variable i at timepoint t.

For example, res[[1]][[2]] is the result for variable 1 at timepoint 2.
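
The code above uses %.% and passes a function to do(), which matches the dplyr release that was current when this was written. Assuming a recent dplyr, where the pipe is %>% and group_map() is available, the same nested list could be built roughly as below. This is only a sketch: the typed data.frame() construction and the explicit by= argument are my additions, and with four subjects wilcox.test() is expected to warn about ties and unachievable confidence levels.

```r
library(reshape2)
library(dplyr)

# Example data from the question, built with typed columns (my restatement).
example.data <- data.frame(
  Subject   = rep(c("A", "B", "C", "D"), each = 4),
  Timepoint = factor(rep(0:3, times = 4), levels = 0:3),
  Variable1 = c(0, 1, 2, 0.5,  0, 1.2, 1.8, 0.3,  0, 0.8, 2.2, 1,  0, 6, 6.3, 2.2),
  Variable2 = c(24, 20, 18, 21,  22, 19, 20, 21,  24, 22, 16, 23,  33, 31, 27, 31),
  Variable3 = c(0, -1, -1.4, -0.6,  0, -2.2, -3, -1,  0, 0.1, -0.6, -0.2,  0, -0.4, -0.3, -0.1)
)

# Same reshaping as above, with %>% in place of %.%.
baseline <- melt(example.data %>% filter(Timepoint == 0) %>% select(-Timepoint),
                 "Subject", value.name = "base")
comparison <- melt(example.data %>% filter(Timepoint != 0), c("Subject", "Timepoint"))
join.data <- left_join(comparison, baseline, by = c("Subject", "variable"))

# group_map() calls the function once per group and returns a plain list,
# so the result keeps the res[[i]][[t]] layout.
res <- join.data %>% group_by(variable) %>% group_map(
  function(df, key) {
    df %>% group_by(Timepoint) %>% group_map(
      function(d, k) wilcox.test(d$base, d$value, conf.int = TRUE, paired = TRUE)
    )
  })

res[[1]][[2]]  # Variable1, baseline vs timepoint 2
```

The lists returned by group_map() are unnamed, so elements are addressed by position, as in the original.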


Alternatively, you can do a traditional split:

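res = lapply(split(join.data, join.data$variable),
    function(df){
        # drop=TRUE skips the unused Timepoint level 0, which would otherwise give an empty group
        lapply(split(df, df$Timepoint, drop=TRUE), function(d){
           wilcox.test(d$base, d$value, conf.int= TRUE, paired=TRUE)
       })
    })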
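
Since the question mentions wanting the Hodges-Lehmann estimates, here is a rough base-R sketch of pulling them out of such a nested list into one table. The join.data stand-in, the summary.table name and its column names are my own choices for illustration, and warnings about ties or unachievable confidence levels are to be expected with only four subjects.

```r
# Example data from the question, built with typed columns (my restatement).
example.data <- data.frame(
  Subject   = rep(c("A", "B", "C", "D"), each = 4),
  Timepoint = factor(rep(0:3, times = 4), levels = 0:3),
  Variable1 = c(0, 1, 2, 0.5,  0, 1.2, 1.8, 0.3,  0, 0.8, 2.2, 1,  0, 6, 6.3, 2.2),
  Variable2 = c(24, 20, 18, 21,  22, 19, 20, 21,  24, 22, 16, 23,  33, 31, 27, 31),
  Variable3 = c(0, -1, -1.4, -0.6,  0, -2.2, -3, -1,  0, 0.1, -0.6, -0.2,  0, -0.4, -0.3, -0.1)
)

# Base-R stand-in for join.data: each later measurement next to its subject's baseline.
vars <- c("Variable1", "Variable2", "Variable3")
baseline <- example.data[example.data$Timepoint == 0, ]
later <- example.data[example.data$Timepoint != 0, ]
join.data <- do.call(rbind, lapply(vars, function(v) {
  data.frame(Subject = later$Subject, Timepoint = later$Timepoint, variable = v,
             value = later[[v]],
             base = baseline[[v]][match(later$Subject, baseline$Subject)])
}))

# Nested split/lapply, one htest per variable and later timepoint.
res <- lapply(split(join.data, join.data$variable), function(df) {
  lapply(split(df, df$Timepoint, drop = TRUE), function(d) {
    wilcox.test(d$base, d$value, conf.int = TRUE, paired = TRUE)
  })
})

# Flatten into one row per variable and timepoint.
summary.table <- do.call(rbind, lapply(names(res), function(v) {
  do.call(rbind, lapply(names(res[[v]]), function(tp) {
    test <- res[[v]][[tp]]
    data.frame(variable = v, Timepoint = tp,
               estimate = unname(test$estimate),  # the "(pseudo)median" of base - value
               conf.low = test$conf.int[1],
               conf.high = test$conf.int[2],
               p.value = test$p.value)
  }))
}))
summary.table
```

Note the sign convention: because the call is wilcox.test(d$base, d$value, ...), the estimate describes baseline minus the later timepoint.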

Answer 1: (score: 0)

Since wilcox.test is not vectorized, you cannot do this without a loop. However, you can still do better than using assign and eval. This is more R-ish:

library(reshape2)
#long format is better:
example.data <- melt(example.data, id.vars=c("Subject", "Timepoint"))

library(plyr)
#split-apply-combine
res <- dlply(example.data, .(Subject), 
             function(df) lapply(unique(df[df$Timepoint!="0", "Timepoint"]),
                                 function(i, DF) {
                                   wilcox.test(DF[DF$Timepoint=="0", "value"],
                                               DF[DF$Timepoint==i, "value"],
                                               conf.int=FALSE, paired=TRUE)                                            
                                 }, DF=df))

Note that I set conf.int=FALSE to avoid errors from wilcox.test, which are probably caused by the limited data.

You can access the second test for subject B like this:

res[["B"]][[2]]
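
The call above splits by Subject, so each test pairs the three variables' values within one subject. If the aim is instead the per-variable comparison across subjects that the question describes, the same split-apply-combine pattern could split on variable. A sketch under that assumption; long.data, res.by.variable and the setNames() labelling are my additions:

```r
library(reshape2)
library(plyr)

# Example data from the question, built with typed columns (my restatement).
example.data <- data.frame(
  Subject   = rep(c("A", "B", "C", "D"), each = 4),
  Timepoint = factor(rep(0:3, times = 4), levels = 0:3),
  Variable1 = c(0, 1, 2, 0.5,  0, 1.2, 1.8, 0.3,  0, 0.8, 2.2, 1,  0, 6, 6.3, 2.2),
  Variable2 = c(24, 20, 18, 21,  22, 19, 20, 21,  24, 22, 16, 23,  33, 31, 27, 31),
  Variable3 = c(0, -1, -1.4, -0.6,  0, -2.2, -3, -1,  0, 0.1, -0.6, -0.2,  0, -0.4, -0.3, -0.1)
)
long.data <- melt(example.data, id.vars = c("Subject", "Timepoint"))

# One list element per variable, each holding one paired test per later timepoint.
# Pairing relies on the rows being in the same subject order at every timepoint.
res.by.variable <- dlply(long.data, .(variable), function(df) {
  later <- setdiff(levels(df$Timepoint), "0")
  tests <- lapply(later, function(i) {
    wilcox.test(df[df$Timepoint == "0", "value"],
                df[df$Timepoint == i, "value"],
                conf.int = FALSE, paired = TRUE)
  })
  setNames(tests, later)
})

res.by.variable[["Variable1"]][["2"]]  # Variable1, baseline vs timepoint 2
```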