Monte Carlo permutation estimate of the p-value of the F-statistic in ANOVA

Time: 2016-10-31 04:39:09

Tags: r permutation lm anova p-value

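I am trying to run a permutation test for the ANOVA of $(y_1, \ldots, y_N)$ with group identifier g. I am supposed to use $\frac{1}{g-1}\sum_{j=1}^{g}(\hat\mu_j - \hat\mu)^2$ as the test statistic, where $\hat\mu_j$ is the sample mean of the j-th group and $\hat\mu = \frac{1}{g}\sum_{j=1}^{g}\hat\mu_j$ (in these formulas g is the number of groups).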

## data
y <- c(6.59491, 6.564573, 6.696147, 6.321552, 6.588449, 6.853832, 
6.370895, 6.441823, 6.227591, 6.675492, 6.255462, 6.919716, 6.837458, 
6.41374, 6.543782, 6.562947, 6.570343, 6.993634, 6.666261, 7.082319, 
7.210933, 6.547977, 6.330553, 6.309289, 6.913492, 6.597188, 6.247285, 
6.644366, 6.534671, 6.885325, 6.577568, 6.499041, 6.827574, 6.198853, 
6.965038, 6.58837, 6.498529, 6.449476, 6.544842, 6.496817, 6.499526, 
6.709674, 6.946934, 6.23884, 6.517018, 6.206692, 6.491935, 6.039925, 
6.166948, 6.160605, 6.428338, 6.564948, 6.446658, 6.566979, 7.17546, 
6.45031, 6.612242, 6.559798, 6.568082, 6.44193, 6.295211, 6.446384, 
6.658321, 6.369639, 6.066747, 6.345537, 6.727513, 6.677873, 6.889841, 
6.724438, 6.379956, 6.380779, 6.50096, 6.676555, 6.463236, 6.239091, 
6.797642, 6.608025)

## group
g <- structure(c(2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 2L, 
3L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 3L, 
3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 1L, 3L, 1L, 2L, 2L, 1L, 3L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 1L, 
2L), .Label = c("B1", "B2", "B3"), class = "factor")

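This is what I have so far, but it does not work when I change it to test the sample means instead of the F-statistic. I am pretty sure I need to change T.obs and T.perm to something like by(y, g, mean), but I think there is more that I am missing.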

n <- length(y) #sample size n
T.obs<- anova(lm(y ~ g))$F[1]   #Observed statistic 
n.perm <- 2000   # we will do 2000 permutations
T.perm <- rep(NA, n.perm)   #A vector to save the permuted statistics
for(i in 1:n.perm) {
  y.perm <- sample(y, n, replace = FALSE)   #permute data
  T.perm[i] <- anova(lm(y.perm ~ g))$F[1]   #Permuted statistic
  }
mean(T.perm >= T.obs)   #p-value

1 Answer:

Answer 0 (score: 2)

I don't really know what you mean by "it does not work". As far as I can see, it works fine, except that it is a bit slow.

set.seed(0)
n <- length(y) #sample size n
T.obs <- anova(lm(y ~ g))$F[1]   #Observed statistic 
n.perm <- 2000   # we will do 2000 permutations
T.perm <- rep(NA, n.perm)   #A vector to save the permuted statistics
for(i in 1:n.perm) {
  y.perm <- sample(y, n, replace = FALSE)   #permute data
  T.perm[i] <- anova(lm(y.perm ~ g))$F[1]   #Permuted statistic
  }
mean(T.perm >= T.obs)
# [1] 0.4915

This is very close to the theoretical value:

anova(lm(y ~ g))$Pr[1]
# [1] 0.4823429

So yes, you are doing everything correctly!

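Judging from the first paragraph of your question, we want to compute the F-statistic ourselves, and the following function does that. It has a switch, use_lm. If set to TRUE, it uses anova(lm(y ~ g)), as in your original code. The function is meant to make the computation of the F-statistic and the p-value transparent. In addition, the manual computation is about 15 times faster than calling lm and anova (which is to be expected...).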

fstat <- function (y, g, use_lm = FALSE) {
  if (!use_lm) {
    ## group mean (like we are fitting a linear model A: `y ~ g`)
    mu_g <- ave(y, g, FUN = mean)
    ## overall mean (like we are fitting a linear model B: `y ~ 1`)
    mu <- mean(y)
    ## RSS (residual sum of squares) for model A
    RSS_A <- drop(crossprod(y - mu_g))
    ## RSS (residual sum of squares) for model B
    RSS_B <- drop(crossprod(y - mu))
    ## increase of RSS from model A to model B
    RSS_inc <- RSS_B - RSS_A
    ## note, according to "partition of squares", we can also compute `RSS_inc` as
    ## RSS_inc <- drop(crossprod(mu_g - mu))
    ## `sigma2` (estimated residual variance) of model A
    sigma2 <- RSS_A / (length(y) - nlevels(g))
    ## F-statistic
    fstatistic <- ( RSS_inc / (nlevels(g) - 1) ) / sigma2
    ## p-value
    pval <- pf(fstatistic, nlevels(g) - 1, length(y) - nlevels(g), lower.tail = FALSE)
    ## return
    return(c(F = fstatistic, pval = pval))
    }
  else {
    anovalm <- anova(lm(y ~ g))
    return(c(F = anovalm$F[1L], pval = anovalm$Pr[1L]))
    }
  }
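
For reference, the manual branch of fstat can be written out as a formula. This is only a restatement of the code above; the symbols $n$ (for length(y)), $k$ (for nlevels(g)) and $g(i)$ (the group of observation $i$) are notation introduced here for illustration.

```latex
\mathrm{RSS}_A = \sum_{i=1}^{n} \bigl(y_i - \hat\mu_{g(i)}\bigr)^2, \qquad
\mathrm{RSS}_B = \sum_{i=1}^{n} \bigl(y_i - \bar y\bigr)^2, \qquad
F = \frac{(\mathrm{RSS}_B - \mathrm{RSS}_A)/(k-1)}{\mathrm{RSS}_A/(n-k)}
```

Under the usual ANOVA assumptions, $F$ is compared against an F distribution with $k-1$ and $n-k$ degrees of freedom, which is what the pf(...) call does.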

Let's first check that this function is valid:

F_obs <- fstat(y, g)
#        F      pval 
#0.7362340 0.4823429 

F_obs <- fstat(y, g, TRUE)
#        F      pval 
#0.7362340 0.4823429

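Don't be surprised that the result is not significant. Your data do not really suggest a significant difference between groups. Take a look at the boxplot: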

boxplot(y ~ g)    ## or use "factor" method of `plot` function: `plot(g, y)`

[Figure: boxplot of y by group g]

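Now we move on to the permutation. For this we write another function, perm. It is actually very simple, because we already have a well-defined fstat. All we need to do is use replicate to wrap sample + fstat.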

lm is actually quite slow:

library(microbenchmark)
microbenchmark(fstat(y, g), fstat(y, g, TRUE), times = 200)

#Unit: microseconds
#              expr     min      lq      mean  median      uq      max neval cld
#       fstat(y, g)  228.44  235.32  272.1204  275.34  290.20   388.84   200  a 
# fstat(y, g, TRUE) 4090.00 4136.72 4424.0470 4181.02 4450.12 16460.72   200   b

So we write this function using fstat(..., use_lm = FALSE):

perm <- function (y, g, n) replicate(n, fstat(sample(y), g)[[1L]])

Now let's run it with n = 2000 (setting the random seed for reproducibility):

set.seed(0)
F_perm <- perm(y, g, 2000)

## estimated p-value based on permutation
mean(F_perm > F_obs[[1L]])
# [1] 0.4915

Note how close it is to the theoretical p-value:

F_obs[[2L]]
# [1] 0.4823429
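
To put "close" into numbers, a rough Monte Carlo standard error for the permutation estimate can be sketched as follows. It treats the 2000 random permutations as independent draws, which is an assumption of this sketch; the names p_hat, n_perm and mc_se are introduced here for illustration only.

```r
## Rough Monte Carlo standard error of a permutation p-value estimate,
## assuming the permutations behave like independent Bernoulli draws
p_hat  <- 0.4915   # estimated p-value reported above
n_perm <- 2000     # number of permutations used above
mc_se  <- sqrt(p_hat * (1 - p_hat) / n_perm)
mc_se              # roughly 0.011
```

The gap between 0.4915 and 0.4823 is about 0.009, so the two values differ by less than one such standard error.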

As you can see, the result agrees with your original code.
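
If the statistic from the question (the variance of the group means) is wanted instead of the F-statistic, the same replicate pattern applies. The following is only a sketch under that reading of the question; mean_stat, perm_mean and the toy vectors y_demo and g_demo are hypothetical names and data introduced here, not part of the original code.

```r
## Statistic from the question: (1/(k-1)) * sum((mu_j - mu)^2), where mu_j are
## the group sample means and mu is the plain average of those group means
mean_stat <- function (y, g) {
  mu_j <- tapply(y, g, mean)   # one sample mean per group
  var(as.vector(mu_j))         # var() divides by (number of groups - 1)
}

## Permutation distribution of that statistic: shuffle y, keep g fixed
perm_mean <- function (y, g, n) replicate(n, mean_stat(sample(y), g))

## Toy data, for illustration only
y_demo <- c(1, 2, 3, 4, 5, 6)
g_demo <- factor(c("a", "a", "b", "b", "c", "c"))
T_demo <- mean_stat(y_demo, g_demo)        # group means are 1.5, 3.5, 5.5

set.seed(0)
T_perm_demo <- perm_mean(y_demo, g_demo, 200)
p_demo <- mean(T_perm_demo >= T_demo)      # permutation p-value estimate
```

With the data from the question, the calls would be mean_stat(y, g) and perm_mean(y, g, 2000). Because the groups there are unbalanced, this unweighted statistic is not a monotone function of the F-statistic, so its permutation p-value need not match the one above.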