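R: Computing a p-value for random distributions

Time: 2017-07-15 15:09:35

Tags: r statistics montecarlo p-value

I want to get a p-value for two sets of observations, x and y, drawn from random distributions, for example: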

> set.seed(0)
> x <- rnorm(1000, 3, 2)
> y <- rnorm(2000, 4, 3)

Or:

> set.seed(0)
> x <- rexp(50, 10)
> y <- rexp(100, 11)

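Suppose T is my test statistic, defined as mean(x) - mean(y), and the null hypothesis H0 is mean(x) - mean(y) = 0. The p-value is then defined as: p-value = P[T > T_observed | H0 holds].
I tried doing this: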

> z <- c(x,y) # if H0 holds then x and y are distributed with the same distribution
> f <- function(x) ecdf(z) # this will get the distribution of z (x and y)

Then, to compute the p-value, I tried this:

> T <- replicate(10000, mean(sample(z,1000,TRUE))-mean(sample(z,2000,TRUE))) # this is supposed to get the null distribution of mean(x) - mean(y)
> f(quantile(T,0.05)) # calculating the p-value for a significance of 5%

Obviously this does not seem to work. What am I missing?

1 answer:

Answer 0 (score: 1)

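Your intent is good: computing statistical significance through bootstrap sampling (also known as bootstrapping). However, mean(sample(z,1000,TRUE)) - mean(sample(z,2000,TRUE)) cannot work, because it is the mean of 1000 samples of z minus the mean of 2000 samples of z. Both samples come from the same pooled vector, so the result will be very close to 0 regardless of the true means of x and y.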
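A quick way to see this is to run the resampling and compare it with the observed difference. The sketch below reuses the normal example from the question; the names null_diffs and observed are illustrative, and the replicate count of 2000 is an arbitrary choice.

```r
set.seed(0)
x <- rnorm(1000, 3, 2)
y <- rnorm(2000, 4, 3)
z <- c(x, y)  # pooled sample, as in the question

# Both resamples are drawn from the same pooled vector z,
# so their difference in means is centred on 0 by construction
null_diffs <- replicate(2000, mean(sample(z, 1000, TRUE)) - mean(sample(z, 2000, TRUE)))

# The difference actually observed between the two groups
observed <- mean(x) - mean(y)

cat("mean of resampled differences:", mean(null_diffs), "\n")
cat("observed difference:", observed, "\n")
```

The resampled differences only describe what happens when both groups share one distribution; on their own they say nothing about x versus y until they are compared with the observed value.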

I suggest the following:

diff <- (sample(x, size = 2000, replace = TRUE) - sample(y, size = 2000, replace = TRUE))

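This draws 2000 samples (with replacement) from each of x and y and computes their differences. Of course, you can also add replication, as you suggested, to increase confidence. I prefer confidence intervals (CI) to p-values, because I think they are more informative (and equivalent to the p-value in statistical accuracy). The CI can then be computed from the mean and the standard error as follows: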

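# Standard error of the mean difference (scaled by the size of x, as in the original answer)
stderror <- sd(diff)/sqrt(length(x))
# Approximate 95% interval: mean +/- qnorm(0.975) standard errors, comparable to the t.test default
upperCI <- mean(diff) + qnorm(0.975)*stderror
lowerCI <- mean(diff) - qnorm(0.975)*stderror
cat(lowerCI, upperCI)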
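The replication mentioned above could look roughly like the following percentile bootstrap. This is a sketch of a different technique from the mean-and-standard-error calculation, not the method used in this answer: each group is resampled at its own size, and the interval is read off the quantiles of the resampled differences. The names boot_diffs and boot_ci and the count of 5000 replicates are assumptions for illustration.

```r
set.seed(0)
x <- rnorm(1000, 3, 2)
y <- rnorm(2000, 4, 3)

# Resample each group with replacement at its own size
# and record the difference in means for every replicate
boot_diffs <- replicate(5000, mean(sample(x, replace = TRUE)) - mean(sample(y, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap differences
boot_ci <- quantile(boot_diffs, c(0.025, 0.975))
print(boot_ci)
```

If this interval excludes 0, it points to the same conclusion as an interval built from the mean and standard error.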

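Since the CI does not include 0, the null hypothesis is rejected. Note that the result will be close to the CI given by a t-test in R (for your normal example):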

t <- t.test(x, y)
cat(t$conf.int)
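If a p-value is still wanted, one possible route is a permutation-style Monte Carlo test. The sketch below is not part of the approach above; it assumes, as the question does, that under H0 both groups share one distribution, so the group labels can be shuffled. It reports a two-sided p-value rather than the one-sided definition in the question, and the names t_obs, t_null and p_value are illustrative.

```r
set.seed(0)
x <- rnorm(1000, 3, 2)
y <- rnorm(2000, 4, 3)
z <- c(x, y)

# Observed test statistic
t_obs <- mean(x) - mean(y)

# Under H0 the labels are exchangeable: pick a random subset of z
# to play the role of x, use the rest as y, and recompute T
t_null <- replicate(5000, {
  idx <- sample(length(z), length(x))
  mean(z[idx]) - mean(z[-idx])
})

# Two-sided Monte Carlo p-value; the +1 terms keep the estimate above zero
p_value <- (sum(abs(t_null) >= abs(t_obs)) + 1) / (length(t_null) + 1)
cat("p-value:", p_value, "\n")
```

The key difference from the attempt in the question is the final step: the simulated null distribution is compared with the observed statistic, instead of being passed through the ECDF of the raw data.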