lm(bmi~height + weight + treatment, data = df)
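I am struggling with the following problem:
The model now needs to be run in a loop over sample sizes from 300 to 500 in steps of 10, with 1000 replicate samples at each sample size, storing the proportion of simulated experiments whose p-value is below 0.05. The goal is to estimate the power to detect a change of 0.5 in bmi between the two treatment groups at a significance level of
Once that is done, I need to create a plot that shows sample size on the x-axis and estimated power on the y-axis, and that marks, in a different colour, the smallest sample size that achieves an estimated power of 80%.
Any ideas on how to do this and where to go from here?
Thanks, Chris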
Answer 0 (score: 0)
I would do it like this:
library(dplyr)
library(ggplot2)
# first, encapsulate the steps required to generate one sample of data
# at a given sample size, run the model, and extract the treatment p-value
do_simulate <- function(n) {
  # use assumed data generating process to simulate data and add error
  data <- tibble(height = rnorm(n, 69, 0.1),
                 weight = rnorm(n, 197.8, 1.9),
                 treatment = sample(c(0, 1), n, replace = TRUE),
                 error = rnorm(n, sd = 1.75),
                 bmi = 703 * weight / height^2 + 0.5 * treatment + error)
  # model the data
  mdl <- lm(bmi ~ height + weight + treatment, data = data)
  # extract p-value for treatment effect
  summary(mdl)[["coefficients"]]["treatment", "Pr(>|t|)"]
}
# second, wrap that single simulation in a replicate so that you can perform
# many simulations at a given sample size and estimate power as the proportion
# of simulations that achieve a significant p-value
simulate_power <- function(n, alpha = 0.05, r = 1000) {
  p_values <- replicate(r, do_simulate(n))
  power <- mean(p_values < alpha)
  return(c(n, power))
}
# third, estimate power at each of your desired
# sample sizes and restructure that data for ggplot
mx <- vapply(seq(300, 500, 10), simulate_power, numeric(2))
plot_data <- tibble(n = mx[1, ],
                    power = mx[2, ])
# fourth, make a note of the minimum sample size to achieve your desired power
# (the smallest n whose estimated power is at least 0.80)
min_n <- plot_data %>%
  filter(power >= 0.80) %>%
  slice_min(n, n = 1) %>%
  pull(n)
# finally, construct the plot
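ggplot(plot_data, aes(x = n, y = power)) +
  geom_smooth(method = "loess", se = FALSE) +
  # mark the minimum sample size in a different colour
  geom_vline(xintercept = min_n, colour = "red") +
  labs(x = "sample size", y = "estimated power")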
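As a sanity check on the simulated curve, it may help to compare it with a quick analytic approximation. The sketch below is not the same model: it treats the comparison as a plain two-sample t-test with equal group sizes, using the effect (0.5) and error SD (1.75) from the data-generating process above, and ignores the height and weight covariates and the random group assignment. The names `approx`, `approx_total_n`, and `power_se` are just illustrative helpers, not part of the code above.

```r
# Rough analytic cross-check (an approximation, assuming a balanced
# two-sample t-test with difference 0.5 and SD 1.75)
approx <- power.t.test(delta = 0.5, sd = 1.75, sig.level = 0.05, power = 0.80)

# power.t.test() reports n per group, so double it for the total sample size
approx_total_n <- 2 * ceiling(approx$n)
approx_total_n

# Monte Carlo standard error of a power estimate based on r simulations:
# each estimate is a binomial proportion, so se = sqrt(p * (1 - p) / r)
power_se <- function(power, r = 1000) sqrt(power * (1 - power) / r)
power_se(0.80)
```

If the simulated minimum sample size lands far from `approx_total_n`, that is a hint to recheck the simulation. With r = 1000 the standard error near 80% power is about 0.013, so individual points can wobble by a couple of percentage points around the true curve, which is why smoothing the plot (or increasing r) is worthwhile before reading off the minimum n.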