Question

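I have simulated a linear model 1000 times using randomly generated height and weight values, and randomly assigned each participant to treated or untreated (a factor coded 1 and 0). Suppose the model is: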

lm(bmi~height + weight + treatment, data = df)

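I am struggling with the following:

The model now needs to be run over sample sizes from 300 to 500 in steps of 10, with 1000 replicate samples at each size, storing the proportion of simulated experiments whose p-value is below 0.05, so as to estimate the power to detect a change in bmi of 0.5 between the two treatment groups at that significance level.

Once that is done, I need to create a graph that best shows sample size on the x-axis and estimated power on the y-axis, and that marks the smallest sample size achieving an estimated power of 80% in a different color.

Any ideas on how to do this and where to go from here?

Thanks, Chris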

Answer 1

I would do it like this:

library(dplyr)
library(ggplot2)

# first, encapsulate the steps required to generate one sample of data
# at a given sample size, run the model, and extract the treatment p-value
do_simulate <- function(n) {
  # use assumed data generating process to simulate data and add error
  data <- tibble(height = rnorm(n, 69, 0.1), 
                 weight = rnorm(n, 197.8, 1.9), 
                 treatment = sample(c(0, 1), n, replace = TRUE),
                 error = rnorm(n, sd = 1.75),
                 bmi = 703 * weight / height^2 + 0.5 * treatment + error)

  # model the data
  mdl <- lm(bmi ~ height + weight + treatment, data = data)

  # extract p-value for treatment effect
  summary(mdl)[["coefficients"]]["treatment", "Pr(>|t|)"]
}

# second, wrap that single simulation in a replicate so that you can perform
# many simulations at a given sample size and estimate power as the proportion
# of simulations that achieve a significant p-value
simulate_power <- function(n, alpha = 0.05, r = 1000) {
  p_values <- replicate(r, do_simulate(n))
  power <- mean(p_values < alpha)
  return(c(n, power))
}

# third, estimate power at each of your desired 
# sample sizes and restructure that data for ggplot
mx <- vapply(seq(300, 500, 10), simulate_power, numeric(2))
plot_data <- tibble(n = mx[1, ], 
                    power = mx[2, ])

# fourth, make a note of the minimum sample size to achieve your desired power
# (>= so that exactly 80% counts; slice_min() replaces the superseded top_n())
min_n <- plot_data %>% 
  filter(power >= 0.80) %>% 
  slice_min(n, n = 1) %>% 
  pull(n)

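# finally, construct the plot, marking the minimum sample size in a different color
ggplot(plot_data, aes(x = n, y = power)) + 
  geom_point() + 
  geom_smooth(method = "loess", se = FALSE) + 
  geom_hline(yintercept = 0.80, linetype = "dashed") + 
  geom_vline(xintercept = min_n, colour = "red") + 
  labs(x = "Sample size", y = "Estimated power")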
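
As a rough sanity check on the simulated curve, the power can also be approximated analytically. The following is a minimal sketch, not part of the approach above: it treats the problem as a two-sample t-test via `power.t.test()` from base R's stats package, and it assumes the two groups are of equal size (n / 2 each) and that the residual SD equals the 1.75 used for `error` in `do_simulate()`. The names `total_n`, `analytic` and `analytic_min_n` are made up for illustration.

```r
# Analytic approximation to the simulated power curve.
# Assumptions (not from the original answer): equal group sizes of n / 2,
# residual SD of 1.75, and no loss of precision from the covariates.
total_n <- seq(300, 500, 10)

analytic_power <- vapply(
  total_n,
  function(n) {
    # power.t.test() expects the number of observations PER GROUP
    power.t.test(n = n / 2, delta = 0.5, sd = 1.75, sig.level = 0.05)$power
  },
  numeric(1)
)

analytic <- data.frame(n = total_n, power = analytic_power)

# smallest total sample size on the grid whose analytic power reaches 80%
analytic_min_n <- min(analytic$n[analytic$power >= 0.80])
```

If the simulated `min_n` lands far from `analytic_min_n`, that probably points to a bug or to simulation noise rather than a real effect: with r = 1000 replicates, the Monte Carlo standard error of a power estimate near 0.8 is about sqrt(0.8 * 0.2 / 1000), roughly 0.013, so the simulated curve can cross 80% a step or two earlier or later than the analytic one.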

Plotting a graph of sample size against estimated power
