Confidence interval calculation with weights in R (dplyr)

Time: 2021-05-19 03:30:38

Tags: r dplyr statistics data-wrangling weighted

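I have the following problem: I am trying to compute 95% confidence intervals (lower and upper bounds) for categorical data using weights. I have a "response" variable coded 0 or 1, and two factors (old vs. new, and 14 different colors). A weight has been collected for each participant. Here is my dataset: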

dd <- data.frame(
  weight = c(0.0037, 0.0016, 0.0347, 0.3421, 0.1047, 0.0065, 0.0153, 0.2856, 0.0032, 0.0321, 0.0321, 0.0321, 0.0321, 0.0321),
  factor1 = factor(c("New", "Old", "New", "New", "New", "Old", "New", "Old", "New", "Old", "Old", "New", "Old", "Old")),
  factor2 = factor(c("Red", "Yellow", "Green", "Orange", "Brown", "Blue", "Black", "White", "Purple", "Gray", "Pink", "Navy", "Tan", "Fuscia")),
  question = factor(c("Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8", "Q9", "Q10", "Q11", "Q12", "Q13", "Q14")),
  response = c(0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L)
)

I want to produce a bar chart of the weighted percentages for each question, for both factors, along these lines:

question    factor1    factor2      Weighted_pct   Lower 95%CI      Upper 95%CI
Q1           New         Red        21.3%          21.1%            22.3%
Q1           Old         Red        20.4%          20.1%            21%
Q1           New         Red        22.6%          20.5%            22.3%
Q1           Old         Red        19.6%          20.7%            22.3%
Q1           New         Red        11.8%          20.9%            22.3%
Q1           Old         Red        18.3%          20.4%            22.3%
Q1           New         Red        27.4%          20.6%            22.3%
Q1           Old         Red        6.3%           11%              22.3%
Q1           New         Red        32.0%          20.3%            22.3%
Q1           Old         Red        7.7%           20.3%            22.3%
Q1           New         Red        9.3%           20.3%            22.3%
Q1           Old         Red        15.3%          20.3%            22.3%
Q1           New         Red        22.1%          20.3%            22.3%
Q1           Old         Yellow     3.3%           20.3%            22.3%

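The numbers are arbitrary, but I hope this makes sense. I can get it to work with only one factor, but as soon as I introduce the second factor my code breaks.

My current code: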
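For reference, here is a minimal sketch of the bar chart described above. It assumes the `ggplot2` package is available and groups by `factor1` only, for brevity. The names `plot_data` and `p` are illustrative, and only the `weight`, `factor1` and `response` columns of `dd` are reproduced. Once lower and upper bound columns exist, `geom_errorbar()` could be layered on top.

```r
library(dplyr)
library(ggplot2)

# Subset of the columns of `dd` from the question
dd <- data.frame(
  weight = c(0.0037, 0.0016, 0.0347, 0.3421, 0.1047, 0.0065, 0.0153,
             0.2856, 0.0032, 0.0321, 0.0321, 0.0321, 0.0321, 0.0321),
  factor1 = factor(c("New", "Old", "New", "New", "New", "Old", "New",
                     "Old", "New", "Old", "Old", "New", "Old", "Old")),
  response = c(0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L)
)

# Weighted share of response == 1 within each level of factor1
plot_data <- dd %>%
  group_by(factor1) %>%
  summarise(Weighted_pct = weighted.mean(response, weight), .groups = "drop")

# One bar per group, with the y axis labelled as a percentage
p <- ggplot(plot_data, aes(x = factor1, y = Weighted_pct)) +
  geom_col() +
  scale_y_continuous(labels = function(x) paste0(round(100 * x, 1), "%")) +
  labs(x = "factor1", y = "Weighted percentage")
```

Calling `print(p)` draws the chart. Adding `factor2` or `question` to `group_by()` and using `facet_wrap()` would be one way to show both factors.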

library(dplyr)

# `dd` is already in long format (one row per response), so no reshaping is needed.
new_data <- dd %>%
  dplyr::select(question, factor1, factor2, response, weight) %>%
  dplyr::group_by(question, factor1, factor2) %>%
  dplyr::summarise(
    Unweighted_N   = dplyr::n(),
    Unweighted_pct = sum(response) / Unweighted_N,
    Effective_N    = sum(weight),
    Weighted_pct   = sum(response * weight) / Effective_N,
    N_yes          = round(Weighted_pct * Effective_N),
    # binom.test() needs a trial count of at least 1; return NA otherwise
    Low_95CI = if (round(Effective_N) >= 1) binom.test(N_yes, round(Effective_N), conf.level = 0.95)$conf.int[1] else NA_real_,
    Up_95CI  = if (round(Effective_N) >= 1) binom.test(N_yes, round(Effective_N), conf.level = 0.95)$conf.int[2] else NA_real_,
    .groups = "drop"
  ) %>%
  dplyr::select(question, factor1, factor2, Weighted_pct, Low_95CI, Up_95CI)
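One possible alternative to rounding the sum of the weights is sketched below. This is an assumption on my part, not something established in the question: it uses the Kish effective sample size, n_eff = (sum of w)^2 / sum of w^2, with a normal-approximation (Wald) interval. The helper name `weighted_ci` is hypothetical. The example groups by `factor1` only, because the sample `dd` has a single row per `question`/`factor2` combination, so finer groups would give a proportion of exactly 0 or 1 with a zero-width interval.

```r
library(dplyr)

# Subset of the columns of `dd` from the question
dd <- data.frame(
  weight = c(0.0037, 0.0016, 0.0347, 0.3421, 0.1047, 0.0065, 0.0153,
             0.2856, 0.0032, 0.0321, 0.0321, 0.0321, 0.0321, 0.0321),
  factor1 = factor(c("New", "Old", "New", "New", "New", "Old", "New",
                     "Old", "New", "Old", "Old", "New", "Old", "Old")),
  response = c(0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L)
)

# Hypothetical helper: weighted proportion with an approximate CI
weighted_ci <- function(response, weight, conf.level = 0.95) {
  p     <- sum(response * weight) / sum(weight)  # weighted proportion
  n_eff <- sum(weight)^2 / sum(weight^2)         # Kish effective sample size
  z     <- qnorm(1 - (1 - conf.level) / 2)       # normal quantile, about 1.96 for 95%
  se    <- sqrt(p * (1 - p) / n_eff)             # Wald standard error
  data.frame(
    Weighted_pct = p,
    Low_95CI     = max(0, p - z * se),           # clamp the bounds to [0, 1]
    Up_95CI      = min(1, p + z * se)
  )
}

# summarise() unpacks the one-row data frame into three columns per group
res <- dd %>%
  group_by(factor1) %>%
  summarise(weighted_ci(response, weight), .groups = "drop")
```

With real data that has several respondents per cell, `factor2` and `question` can be added to `group_by()`. The Wald interval is known to behave poorly for proportions near 0 or 1 and for small effective sample sizes, so a survey-aware method may be preferable there.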

0 answers:

No answers