Question

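I'm a bit confused about testing proportions in R. Maybe this is obvious, but prop.test behaves differently from what I expected, and I'd like to know why and what to use instead. The application is a dataset of protest events.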

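I built the following dataset (its dput output is given further down), where names is the type of event for which the percentages are computed. The first row, for example, covers events organized after an election (aft_elect_prt). Within each type, I count the events that were (past_pm1) or were not (past_pm0) associated with the former prime minister's group. total is the number of events of that type in the dataset, share0 is past_pm0 / total, and share1 is past_pm1 / total.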

I want to test the null hypothesis that there is no statistically significant difference between the two shares.

After reading the documentation for prop.test, I set it up like this:

prop.test(x = as.numeric(subseted$past_pm1),
          n = subseted$total,
          p = subseted$share0,
          alternative = "two.sided",
          conf.level = 0.95)

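However, this clearly does not test what I want. It also yields only a single p-value, whereas I want one p-value per row. Which function or test should I use instead?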

Here is the dput output for the dataset:

structure(list(names = c("aft_elect_prt", "ANSM", "bef_elect_prt", 
"big_event", "conf_viol", "coorg", "demo_petition", "economic", 
"NSM", "political"), past_pm0 = c(49.66101, 78.54659, 65.57226, 
49.67205, 39.641924, 69.52704, 286.8565, 68.53114, 100.00488, 
117.97347), past_pm1 = c(33.796, 14.30855, 34.40608, 31.14065, 
9.017051, 30.64896, 120.4515, 20.86095, 19.00836, 71.24065), 
    total = c(83.4570157825947, 92.8551414906979, 99.9783371835947, 
    80.8127028793097, 48.6589741557837, 100.176002234221, 407.307988807559, 
    89.3920872062445, 119.013234868646, 189.21411934495), share0 = c(0.595048954654295, 
    0.8459045857775, 0.655864678761227, 0.614656461548911, 0.814688856223823, 
    0.69404885850245, 0.704274180429913, 0.766635416419863, 0.84028368870382, 
    0.623491895892433), share1 = c(0.404950976057405, 0.154095398168484, 
    0.344135349408928, 0.385343502821669, 0.185311161125829, 
    0.305951119194593, 0.295725847049147, 0.233364614832964, 
    0.159716354412006, 0.376508107569518)), row.names = c(NA, 
-10L), class = "data.frame")

I'd appreciate any hints!

Answer 1

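The prop.test function is not vectorized: each call performs a single test. You need to map the function explicitly over the rows of the data frame, using either base R or tidyverse functions. In the tidyverse, purrr::pmap applies a function to each row of a data frame: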

library(purrr)
# One test per row; `...` absorbs the columns the function does not use
prop_tests <- pmap(subseted, function(past_pm1, total, ...) prop.test(x = past_pm1, n = total, alternative = "two.sided", conf.level = 0.95))

This returns a list of test objects with one element per row of the data frame.
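For readers who prefer the base R route mentioned above, here is a minimal sketch of the same per-row mapping using Map. The data frame subseted_small is a hypothetical three-row excerpt built from the values in the question's dput output, and the names base_tests and base_p are chosen only for illustration. Because share0 and share1 sum to 1 in each row, testing share1 against prop.test's default null of p = 0.5 amounts to testing whether the two shares are equal.

```r
# Three-row excerpt of the question's data (values copied from the dput output);
# subseted_small is a name introduced here for illustration only
subseted_small <- data.frame(
  names    = c("aft_elect_prt", "ANSM", "bef_elect_prt"),
  past_pm1 = c(33.796, 14.30855, 34.40608),
  total    = c(83.4570157825947, 92.8551414906979, 99.9783371835947)
)

# Map() calls the function once per (x, n) pair, i.e. once per row,
# and returns a list of htest objects
base_tests <- Map(
  function(x, n) prop.test(x = x, n = n, alternative = "two.sided", conf.level = 0.95),
  subseted_small$past_pm1,
  subseted_small$total
)

# Pull the p-value out of each test object
base_p <- vapply(base_tests, function(test) test$p.value, numeric(1))
print(base_p)
```

Map always returns a list, so no simplification argument is needed; vapply then turns the per-test p-values into a plain numeric vector.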

To extract the output from the list as a data frame, you can use purrr::map_dfr. Here is an example with a few summary statistics:

map_dfr(prop_tests, ~ data.frame(p = .x$p.value, estimate = .x$estimate, confint_min = .x$conf.int[1], confint_max = .x$conf.int[2]))

Output:

   p            estimate   confint_min confint_max
1  1.037002e-01 0.4049510  0.30058839   0.5181435
2  5.288024e-11 0.1540954  0.09038891   0.2472255
3  2.553365e-03 0.3441353  0.25382739   0.4465844
4  5.115352e-02 0.3853435  0.28114139   0.5005436
5  2.167205e-05 0.1853112  0.09330970   0.3274424
6  1.540307e-04 0.3059511  0.21985913   0.4071514
7  2.490965e-16 0.2957258  0.25231710   0.3430569
8  7.967215e-07 0.2333646  0.15312169   0.3369412
9  2.252910e-13 0.1597164  0.10130585   0.2407265
10 8.851678e-04 0.3765081  0.30807997   0.4500369
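As a rough illustration of why a single prop.test call with vector arguments gives only one p-value: when x, n, and p are vectors of length k, prop.test treats them as k groups and runs one joint chi-squared test of all the proportions against the given p. The sketch below uses the first three rows of the question's data; the variable names x, n, share0, and joint are chosen here for illustration.

```r
# First three rows of the question's data (values copied from the dput output)
x      <- c(33.796, 14.30855, 34.40608)
n      <- c(83.4570157825947, 92.8551414906979, 99.9783371835947)
share0 <- c(0.595048954654295, 0.8459045857775, 0.655864678761227)

# A single call with vector arguments: one joint test across the three groups
joint <- prop.test(x = x, n = n, p = share0)

length(joint$p.value)   # number of p-values returned by the joint test
length(joint$estimate)  # number of estimated proportions (one per group)
```

This is a different hypothesis from the row-by-row comparison the question asks about, which is why mapping prop.test over the rows is needed.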

Answer 2

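The base function Vectorize can vectorize a function that does not accept vectors. Note the SIMPLIFY argument: with the default, TRUE, the result is simplified to a vector, array, or matrix where possible. Here it makes more sense to keep it as a list.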

vprop.test <- Vectorize(prop.test, SIMPLIFY = FALSE)
ans <- with(subseted, vprop.test(x = past_pm1, n = total))

To extract just the p-values (in the comments they were all 0) and append them to the original data frame:

subseted$p.value <- sapply(ans, "[[", "p.value")
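To make the note about SIMPLIFY concrete, here is a small sketch comparing the two settings on the first three rows of the question's data. The names x, n, simplified, and kept are introduced only for this illustration.

```r
# First three rows of the question's data (values copied from the dput output)
x <- c(33.796, 14.30855, 34.40608)
n <- c(83.4570157825947, 92.8551414906979, 99.9783371835947)

# Default SIMPLIFY = TRUE: the test objects are collapsed into a list-matrix,
# with one row per component of the test object and one column per test
simplified <- Vectorize(prop.test)(x = x, n = n)
is.matrix(simplified)

# SIMPLIFY = FALSE: a plain list whose elements are still htest objects
kept <- Vectorize(prop.test, SIMPLIFY = FALSE)(x = x, n = n)
class(kept[[1]])

# Both forms hold the same p-values, but the list form keeps each test intact
p_from_matrix <- unlist(simplified["p.value", ])
p_from_list   <- sapply(kept, "[[", "p.value")
all.equal(p_from_matrix, p_from_list)
```

With the list form, each element can still be printed or inspected as an ordinary test result, which is harder once the objects have been collapsed into a matrix.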
