Contingency test on data with repeated measures

Time: 2017-02-17 03:25:02

Tags: r

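I'm hoping someone can give me some guidance or help. I have a data set consisting of a population of individuals who were tested for an infection over three years. Some individuals, but not all, were sampled in more than one year (so they represent repeated measures). I want to determine whether the prevalence of the infection changed over time, but I'm having trouble identifying an appropriate test. A simple contingency test violates the assumption of independence, because some individuals are repeated across years. I don't think a Cochran-Mantel-Haenszel test or a McNemar chi-square test is appropriate, but feel free to correct me if I'm wrong. Here is the data set I'm working with. The "AnID" variable is a factor identifying the individual person (so if an individual was sampled in multiple years, you'll see that number repeated 2 or 3 times).

I think one viable option would be to randomly resample the data many times (without replacement), including each individual only once per resample, and run a contingency test across years each time. If the null hypothesis of no difference is rejected at least 95% of the time, then I could reliably claim that there is a difference. I'm not proficient enough in R to write my own code for this. Thanks in advance for any help you can provide.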

  

# Output of dput(Example):
structure(list(AnID = structure(c(37L, 37L, 45L, 45L, 45L, 55L,
55L, 62L, 62L, 68L, 68L, 1L, 1L, 2L, 3L, 3L, 4L, 9L, 9L, 18L,
18L, 18L, 19L, 19L, 19L, 20L, 20L, 21L, 22L, 22L, 23L, 24L, 24L,
24L, 25L, 25L, 25L, 26L, 27L, 28L, 28L, 28L, 29L, 29L, 29L, 30L,
31L, 32L, 32L, 33L, 34L, 35L, 36L, 38L, 38L, 39L, 39L, 40L, 41L,
41L, 42L, 42L, 42L, 43L, 43L, 43L, 44L, 46L, 46L, 46L, 47L, 47L,
47L, 48L, 48L, 48L, 49L, 49L, 49L, 50L, 51L, 52L, 52L, 53L, 53L,
54L, 54L, 56L, 56L, 57L, 57L, 57L, 58L, 59L, 60L, 61L, 63L, 64L,
65L, 66L, 67L, 69L, 70L, 71L, 72L, 73L, 74L, 74L, 5L, 6L, 7L,
8L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L), .Label = c("10",
"11", "12", "13", "136", "137", "138", "139", "14", "140", "141",
"142", "143", "144", "145", "146", "147", "26", "27", "28", "29",
"30", "31", "37", "38", "39", "40", "41", "42", "43", "44", "45",
"46", "47", "48", "49", "5", "50", "51", "52", "53", "57", "58",
"59", "6", "60", "61", "62", "63", "64", "65", "66", "67", "69",
"7", "70", "71", "72", "75", "76", "77", "8", "82", "83", "84",
"85", "86", "9", "90", "94", "95", "96", "97", "98"), class = "factor"),
    year = structure(c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 3L, 2L,
    3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L,
    2L, 3L, 2L, 1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 2L, 1L,
    2L, 3L, 1L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L,
    2L, 3L, 2L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 1L, 2L, 3L,
    1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 2L, 1L, 2L, 1L, 2L,
    1L, 2L, 1L, 2L, 1L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
    3L, 3L, 3L, 3L, 3L), .Label = c("2012", "2013", "2014"), class = "factor"),
    value = c("Pos", "Pos", "Pos", "Pos", "Pos", "Neg", "Neg",
    "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", "Neg", "Neg", "Pos",
    "Neg", "Pos", "Pos", "Neg", "Pos", "Pos", "Neg", "Neg", "Neg",
    "Neg", "Neg", "Neg", "Pos", "Pos", "Pos", "Pos", "Pos", "Pos",
    "Neg", "Pos", "Pos", "Neg", "Neg", "Neg", "Neg", "Pos", "Pos",
    "Pos", "Pos", "Neg", "Neg", "Pos", "Pos", "Neg", "Pos", "Neg",
    "Pos", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Pos",
    "Pos", "Pos", "Neg", "Pos", "Pos", "Neg", "Neg", "Pos", "Neg",
    "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Pos", "Pos",
    "Neg", "Neg", "Neg", "Pos", "Pos", "Pos", "Pos", "Pos", "Neg",
    "Neg", "Neg", "Pos", "Pos", "Neg", "Neg", "Neg", "Neg", "Neg",
    "Neg", "Pos", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg",
    "Pos", "Pos", "Neg", "Neg", "Neg", "Pos", "Pos", "Pos", "Neg",
    "Neg", "Pos", "Neg", "Pos", "Neg")), .Names = c("AnID", "year",
"value"), row.names = 187:306, class = "data.frame")

1 answer:

Answer 0 (score: 1)

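Note that an experiment/test design needs a valid sample size calculation in advance, in order to maximize the probability of detecting a statistically significant difference if one exists. (For details see https://en.wikipedia.org/wiki/Sample_size_determination and https://en.wikipedia.org/wiki/Statistical_power.)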
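As a rough illustration of such a calculation, base R's `power.prop.test()` gives the per-group sample size needed to detect a difference between two proportions. The proportions, significance level and power below are hypothetical placeholders, not values taken from the question's data.

```r
# Hypothetical planning values (assumptions, not estimates from the data):
# prevalence of 25% in one year versus 50% in another,
# 5% significance level, 80% power.
pw <- power.prop.test(p1 = 0.25, p2 = 0.50,
                      sig.level = 0.05, power = 0.80)

# Required number of individuals per year (round up in practice)
ceiling(pw$n)
```

This only covers a two-group comparison of independent samples, so it is a starting point rather than a full design for three years with repeated individuals.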

If all of your users had been measured before/after (i.e. test/control), you could use McNemar's test to compare the proportions (see https://en.wikipedia.org/wiki/McNemar's_test).
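For reference, a minimal sketch of what that would look like with base R's `mcnemar.test()`, assuming every individual had been tested in exactly two years. The paired counts are made up for illustration and do not come from the question's data.

```r
# Hypothetical paired results for the same individuals in two years
# (rows: result in the first year, columns: result in the second year).
paired <- matrix(c(20,  5,
                   12, 23),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(first_year  = c("Pos", "Neg"),
                                 second_year = c("Pos", "Neg")))

# McNemar's test only uses the discordant cells (Pos->Neg and Neg->Pos)
res <- mcnemar.test(paired)
res$p.value
```

Only individuals with both measurements can enter this table, which is why it does not fit data where many people were sampled once.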

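However, not all users have repeated measurements, so I chose to randomly pick one year for each user. That way I get 3 independent samples of values.

Assume dt is your data set...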

library(dplyr)

set.seed(1)   # fix the seed so the random sampling is reproducible

dt %>%                      
  mutate(Pos = ifelse(value == "Pos", 1, 0)) %>%   # create a binary variable to flag positives
  group_by(AnID) %>%                               # for each user
  sample_n(1) %>%                                  # get one row/value randomly
  group_by(year) %>%                               # for each year
  summarise(N = n(),                               # get number of users
            N_Pos = sum(Pos),                      # get number of positive users
            Prc_Pos = mean(Pos)) %>%               # get percentage of positives
  print() -> tbl1                                  # print and save it

# # A tibble: 3 × 4
#     year     N N_Pos   Prc_Pos
#   <fctr> <int> <dbl>     <dbl>
# 1   2012    23     6 0.2608696
# 2   2013    27     9 0.3333333
# 3   2014    24    13 0.5416667

After looking at the percentages for each year above, you can run a comparison of proportions.

# run the statistical comparison of proportions
prop.test(tbl1$N_Pos, tbl1$N)

# 3-sample test for equality of proportions without continuity correction
# 
# data:  tbl1$N_Pos out of tbl1$N
# X-squared = 4.3038, df = 2, p-value = 0.1163
# alternative hypothesis: two.sided
# sample estimates:
#    prop 1    prop 2    prop 3 
# 0.2608696 0.3333333 0.5416667 

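The p-value here (0.1163) is above 0.05, so for this particular random draw we have no evidence that the years differ in the probability of a positive result.

If you had found a difference, you could run pairwise comparisons between the years.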

# run pairwise comparisons 
pairwise.prop.test(tbl1$N_Pos, tbl1$N)

# Pairwise comparisons using Pairwise comparison of proportions 
# 
# data:  tbl1$N_Pos out of tbl1$N 
# 
# 1    2   
# 2 0.80 -   
# 3 0.29 0.45
# 
# P value adjustment method: holm 

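The output here is 3 p-values (3 pairwise comparisons). As expected, all of them indicate no difference between those years.

You can wrap the above process in a function and run N simulations. Then check in how many of the simulations you get a statistically significant result.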
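A minimal sketch of that idea in base R follows. It runs on a made-up stand-in for dt with the same column names (AnID, year, value); the function name `one_draw_p` and the number of simulations are arbitrary choices for illustration. Replace the toy data with your own data frame.

```r
# Toy stand-in for the real data: same column names, made-up values.
set.seed(1)
dt <- expand.grid(AnID = 1:60,
                  year = c("2012", "2013", "2014"),
                  stringsAsFactors = FALSE)
dt <- dt[runif(nrow(dt)) < 0.6, ]                  # not everyone is sampled every year
dt$value <- ifelse(runif(nrow(dt)) < 0.4, "Pos", "Neg")

# One simulation: keep one random row per person, then compare the years.
one_draw_p <- function(data) {
  # row indices belonging to each individual
  rows_by_id <- split(seq_len(nrow(data)), data$AnID, drop = TRUE)
  # sample.int() avoids the sample() pitfall when a person has a single row
  keep <- vapply(rows_by_id,
                 function(rows) rows[sample.int(length(rows), 1)],
                 integer(1))
  sub <- data[keep, ]
  # years x (Pos, Neg) table; prop.test() reads column 1 as the successes
  counts <- table(sub$year, factor(sub$value, levels = c("Pos", "Neg")))
  prop.test(counts)$p.value
}

n_sim <- 200
p_values <- replicate(n_sim, one_draw_p(dt))

# Share of simulations with a significant difference at the 5% level
mean(p_values < 0.05)
```

The draws all come from the same individuals and are therefore strongly correlated, so it is probably safer to read the resulting share as a descriptive summary of how stable the conclusion is than as a formal test.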