How to calculate population prevalence in R using data.table

Asked: 2014-03-07 21:45:15

Tags: r aggregate data.table

I am hoping someone can point me in the right direction on how to calculate population prevalence without having to do it in Excel.

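I am currently working on a project that requires me to find the yearly prevalence of fruit eating, stratified by sex and age. To complicate matters, a unique id can eat fruit from different fruit groups, but two of the four fruits belong to the same fruit group, so within that group each id may be counted only once.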
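A minimal sketch of the "count each id only once" rule, using a small made-up table (`toy` and its values are hypothetical; the column names mirror the data set in this question). It assumes a data.table version that provides `uniqueN()`.

```r
library(data.table)

# Hypothetical toy data: id "A" eats both banana and kiwi,
# which belong to the same fruit group, in 2004.
toy <- data.table(
  id       = c("A", "A", "B", "C", "C"),
  year     = c(2004, 2004, 2004, 2004, 2005),
  group    = c("fruitgr2", "fruitgr2", "fruitgr2", "fruitgr1", "fruitgr1"),
  subgroup = c("banana", "kiwi", "kiwi", "apple", "apple")
)

# Persons per fruit group and year: each id counts once per group.
by_group <- toy[, .(n = uniqueN(id)), keyby = .(year, group)]

# Persons per individual fruit and year.
by_fruit <- toy[, .(n = uniqueN(id)), keyby = .(year, group, subgroup)]

# Persons eating any fruit per year.
by_year <- toy[, .(n = uniqueN(id)), keyby = year]
```

Here id "A" contributes one person to `fruitgr2` in 2004 but still appears under both banana and kiwi in `by_fruit`.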

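To complicate the issue further, I need to report the prevalence as:

  • total fruit per 1000 persons per year in the population,
  • each fruit per 1000 persons per year,
  • each fruit group that has two fruits,
  • by age group,
  • by sex.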

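Here is my data set. As you may have noticed, restrictions on my database meant I had to change the actual values. Special thanks to this reproducible-example code and to this answer on how to make your data anonymous.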

#### Data ####
library(data.table)
anonDT <- data.table(structure(list(id = c("E1998", "E2308", "E1421", "E676", "E5061","E4225", "E2600", "E658", "E2331", "E982", "E4790", "E408", "E1048","E3937", "E4554", "E3357", "E2637", "E178", "E3734", "E1217","E3771", "E1954", "E4928", "E3566", "E1106", "E3835", "E1505","E668", "E4083", "E3066", "E3356", "E4910", "E2801", "E1074","E5097", "E610", "E995", "E1001", "E3824", "E3427", "E3885","E648", "E1986", "E4777", "E2546", "E909", "E1954", "E634", "E2602","E531", "E67", "E2418", "E3863", "E4266", "E196", "E657", "E1516","E4722", "E3077", "E3732", "E1556", "E112", "E924", "E2801","E2742", "E3362", "E1880", "E3645", "E3357", "E2519", "E2450","E5162", "E1483", "E3846", "E4539", "E2452", "E282", "E4604","E226", "E5043", "E3909", "E88", "E51", "E1925", "E2776", "E3835","E4746", "E1631", "E4052", "E1128", "E220", "E1390", "E4908","E1385", "E1003", "E5181", "E3835", "E4910", "E3240", "E4380","E3357", "E963", "E706", "E5142", "E2869", "E3839", "E5271","E2584", "E194", "E4366", "E2621", "E932", "E1104", "E1964","E928", "E4377", "E1418", "E2940", "E3420", "E3958", "E4130","E790", "E3667", "E934", "E3356", "E5203", "E3835", "E3356","E3297", "E5203", "E4380", "E668", "E2856", "E4502", "E1054","E3644", "E4641", "E5204", "E2597", "E4432", "E2716", "E2422","E1964", "E1327", "E2028", "E2727", "E1868", "E638", "E88", "E4892","E706", "E5147", "E3130", "E4099", "E4239", "E341", "E593", "E4746","E2291", "E2240", "E2481", "E3846", "E2602", "E1673", "E4772","E2140", "E5024", "E1137", "E2182", "E4366", "E2386", "E648","E3118", "E8", "E2813", "E3422", "E3982", "E2", "E2940", "E2035","E4746", "E5134", "E4380", "E4615", "E1372", "E2249", "E1954","E2418", "E1122", "E3485", "E934", "E3611", "E2665", "E2961","E2108", "E4432", "E2447", "E3413", "E531", "E1751"),
                                sex = structure(c(1L,2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L,2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L,1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L,1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L,1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L,1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L,1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L,2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L,1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L,1L, 2L, 1L, 2L, 1L, 1L, 1L), .Label = c("male", "female"), class = "factor"),group = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L,2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L,1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 3L, 1L,2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L,2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 3L,2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,2L, 2L, 2L, 3L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 1L, 3L, 2L, 1L, 2L, 2L,2L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 3L, 2L, 1L, 1L, 2L, 2L, 2L,2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 3L,2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L), .Label = c("fruitgr1","fruitgr2", "fruitgr3"), class = "factor"),
                                subgroup = structure(c(4L,3L, 3L, 3L, 3L, 4L, 4L, 4L, 2L, 4L, 3L, 3L, 4L, 3L, 3L, 3L,4L, 1L, 3L, 4L, 3L, 1L, 3L, 3L, 1L, 1L, 1L, 4L, 3L, 4L, 4L,3L, 4L, 1L, 4L, 3L, 3L, 4L, 2L, 1L, 4L, 3L, 3L, 4L, 3L, 3L,1L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 3L, 3L, 3L,4L, 3L, 4L, 4L, 3L, 4L, 3L, 3L, 4L, 3L, 1L, 3L, 1L, 4L, 2L,1L, 3L, 1L, 4L, 4L, 4L, 4L, 4L, 1L, 4L, 4L, 3L, 3L, 4L, 3L,4L, 3L, 4L, 3L, 2L, 1L, 3L, 4L, 2L, 3L, 4L, 3L, 4L, 3L, 4L,1L, 3L, 4L, 4L, 4L, 4L, 3L, 1L, 4L, 3L, 3L, 3L, 2L, 2L, 1L,4L, 4L, 3L, 4L, 4L, 4L, 4L, 3L, 4L, 3L, 3L, 2L, 3L, 2L, 4L,2L, 3L, 4L, 1L, 2L, 4L, 1L, 4L, 3L, 4L, 3L, 4L, 4L, 3L, 4L,2L, 1L, 2L, 3L, 1L, 1L, 4L, 3L, 3L, 4L, 1L, 3L, 4L, 4L, 4L,1L, 3L, 3L, 4L, 4L, 3L, 3L, 3L, 4L, 4L, 3L, 1L, 3L, 4L, 4L,4L, 3L, 4L, 4L, 1L, 1L, 4L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 1L,4L, 1L, 4L, 4L), .Label = c("apple", "orange", "banana","kiwi"), class = "factor"),
                                agegr = structure(c(2L, 3L, 2L,2L, 3L, 2L, 2L, 2L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 3L,1L, 2L, 2L, 3L, 2L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 1L, 3L, 3L, 3L, 2L,2L, 1L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,3L, 2L, 2L, 3L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 2L,3L, 3L, 2L, 1L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L,2L, 2L, 2L, 3L, 2L, 2L, 3L, 2L, 1L, 2L, 3L, 2L, 1L, 3L, 1L,1L, 2L, 1L, 2L, 2L, 3L, 2L, 1L, 2L, 1L, 2L, 2L, 3L, 2L, 2L,1L, 2L, 2L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 2L, 2L,2L, 2L, 1L, 2L, 3L, 3L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L,2L, 2L, 3L, 3L, 3L, 2L, 1L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 2L,2L, 2L, 3L, 2L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 3L,3L, 2L, 3L, 3L, 3L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,1L, 2L), .Label = c("19-24", "25-49", "50+"), class = "factor"),year = c(2004, 2007, 2008, 2006, 2008, 2007, 2008, 2007,2007, 2007, 2005, 2005, 2006, 2006, 2006, 2006, 2006, 2004,2008, 2006, 2006, 2007, 2006, 2006, 2005, 2006, 2005, 2005,2006, 2007, 2006, 2008, 2008, 2006, 2004, 2004, 2007, 2005,2008, 2008, 2005, 2007, 2008, 2008, 2008, 2008, 2005, 2008,2008, 2005, 2005, 2006, 2007, 2007, 2006, 2006, 2007, 2007,2008, 2008, 2005, 2007, 2007, 2005, 2008, 2007, 2004, 2008,2007, 2008, 2005, 2005, 2008, 2005, 2007, 2008, 2008, 2005,2004, 2008, 2004, 2007, 2005, 2008, 2004, 2008, 2008, 2006,2008, 2006, 2007, 2008, 2005, 2007, 2007, 2007, 2006, 2007,2007, 2008, 2006, 2005, 2008, 2004, 2008, 2008, 2006, 2005,2004, 2007, 2006, 2004, 2005, 2004, 2006, 2005, 2008, 2004,2007, 2004, 2006, 2008, 2006, 2007, 2008, 2005, 2007, 2007,2006, 2005, 2008, 2004, 2008, 2007, 2008, 2004, 2008, 2007,2007, 2004, 2004, 2008, 2004, 2007, 2008, 2005, 2007, 2005,2008, 2006, 2006, 2007, 2005, 2005, 2006, 2006, 2005, 2007,2006, 2008, 2005, 2006, 2008, 2007, 2005, 2006, 2006, 2007,2007, 2006, 2008, 2007, 2007, 2008, 2005, 2008, 2007, 2004,2005, 2006, 2007, 2006, 2008, 2006, 2008, 
2004, 2006, 2008,2007, 2004, 2008, 2008, 2004, 2008, 2007, 2006, 2005, 2006,2004, 2004)), .Names = c("id", "sex", "group", "subgroup","agegr", "year"), class = c("data.table", "data.frame"), row.names = c(NA,-200L))) 

The population data is here:

#### Population data ####
populDT <- data.table(structure(list(year = 2004:2008, total = c(210696L, 216192L,223472L, 230629L, 233625L), men = c(104770L, 108390L, 113597L,117629L, 118275L), women = c(105926L, 107802L, 109875L, 113000L,115350L), agegrp1 = c(25721L, 25558L, 25933L, 27457L, 28083L), agegrp2 = c(104933L, 107935L, 111796L, 114852L, 115102L),agegrp3 = c(80042L, 82699L, 85743L, 88320L, 90440L)), .Names = c("year","total", "men", "women", "agegrp1", "agegrp2", "agegrp3"), sorted = "year", class = c("data.table","data.frame"), row.names = c(NA, -5L), key='year'))

I have managed to get all the relevant prevalence counts into a single data.table:

allcount <- anonDT[,.N, keyby=list(year,agegr,sex,group,subgroup,id)][,.N,by=list(year,agegr,sex,group,subgroup)]
allcount

I have also made several subset data.tables that contain the counts I need.

As for my questions:

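  1. Is there an easy way to calculate the prevalence proportion (formula: 1000 * persons eating fruit / total population) within data.table for all the relevant strata and groups?
  2. Do I need to merge populDT with allcount?
  3. If so, what is the best way to go about it?
  4. As an added conundrum, what is the best way to create publication-ready tables/graphs from a data.table?

Thanks in advance, everyone. As this is my first post, I hope I have done everything right :)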
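On questions 1 to 3, one possible approach is to reshape the population table to long form and merge it with the counts on year and stratum. The sketch below is only an illustration under assumptions: `popul`, `counts`, `popul_long`, and `prev` are hypothetical names, and the values are the first two years of the `men`/`women` columns taken from populDT and from the count dataset DT2 shown further down.

```r
library(data.table)

# Population by sex, first two years of populDT.
popul <- data.table(
  year  = 2004:2005,
  men   = c(104770L, 108390L),
  women = c(105926L, 107802L)
)

# Distinct persons eating fruit by year and sex (first two years of DT2).
counts <- data.table(
  year = c(2004L, 2004L, 2005L, 2005L),
  sex  = c("male", "female", "male", "female"),
  n    = c(470L, 395L, 616L, 479L)
)

# Long form: one row per year and sex instead of one column per sex.
popul_long <- melt(popul, id.vars = "year", measure.vars = c("men", "women"),
                   variable.name = "sex", value.name = "pop",
                   variable.factor = FALSE)

# Recode the former column names to the labels used in the counts.
popul_long[, sex := ifelse(sex == "men", "male", "female")]

# Merge on year and sex, then compute prevalence per 1000 persons.
prev <- merge(counts, popul_long, by = c("year", "sex"))
prev[, prevalence := round(1000 * n / pop, 2)]
```

The same pattern should extend to the age groups by melting `agegrp1` to `agegrp3` and recoding them to the `agegr` labels.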

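Here is a data table with almost the desired result. I only managed to get it into this state by using a data.frame rather than a data.table. More precisely, I created count variables for everyone and then used aggregate to get it into shape: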

    ### first using max ### 
    prevAN <- aggregate(anon[,7:17], 
                  by = list(anon$year,anon$id,anon$group, anon$subgroup), 
                  FUN = max) 
    ### and then by sum ###
    prevAN1 <- aggregate(prevAN[,5:15], 
                   by = list(prevAN$year), 
                   FUN = sum)
    
    
    ### the count dataset ###
    DT2 <- data.table(structure(list(year = c(2004, 2005, 2006, 2007, 2008), count = c(865,1095, 1355, 1602, 1749), men = c(470, 616, 748, 863, 946), women = c(395,479, 607, 739, 803), agegr1 = c(141, 220, 272, 316, 385), agegr2 = c(552,657, 826, 1001, 1040), agegr3 = c(172, 218, 257, 285, 324), c_fruitgr2 = c(703,910, 1130, 1304, 1451), c_banana = c(153, 397, 618, 798, 950),c_kiwi = c(550, 513, 512, 506, 501), c_apple = c(121, 114,97, 110, 112), c_orange = c(41, 71, 128, 188, 186), total = c(210696L,216192L, 223472L, 230629L, 233625L), men.1 = c(104770L, 108390L,113597L, 117629L, 118275L),
                                 women.1 = c(105926L, 107802L,109875L, 113000L, 115350L), agegrp1 = c(25721L, 25558L, 25933L,27457L, 28083L), agegrp2 = c(104933L, 107935L, 111796L, 114852L,115102L), agegrp3 = c(80042L, 82699L, 85743L, 88320L, 90440L)), .Names = c("year", "count", "men", "women", "agegr1","agegr2", "agegr3", "c_fruitgr2", "c_banana", "c_kiwi", "c_apple","c_orange", "total", "men.1", "women.1", "agegrp1", "agegrp2","agegrp3"), sorted = "year", class = c("data.table", "data.frame"), row.names = c(NA, -5L), key='year')) 
    

Once I get to this stage, I need to calculate the prevalence for each year. With a data.frame I can do this simply as follows:

    total.prev <- DF2$count*1000/DF2$total
    

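Coming back to the problem at hand: first, I love how fast data.table works. Even though I was able to get roughly the data I need using aggregate, I would like to know how to do this in data.table, because I feel it is faster and more productive.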

Second, there is the question of how to do the calculation, either within one data.table or between data.tables.
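For a calculation between two data.tables, an update join may be one option. This is a hedged sketch, assuming a data.table version that supports the `on=` argument; `cnt` and `pop` are hypothetical names holding the first three years of `count` from DT2 and `total` from populDT.

```r
library(data.table)

# Yearly counts of persons eating fruit (first three years of DT2$count).
cnt <- data.table(year = 2004:2006, count = c(865, 1095, 1355))

# Yearly total population (first three years of populDT$total).
pop <- data.table(year = 2004:2006, total = c(210696L, 216192L, 223472L))

# Update join: look up each year's population in pop and add the
# prevalence per 1000 persons to cnt by reference.
# The i. prefix refers to a column of the table given in i (here pop).
cnt[pop, on = "year", prev := round(1000 * count / i.total, 2)]
```

This leaves `cnt` with its original columns plus `prev`, without copying the population columns into it.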

Third, the question of publication tables/graphs is probably one for a second post.

Update: I have figured out a way to calculate the prevalence within the data.table by doing the following:

    # add the prevalence per 1000 persons by reference
    DT2[, `:=`(p1    = count * 1000 / total,
               pM    = men * 1000 / men.1,
               pW    = women * 1000 / women.1,
               pA1   = agegr1 * 1000 / agegrp1,
               pA2   = agegr2 * 1000 / agegrp2,
               pA3   = agegr3 * 1000 / agegrp3,
               pFgr2 = c_fruitgr2 * 1000 / total,
               pB    = c_banana * 1000 / total,
               pK    = c_kiwi * 1000 / total,
               pA    = c_apple * 1000 / total,
               pO    = c_orange * 1000 / total)]
    
    DT2<-round(DT2, 2)  # to round the prevalence numbers 
    

If anyone has something simpler to offer, I would be glad to hear it :)
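One possibly simpler variant, sketched here under assumptions, pairs numerator and denominator column names and lets `Map()` do the division, so that adding a stratum only means extending three vectors. `dt`, `num`, `den`, `out`, and `prev` are hypothetical names; the values are the first two years of a few DT2 columns.

```r
library(data.table)

# First two years and a few columns of DT2.
dt <- data.table(
  year    = 2004:2005,
  count   = c(865, 1095),
  men     = c(470, 616),
  women   = c(395, 479),
  total   = c(210696L, 216192L),
  men.1   = c(104770L, 108390L),
  women.1 = c(105926L, 107802L)
)

num <- c("count", "men", "women")      # numerators: persons eating fruit
den <- c("total", "men.1", "women.1")  # matching denominators: population
out <- c("p1", "pM", "pW")             # names of the new prevalence columns

# One prevalence per 1000 persons for each numerator/denominator pair,
# rounded to 2 decimals.
prev <- Map(function(n, d) round(1000 * dt[[n]] / dt[[d]], 2), num, den)

# Add all prevalence columns by reference in one step.
dt[, (out) := prev]
```

Rounding inside the function touches only the new prevalence columns, instead of rounding the whole table afterwards.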

0 answers:

No answers yet