Automatically compute summary statistics for a data frame and create a new table

Time: 2017-09-15 14:37:58

Tags: r dplyr

I have the following data frame:

col1 <- c("avi","chi","chi","bov","fox","bov","fox","avi","bov",
          "chi","avi","chi","chi","bov","bov","fox","avi","bov","chi")
col2 <- c("low","med","high","high","low","low","med","med","med","high",
          "low","low","high","high","med","med","low","low","med")
col3 <- c(0,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0)

test_data <- cbind(col1, col2, col3)
test_data <- as.data.frame(test_data)

I would like to end up with something like this table (the values are random):

Species  Pop.density  %Resistance  CI_low  CI_high   Total samples
avi      low          2.0          1.2     2.2       30
avi      med          0            0       0.5       20
avi      high         3.5          2.9     4.2       10
chi      low          0.5          0.3     0.7       20
chi      med          2.0          1.9     2.1       150
chi      high         6.5          6.2     6.6       175

The %Resistance column is based on col3 above, where 1 = resistant and 0 = non-resistant. I tried the following:

library(dplyr)
test_data<-test_data %>%
  count(col1,col2,col3) %>%
  group_by(col1, col2) %>%
  mutate(perc_res = prop.table(n)*100)

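This almost works: I get the percentage of 1s and 0s in col3 for each combination of values in col1 and col2. However, the total sample count is wrong, because I am counting over all three columns when the correct count is over col1 and col2 only.

For the confidence interval, I would use the following:

binom.test(resistant_samples, total_samples)$conf.int * 100

But I don't know how to combine it with the rest. Is there a simple, quick way to do this?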
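
For a concrete sense of what that call returns, here is a minimal sketch. The counts (2 resistant out of 3 samples) are illustrative only, and `ci` is a name introduced just for this example. `binom.test()` returns an exact 95% interval as proportions by default, so multiplying by 100 gives percentages.

```r
# Exact binomial confidence interval for 2 resistant samples out of 3.
# The counts 2 and 3 are illustrative values chosen for this sketch.
ci <- binom.test(x = 2, n = 3)$conf.int * 100

ci[1]  # lower bound, in percent
ci[2]  # upper bound, in percent
```

The result is a length-2 numeric vector, so the lower and upper bounds can go into separate columns of a summary table.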

2 Answers:

Answer 0: (score: 6)

I would do...

library(data.table)
setDT(DT)

DT[, { 
  bt <- binom.test(sum(resists), .N)$conf.int*100
  .(res_rate = mean(resists)*100, res_lo = bt[1], res_hi = bt[2], n = .N)
}, keyby=.(species, popdens)]

    species popdens  res_rate    res_lo    res_hi n
 1:     avi     low   0.00000  0.000000  70.75982 3
 2:     avi     med   0.00000  0.000000  97.50000 1
 3:     bov     low 100.00000 15.811388 100.00000 2
 4:     bov     med  50.00000  1.257912  98.74209 2
 5:     bov    high 100.00000 15.811388 100.00000 2
 6:     chi     low   0.00000  0.000000  97.50000 1
 7:     chi     med  50.00000  1.257912  98.74209 2
 8:     chi    high  66.66667  9.429932  99.15962 3
 9:     fox     low   0.00000  0.000000  97.50000 1
10:     fox     med  50.00000  1.257912  98.74209 2

To include all levels (every combination of species and population density)...

DT[CJ(species = species, popdens = popdens, unique = TRUE), on=.(species, popdens), {
  bt <- 
    if (.N > 0L) binom.test(sum(resists), .N)$conf.int*100 
    else NA_real_
  .(res_rate = mean(resists)*100, res_lo = bt[1], res_hi = bt[2], n = .N)    
}, by=.EACHI]

    species popdens  res_rate    res_lo    res_hi n
 1:     avi     low   0.00000  0.000000  70.75982 3
 2:     avi     med   0.00000  0.000000  97.50000 1
 3:     avi    high        NA        NA        NA 0
 4:     bov     low 100.00000 15.811388 100.00000 2
 5:     bov     med  50.00000  1.257912  98.74209 2
 6:     bov    high 100.00000 15.811388 100.00000 2
 7:     chi     low   0.00000  0.000000  97.50000 1
 8:     chi     med  50.00000  1.257912  98.74209 2
 9:     chi    high  66.66667  9.429932  99.15962 3
10:     fox     low   0.00000  0.000000  97.50000 1
11:     fox     med  50.00000  1.257912  98.74209 2
12:     fox    high        NA        NA        NA 0
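
The empty combinations appear because `CJ(..., unique = TRUE)` builds every pairing of the distinct values before the join. A small sketch on made-up vectors (the values and the name `grid` are assumptions for illustration) shows the grid on its own:

```r
library(data.table)

# Hypothetical input vectors with repeated values.
species <- c("avi", "chi", "avi")
popdens <- c("low", "high", "low")

# Cross join of the distinct values: 2 species x 2 densities.
grid <- CJ(species = species, popdens = popdens, unique = TRUE)
grid
nrow(grid)
```

Joining the data onto such a grid with `by = .EACHI` evaluates `j` once per grid row, including rows that have no matching data.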

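How it works

The syntax is DT[i, j, by=], where...

  • i determines the subset of rows, sometimes with the helper arguments on= or roll=
  • by= determines the groups in the subsetted table; switch to keyby= if the result should be sorted
  • j is the code that is run for each group.

j should evaluate to a list, and .() is a shortcut for list(). For details, see ?data.table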
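
As a rough illustration of the three parts, here is a sketch on a toy table. The table `toy` and its columns `grp` and `val` are hypothetical and are not part of the data in this thread.

```r
library(data.table)

# Hypothetical toy table, used only to illustrate the DT[i, j, by] form.
toy <- data.table(
  grp = c("a", "a", "b", "b", "b"),
  val = c(1, 0, 1, 1, 0)
)

# i:     keep only rows where val is not missing
# j:     a list built with .(), evaluated once per group
# keyby: group by grp and sort the result by it
res <- toy[!is.na(val), .(rate = mean(val) * 100, n = .N), keyby = .(grp)]
res
```

Each element of the list returned by `j` becomes a column of the result, with one row per group.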

Data used

(Renaming the columns, converting the binary variable back to 0/1 or FALSE/TRUE, and setting the population-density levels in the correct order):

DT = data.frame(
  species = col1, 
  popdens = factor(col2, levels=c("low", "med", "high")), 
  resists = col3
)
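
If only the question's test_data is at hand rather than the original vectors, the same cleanup could be sketched as follows. This is an assumption about how one might do it, not part of the answer; the name `DT2` is hypothetical. Because `cbind()` builds a character matrix, col3 arrives as character (or as a factor in R versions before 4.0.0) and has to be converted back to numbers.

```r
# Rebuild the question's test_data; cbind() coerces every column to character.
col1 <- c("avi","chi","chi","bov","fox","bov","fox","avi","bov",
          "chi","avi","chi","chi","bov","bov","fox","avi","bov","chi")
col2 <- c("low","med","high","high","low","low","med","med","med","high",
          "low","low","high","high","med","med","low","low","med")
col3 <- c(0,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0)
test_data <- as.data.frame(cbind(col1, col2, col3))

# as.character() comes first so the conversion also works when a column
# was turned into a factor (the as.data.frame() default before R 4.0.0).
DT2 <- data.frame(
  species = as.character(test_data$col1),
  popdens = factor(test_data$col2, levels = c("low", "med", "high")),
  resists = as.numeric(as.character(test_data$col3))
)
str(DT2)
```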

Answer 1: (score: 3)

This should do it.

library(tidyverse)
library(broom)

test_data %>%
  mutate(col3 = ifelse(col3 == 0, "NonResistant", "Resistant")) %>%
  count(col1, col2, col3) %>%
  spread(col3, n, fill = 0) %>%
  # multiply by 100 so the value is a percentage, like the CI columns below
  mutate(PercentResistant = Resistant / (NonResistant + Resistant) * 100) %>%
  mutate(test = map2(Resistant, NonResistant, ~ binom.test(.x, .x + .y) %>% tidy())) %>%
  # name the list column explicitly; a bare unnest() is deprecated in current tidyr
  unnest(test) %>%
  transmute(Species = col1, Pop.density = col2, PercentResistant,
            CI_low = conf.low * 100, CI_high = conf.high * 100,
            TotalSamples = Resistant + NonResistant)
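
  1. Recode the 0/1 resistance column so that it has readable values.
  2. Count the rows in each bucket.
  3. Spread col3 / n into two columns, Resistant / NonResistant, and put the counts (n) into those columns. Each row now has everything the test needs.
  4. Compute the percent resistant.
  5. Run the test for each bucket and put the result in a nested data frame called test.
  6. Unnest the test data frame so that the test results can be used.
  7. Clean up and give everything good names.
  8. Result (shown as an image in the original answer).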

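
A shorter route to the same kind of table, offered as a sketch rather than as the answerer's method, is to group by the two columns and call `binom.test()` directly inside `summarise()`. It assumes dplyr 1.0.0 or later (for the `.groups` argument) and that col3 is numeric, so the data is rebuilt here with `data.frame()` instead of `cbind()`. The name `summary_tbl` is introduced only for this example.

```r
library(dplyr)

# Same values as in the question, built with data.frame() so col3 stays numeric.
test_data <- data.frame(
  col1 = c("avi","chi","chi","bov","fox","bov","fox","avi","bov",
           "chi","avi","chi","chi","bov","bov","fox","avi","bov","chi"),
  col2 = c("low","med","high","high","low","low","med","med","med","high",
           "low","low","high","high","med","med","low","low","med"),
  col3 = c(0,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0)
)

summary_tbl <- test_data %>%
  group_by(Species = col1, Pop.density = col2) %>%
  summarise(
    PercentResistant = mean(col3) * 100,
    # n() is the group size over col1 and col2 only, as the question requires
    CI_low  = binom.test(sum(col3), n())$conf.int[1] * 100,
    CI_high = binom.test(sum(col3), n())$conf.int[2] * 100,
    TotalSamples = n(),
    .groups = "drop"
  )
summary_tbl
```

Grouping on col1 and col2 alone fixes the total-count problem described in the question. Combinations with no rows at all (such as avi/high) are simply absent from this result.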