"Summarise" query with dplyr vs plyr

Time: 2016-07-08 10:19:22

Tags: r dplyr plyr

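This is my first question, so I apologise if I have not followed the right protocol for asking.

I am comparing dplyr and plyr in R for summarising data in a data frame.

The data frame is simple. I have a drug and a set of patients, and each patient has a set of responses, each consisting of a sample and a value: the level of the drug in that sample.

The operation I am performing is summarising level, the patient's response to the drug, over the samples for each combination of drug and patient.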

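The summarise operation gives different answers with the two libraries. plyr looks right. The sum in the metric of the third row should not be NA, because there are no NAs in this subset. plyr matches what I calculate by hand for this group.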
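For reference, the hand calculation for the third group can be sketched in base R without either package. This is only a rough check, assuming the five level values for patient "AD" from the reproducible example; the variable names `ad_level`, `ad_sum_level` and `ad_metric` are made up for illustration.

```r
# Level values for patient "AD", copied from the reproducible example
ad_level <- c(0.00338149695926075, 0.000365122058519732, 0.0138121831347925,
              0.000309530166151126, 0.00518926294072875)

# mutate step: every row in the group receives the group total
ad_sum_level <- rep(sum(ad_level), length(ad_level))

# summarise step: add up the repeated totals
ad_metric <- sum(ad_sum_level)

anyNA(ad_level)   # FALSE: nothing is missing in this group
ad_metric         # about 0.1153, the value plyr reports for AD
```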

Any idea what is going on? I suspect it has something to do with how dplyr handles the NAs for the first patient, "AB", in the summarise step.
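As background for that suspicion, here is a small base R sketch of how `sum()` itself treats an all-NaN vector like the level values of patient "AB". It only illustrates base R behaviour and says nothing about what dplyr does internally; the vector `x` is a stand-in, not taken from the data frame.

```r
# Five NaN values, the same shape as the level column for patient "AB"
x <- rep(NaN, 5)

sum(x)                 # NaN: missing values propagate through the sum
sum(x, na.rm = TRUE)   # 0: every NaN is dropped, leaving an empty sum
is.na(x)               # TRUE for each element, since NaN counts as missing
is.nan(NA_real_)       # FALSE: NA is not NaN, so the two are distinct values
```

So a result of NaN for "AB" is what plain `sum()` would give, while NA is a different value.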

Reproducible example

library(plyr)
library(dplyr)

panel <- structure(list(drug = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                  1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Paracetamol", class = "factor"), 
               patient = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
                                     2L, 3L, 3L, 3L, 3L, 3L), .Label = c("AB", "AC", "AD"), class = "factor"), 
               sample = structure(c(6L, 8L, 12L, 9L, 5L, 3L, 9L, 1L, 
                                         2L, 11L, 7L, 10L, 2L, 13L, 4L), 
                                       .Label = c("AH", "AT", "BV", 
                                        "CD", "CK", "CM", "CU", "CV", "CZ", "DK", "DM", "DN", "DO"
                                         ), class = "factor"), 
               level = c(NaN, NaN, NaN, NaN, NaN, 
                         0.00153937362708914, 0.000136048826793052, 0.0589067431555789, 
                         0.00798507232520125, 0.000179913435935396, 0.00338149695926075, 
                         0.000365122058519732, 0.0138121831347925, 0.000309530166151126, 
                         0.00518926294072875)), .Names = c("drug", "patient", "sample_type", 
                         "level"), row.names = c(NA, -15L), class = "data.frame")

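# plyr: attach each group's total to its rows, then sum those totals per group
plyr_version <- ddply(panel,
                      .(drug, patient),
                      mutate,
                      sum_level = sum(level)) %>%
  ddply(.(drug, patient), summarise, metric = sum(sum_level))

# dplyr: the same two steps with group_by, mutate and summarise
dplyr_version <- group_by(panel, drug, patient) %>%
  mutate(sum_level = sum(level)) %>% 
  summarise(metric = sum(sum_level))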

print("Plyr\n")
print.data.frame(plyr_version)

print("Dplyr")
print.data.frame(dplyr_version)
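A third, package-free version of the same two steps might help decide which result to trust. The sketch below uses base R's `ave()` and `tapply()` on a tiny made-up data frame (`toy`, with hypothetical patients "P1" and "P2"), not the data above; the same pattern should carry over to `panel`.

```r
# Toy data with hypothetical values: one all-NaN group and one complete group
toy <- data.frame(
  patient = rep(c("P1", "P2"), each = 3),
  level   = c(NaN, NaN, NaN, 1, 2, 3)
)

# Equivalent of the mutate step: attach each group's total to its rows
toy$sum_level <- ave(toy$level, toy$patient, FUN = sum)

# Equivalent of the summarise step: sum the repeated totals per group
base_metric <- tapply(toy$sum_level, toy$patient, sum)

base_metric[["P1"]]   # NaN, because the whole group is NaN
base_metric[["P2"]]   # 18, i.e. 3 rows times the group total of 6
```

A complete group gives a finite metric here, which is the behaviour I expected for patient "AD".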

Results obtained

[1] "Plyr\n"
         drug patient    metric
1 Paracetamol      AB       NaN
2 Paracetamol      AC 0.3437358
3 Paracetamol      AD 0.1152880
[1] "Dplyr"
         drug patient    metric
1 Paracetamol      AB        NA
2 Paracetamol      AC 0.3437358
3 Paracetamol      AD        NA

If I use na.rm = TRUE in the sum_level step, the results match, i.e.

sum_level = sum(level, na.rm = TRUE))

giving:

[1] "Plyr"
         drug patient    metric
1 Paracetamol      AB 0.0000000
2 Paracetamol      AC 0.3437358
3 Paracetamol      AD 0.1152880
[1] "Dplyr"
         drug patient    metric
1 Paracetamol      AB 0.0000000
2 Paracetamol      AC 0.3437358
3 Paracetamol      AD 0.1152880

Edit: added sessionInfo

0 answers:

No answers