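When some groups contain NA values, dplyr::mutate() may return incorrect results, depending on which groups are included. After a cursory search I did not find a similar question on StackOverflow, although this may be related to dplyr issue #1545. Any explanation of this pattern would be greatly appreciated.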
The following example calculates group means via mutate. (Using summarize instead of mutate does not cause this problem.)
library(dplyr)

# Sample function: calculate group means
foo <- function(data, groups) {
  data %>%
    filter(Group %in% groups) %>%
    group_by(Group) %>%
    mutate(mean = mean(value, na.rm = TRUE))
}
set.seed(1)
df <- data.frame(Group = rep(1:4, each = 2),
                 value = c(rep(NA, 2), sample(1:10, 6, replace = TRUE)))
df
#  Group value
#1     1    NA
#2     1    NA
#3     2     3
#4     2     4
#5     3     6
#6     3    10
#7     4     3
#8     4     9
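In this data.frame, group 1 consists entirely of NA values. Running foo on group 1 should yield NA, while groups 2, 3, and 4 should yield 3.5, 8, and 6, respectively. However, the actual results depend on which groups are included: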
foo(df, 1:4) #group 3 fails
#  Group value  mean
#  (int) (int) (dbl)
#1     1    NA    NA
#2     1    NA    NA
#3     2     3   3.5
#4     2     4   3.5
#5     3     6    NA  #<-- should be 8
#6     3    10    NA  #<-- should be 8
#7     4     3   6.0
#8     4     9   6.0
foo(df, 2:4) #correct
foo(df, 3:1) #group 3 fails
foo(df, c(1,3)) #group 3 fails
foo(df, c(2,3)) #correct
foo(df, c(3,4)) #correct
foo(df, c(1,2)) #group 2 fails
foo(df, c(1,3,4)) #correct
foo(df, c(1,2,4)) #correct
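For comparison, here is a minimal base R sketch of the values I would expect in the mean column, computed without dplyr. The data is hardcoded from the printed df rather than regenerated with sample(), since the output of set.seed(1) may differ across R versions. The column name expected is my own choice for illustration. Note that base R gives NaN rather than NA for the all-NA group, because the mean of an empty vector is NaN.

```r
# Values copied from the printed df, so the result does not depend on
# the random number generator of the installed R version.
df <- data.frame(Group = rep(1:4, each = 2),
                 value = c(NA, NA, 3L, 4L, 6L, 10L, 3L, 9L))

# Per-row group mean, ignoring NA values within each group.
# `expected` is an illustrative column name, not part of the original code.
df$expected <- ave(df$value, df$Group,
                   FUN = function(x) mean(x, na.rm = TRUE))

df
```

Any row where the mutate result differs from expected (other than the NA versus NaN distinction for group 1) is one of the failing cases listed above.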
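After testing various versions of df with different arrangements of NA values, it appears that the mean is not computed for some groups when an earlier group consists of NA values. But why does the output vary? For instance, in the example above, group 3 tends to fail, while the mean for group 2 is usually computed successfully.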
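Since summarize does not show the problem, one possible workaround is to compute the means with summarise and join them back onto the rows. The sketch below assumes that behaviour holds in general; foo_summarise is a hypothetical helper name, and I have not verified it against every dplyr version.

```r
library(dplyr)

# Values copied from the printed df above.
df <- data.frame(Group = rep(1:4, each = 2),
                 value = c(NA, NA, 3L, 4L, 6L, 10L, 3L, 9L))

# Hypothetical helper (not part of the original code): same interface as
# foo, but the group means come from summarise() and are joined back on.
foo_summarise <- function(data, groups) {
  subset_data <- data %>%
    filter(Group %in% groups)

  # One row per group, holding that group's mean.
  group_means <- subset_data %>%
    group_by(Group) %>%
    summarise(mean = mean(value, na.rm = TRUE))

  # Attach each group's mean to every row of that group.
  left_join(subset_data, group_means, by = "Group")
}

foo_summarise(df, c(1, 3))
```

This avoids the grouped mutate call entirely, at the cost of an extra join.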
sessionInfo()
#R version 3.1.3 (2015-03-09)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#Running under: Windows 7 x64 (build 7601) Service Pack 1
#
#locale:
# [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
#[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
#[5] LC_TIME=English_United States.1252
#
#attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
#other attached packages:
# [1] dplyr_0.4.3
#
#loaded via a namespace (and not attached):
# [1] assertthat_0.1 DBI_0.4 magrittr_1.5 parallel_3.1.3 R6_2.1.2
#[6] Rcpp_0.12.4 rsconnect_0.4.3 tools_3.1.3