Why do group_by and group_by_ give different answers when summarizing by two variables?

Asked: 2016-11-08 19:53:04

Tags: r dplyr

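In the following example, I want to compute summary statistics grouped by two variables. When I do this with dplyr::group_by, I get the correct answer. When I use dplyr::group_by_, it groups only by the first variable, so the result is aggregated one level further than I want.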

library(dplyr)
set.seed(919)
df <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(3, 3, 4, 4, 5, 5),
  x = runif(6)
)

# Gives correct answer
df %>%
  group_by(a, b) %>%
  summarize(total = sum(x))

# Source: local data frame [4 x 3]
# Groups: a [?]
# 
#       a     b     total
#   <dbl> <dbl>     <dbl>
# 1     1     3 1.5214746
# 2     1     4 0.7150204
# 3     2     4 0.1234555
# 4     2     5 0.8208454

# Wrong answer -- too many levels summarized
df %>%
  group_by_(c("a", "b")) %>%
  summarize(total = sum(x))
# # A tibble: 2 × 2
#       a     total
#   <dbl>     <dbl>
# 1     1 2.2364950
# 2     2 0.9443009

What is going on?

1 answer:

Answer 0 (score: 4)

If you want to use a vector of variable names, pass it to the .dots argument:

df %>%
      group_by_(.dots = c("a", "b")) %>%
      summarize(total = sum(x))

#Source: local data frame [4 x 3]
#Groups: a [?]

#      a     b     total
#  <dbl> <dbl>     <dbl>
#1     1     3 1.5214746
#2     1     4 0.7150204
#3     2     4 0.1234555
#4     2     5 0.8208454
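
A plausible reason the vector has to go through .dots: each argument passed through ... is treated as one grouping expression. My understanding, which is an assumption about the lazyeval-based dplyr of that era rather than something stated in this thread, is that a character argument was parsed and only the first resulting expression was kept. The base R sketch below shows how a length-2 character vector collapses to a single symbol under that assumption; it does not call dplyr internals, and the variable names are illustrative.

```r
# Illustrative only: mimics "parse the strings, keep the first expression".
vars <- c("a", "b")

# parse() returns one expression per input string.
exprs <- parse(text = vars)

# Keeping only the first element silently drops "b".
first <- exprs[[1]]

n_parsed <- length(exprs)
kept <- as.character(first)
```

If that is what happens, group_by_(c("a", "b")) ends up equivalent to grouping by a alone, which matches the two-row result in the question.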

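Or you can pass each name as a separate string argument, in the same way you would list the columns in the NSE version: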

df %>%
     group_by_("a", "b") %>%
     summarize(total = sum(x))

#Source: local data frame [4 x 3]
#Groups: a [?]

#      a     b     total
#  <dbl> <dbl>     <dbl>
#1     1     3 1.5214746
#2     1     4 0.7150204
#3     2     4 0.1234555
#4     2     5 0.8208454
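
The underscore verbs such as group_by_ have been deprecated since dplyr 0.7.0. For readers on a newer version, here is a minimal sketch of the same grouping from a character vector, assuming dplyr 1.0.0 or later (where across(), all_of() inside group_by(), and the .groups argument are available). The names group_vars and result are illustrative and not part of the original answer.

```r
library(dplyr)

set.seed(919)
df <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(3, 3, 4, 4, 5, 5),
  x = runif(6)
)

# Column names to group by, held as a character vector.
group_vars <- c("a", "b")

# all_of() selects the columns named in the vector;
# across() hands that selection to group_by().
result <- df %>%
  group_by(across(all_of(group_vars))) %>%
  summarize(total = sum(x), .groups = "drop")

result
```

This should return one row per distinct (a, b) pair, the same shape as the .dots version above.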