Are there any drawbacks to using mutate + filter instead of summarise on a grouped data frame?

Asked: 2017-05-26 04:19:27

Tags: r dplyr

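In dplyr 0.5.0, calling summarise on a grouped data frame does not guarantee any particular order of the result rows (currently it reorders the rows by group; I am not sure how it handles duplicated grouping levels).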

To work around this, I want to replace every summarise(x = ...) call with mutate(x = ...) %>% filter(row_number() == 1). Are there any drawbacks or pitfalls in doing so?

An example of both operations:

library(dplyr)

# Group 2 appears before group 1, so any reordering is visible in the output
tmp_df <- 
    data.frame(group = rep(c(2L, 1L), each = 5), b = rep(c(-1, 1), each = 5)) %>%
    group_by(group)

tmp_df %>%
    summarise(b = sum(b))

tmp_df %>%
    mutate(b = sum(b)) %>%
    filter(row_number() == 1)
This produces:

> tmp_df %>%
+     summarise(b = sum(b))
# A tibble: 2 × 2
  group     b
  <int> <dbl>
1     1     5
2     2    -5
> tmp_df %>%
+     mutate(b = sum(b)) %>%
+     filter(row_number() == 1)
Source: local data frame [2 x 2]
Groups: group [2]

  group     b
  <int> <dbl>
1     2    -5
2     1     5

Edit: in response to the comments, for readability I can define a function:

summarise_o <- function (.data, ...) {
    # order-preserving summarise: compute the summary with mutate,
    # then keep only the first row of each group
    mutate_(.data, .dots = lazyeval::lazy_dots(...)) %>%
        filter(row_number() == 1)
}

and then simply call:

tmp_df %>%
    summarise_o(b = sum(b))
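
For comparison, here is a minimal sketch of two ways the forms differ beyond row order: mutate keeps every column and keeps the grouping, while summarise keeps only the grouping and summarised columns. The `extra` column and the names `df`, `via_summarise` and `via_mutate` are hypothetical additions for illustration.

```r
library(dplyr)

df <- data.frame(
    group = rep(c(2L, 1L), each = 5),
    b     = rep(c(-1, 1), each = 5),
    extra = 1:10   # hypothetical extra column, not in the example above
) %>%
    group_by(group)

via_summarise <- df %>%
    summarise(b = sum(b))

via_mutate <- df %>%
    mutate(b = sum(b)) %>%
    filter(row_number() == 1)

names(via_summarise)         # only the grouping and summarised columns
names(via_mutate)            # 'extra' is carried along (first value per group)
is.grouped_df(via_summarise) # a single grouping level is dropped by summarise
is.grouped_df(via_mutate)    # mutate + filter leaves the grouping in place
```

If the remaining columns are not wanted, a `select()` and an `ungroup()` after the filter would be needed to match the shape of the summarise result.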

1 answer:

Answer 0 (score: 2)

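One option is to make 'group' a factor whose levels follow the order in which the values first appear, so that summarise orders the result by those levels: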

tmp_df <- data.frame(group = rep(c(2L, 1L), each = 5), b = rep(c(-1, 1), each = 5)) %>%
             group_by(group = factor(group, levels = unique(group)))

tmp_df %>%
    summarise(b = sum(b))
# A tibble: 2 x 2
#    group     b
#   <fctr> <dbl>
#1      2    -5
#2      1     5
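
One side effect of this approach is that 'group' comes back as a factor rather than an integer. A minimal sketch of converting it back after summarising, assuming the original integer type is wanted downstream (the name `result` is hypothetical):

```r
library(dplyr)

tmp_df <- data.frame(group = rep(c(2L, 1L), each = 5), b = rep(c(-1, 1), each = 5)) %>%
    group_by(group = factor(group, levels = unique(group)))

result <- tmp_df %>%
    summarise(b = sum(b)) %>%
    # go through character first: as.integer() on a factor returns level codes
    mutate(group = as.integer(as.character(group)))

result
```

Calling `as.integer()` directly on the factor would return the level codes (1, 2) instead of the original values (2, 1), which is why the conversion goes through `as.character()`.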