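In dplyr 0.5.0, calling summarise on a grouped data frame does not guarantee any particular order of the result rows (currently it reorders the rows by group; I am not sure how it handles duplicated grouping levels).
To work around this, I would like to replace every summarise(x = ...) operation with mutate(x = ...) %>% filter(row_number() == 1). Are there any drawbacks or pitfalls in doing so?
Example of both operations: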
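library(dplyr)

# grouped data frame whose groups first appear in the order 2, 1
tmp_df <-
  data.frame(group = rep(c(2L, 1L), each = 5), b = rep(c(-1, 1), each = 5)) %>%
  group_by(group)

# option 1: summarise (rows come back sorted by group)
tmp_df %>%
  summarise(b = sum(b))

# option 2: mutate, then keep the first row of each group (original order kept)
tmp_df %>%
  mutate(b = sum(b)) %>%
  filter(row_number() == 1)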
which produces:
> tmp_df %>%
+ summarise(b = sum(b))
# A tibble: 2 × 2
group b
<int> <dbl>
1 1 5
2 2 -5
> tmp_df %>%
+ mutate(b = sum(b)) %>%
+ filter(row_number() == 1)
Source: local data frame [2 x 2]
Groups: group [2]
group b
<int> <dbl>
1 2 -5
2 1 5
Edit: in response to the comments, for readability I could define a function:
summarise_o <- function (.data, ...) {
  # order-preserving summarise: compute the summary with mutate,
  # then keep only the first row of each group
  mutate_(.data, .dots = lazyeval::lazy_dots(...)) %>%
    filter(row_number() == 1)
}
and then simply call:
tmp_df %>%
summarise_o(b = sum(b))
Answer 0 (score: 2)
One option is to create 'group' as a factor whose levels follow the order in which the groups first appear:
tmp_df <- data.frame(group = rep(c(2L, 1L), each = 5), b = rep(c(-1, 1), each = 5)) %>%
group_by(group = factor(group, levels = unique(group)))
tmp_df %>%
summarise(b = sum(b))
# A tibble: 2 x 2
# group b
# <fctr> <dbl>
#1 2 -5
#2 1 5
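
One side effect of the factor approach is that group comes back as a factor rather than an integer. If the original type matters, it can be converted back after summarising. The following is only a minimal sketch under that assumption; the name res is introduced here purely for illustration and is not part of the original answer.

```r
library(dplyr)

tmp_df <- data.frame(group = rep(c(2L, 1L), each = 5),
                     b = rep(c(-1, 1), each = 5))

res <- tmp_df %>%
  # factor levels follow first appearance, so summarise keeps that order
  group_by(group = factor(group, levels = unique(group))) %>%
  summarise(b = sum(b)) %>%
  # convert via character: as.integer() applied directly to a factor
  # would return the internal level codes, not the original values
  mutate(group = as.integer(as.character(group)))

res
```

With the example data, res should keep the rows in first-appearance order (group 2, then group 1) while group is an integer column again.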