Question

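I'm trying to get my head around dplyr. I have a sorted data frame that I want to group based on a variable. However, the groups need to be constructed so that each group has a minimum sum of 30 on the grouping variable.

Consider this small example data frame: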

df1 <- matrix(
  data = c( 5, 0.9, 95,
           12, 0.8, 31,
           16, 0.8, 28,
           17, 0.7, 10,
           23, 0.8, 11,
           55, 0.6,  9,
           56, 0.5, 12,
           57, 0.2,  1,
           59, 0.4,  1),
  ncol = 3,
  byrow = TRUE,
  dimnames = list(1:9, c('freq', 'mean', 'count'))
)

Now, I want to group so that the sum of count is at least 30. freq and mean should then be collapsed into a weighted.mean, with the count values as weights. Note that the last "bin" reaches a sum of 32 by row 7, but since rows 8:9 only add up to 2, I add them to the last "bin".

Like so:

 freq mean count
 5.00 0.90    95
12.00 0.80    31
16.26 0.77    38
45.18 0.61    34

Doing the simple summarizing with dplyr is not a problem, but this I can't figure out. I do think the solution is hidden somewhere here:

Dynamic Grouping in R | Grouping based on condition on applied function

But how to apply it to my situation escapes me.
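
As a sanity check on the expected output, the third row can be reproduced by hand. This is only a small sketch using base R's weighted.mean on rows 3:4 of the example; the variable names bin_freq and bin_mean are made up for illustration.

```r
# Rows 3:4 of the example form the third bin (28 + 10 = 38, which is >= 30)
freq  <- c(16, 17)
mean_ <- c(0.8, 0.7)   # trailing underscore avoids shadowing base::mean
count <- c(28, 10)

bin_freq <- weighted.mean(freq, count)    # (16*28 + 17*10) / 38
bin_mean <- weighted.mean(mean_, count)   # (0.8*28 + 0.7*10) / 38

# Rounded to two decimals: freq 16.26, mean 0.77, count 38
round(c(freq = bin_freq, mean = bin_mean, count = sum(count)), 2)
```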

Answer 1

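I wish I had a shorter solution, but here's what I came up with.

First we define a custom cumsum function that starts over once the running total has reached 30: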

cumsum2 <- function(x){
  # running sum that restarts from 0 once the previous total has reached 30
  Reduce(function(.x, .y){
    prev <- if (tail(.x, 1) >= 30) 0 else tail(.x, 1)
    c(.x, prev + .y)
  }, x, 0)[-1]
}
# cumsum2(1:10)
# [1]  1  3  6 10 15 21 28 36  9 19
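
If the Reduce call is hard to read, the same idea can be sketched as a plain loop. The function name reset_cumsum and its threshold argument are hypothetical, and the sketch assumes "at least 30" means the total restarts as soon as it reaches 30 or more.

```r
# Loop-based sketch of a cumulative sum that restarts after reaching a threshold.
reset_cumsum <- function(x, threshold = 30) {
  out <- numeric(length(x))
  running <- 0
  for (i in seq_along(x)) {
    running <- running + x[i]
    out[i] <- running
    # once the bin is full, the next element starts a fresh total
    if (running >= threshold) running <- 0
  }
  out
}

# Applied to the count column of the example:
reset_cumsum(c(95, 31, 28, 10, 11, 9, 12, 1, 1))
# [1] 95 31 28 38 11 20 32  1  2
```

Every value of 30 or more marks the last row of a bin; the trailing 1 and 2 are the leftover rows that never fill a bin on their own.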

Then we can have fun with a dplyr chain:

library(dplyr)
library(tidyr)

df1 %>%
  as.data.frame() %>%                      # as you started with a matrix
  mutate(id = row_number(),                # we'll need this to sort in the end
         cumcount = cumsum2(count))    %>% # add the resetting cumulative count
  mutate(cumcount = replace(cumcount, cumcount < 30, NA)) %>% # set values below 30 to NA ...
  fill(cumcount, .direction = "up")    %>% # ... so each row takes the total of the bin it belongs to
  fill(cumcount, .direction = "down")  %>% # the trailing NAs belong to the last bin, so fill down too
  group_by(cumcount)                   %>% # these are our new groups to aggregate freq and mean
  summarize(id = min(id),
            freq = sum(freq * count) / sum(count),   # weighted mean of freq
            mean = sum(mean * count) / sum(count)) %>% # weighted mean of mean
  arrange(id)                          %>% # restore the original row order
  select(freq, mean, count = cumcount)     # and lay out as expected output

# # A tibble: 4 x 3
#       freq      mean count
#      <dbl>     <dbl> <dbl>
# 1  5.00000 0.9000000    95
# 2 12.00000 0.8000000    31
# 3 16.26316 0.7736842    38
# 4 45.17647 0.6117647    32
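
Two caveats about the chain above. The count column reports the resetting cumulative sum, so the last bin shows 32 rather than the 34 the question asks for. Also, grouping on the cumulative total would merge two separate bins that happen to reach the same total. A possible variation, sketched here under the assumption that a running bin index is acceptable, groups on an explicit id instead. bin_id is a hypothetical helper, not a dplyr function.

```r
library(dplyr)

df1 <- data.frame(
  freq  = c(5, 12, 16, 17, 23, 55, 56, 57, 59),
  mean  = c(0.9, 0.8, 0.8, 0.7, 0.8, 0.6, 0.5, 0.2, 0.4),
  count = c(95, 31, 28, 10, 11, 9, 12, 1, 1)
)

# Hypothetical helper: give each row a bin id, closing a bin once its running
# count reaches `threshold`. Leftover rows at the end join the last full bin.
bin_id <- function(count, threshold = 30) {
  id <- integer(length(count))
  running <- 0
  current <- 1L
  for (i in seq_along(count)) {
    id[i] <- current
    running <- running + count[i]
    if (running >= threshold) {
      running <- 0
      current <- current + 1L
    }
  }
  # rows still labelled `current` never filled a bin; merge them backwards
  if (current > 1L) id[id == current] <- current - 1L
  id
}

result <- df1 %>%
  mutate(bin = bin_id(count)) %>%
  group_by(bin) %>%
  summarize(freq  = weighted.mean(freq, count),
            mean  = weighted.mean(mean, count),
            count = sum(count))   # computed last so the weights above use the raw counts

result
```

For the example data this gives bins 1, 2, 3, 3, 4, 4, 4, 4, 4, with a total count of 34 in the last bin.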
