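I am trying to get my head around dplyr. I have a sorted data frame that I want to group based on a variable. However, the groups need to be built so that each one has a minimum sum of 30 on the grouping variable.
Consider this small example data frame: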
df1 <- matrix(data = c(5, 0.9, 95, 12, 0.8, 31,
                       16, 0.8, 28, 17, 0.7, 10,
                       23, 0.8, 11, 55, 0.6, 9,
                       56, 0.5, 12, 57, 0.2, 1,
                       59, 0.4, 1),
              ncol = 3,
              byrow = TRUE,
              dimnames = list(c(1:9),
                              c('freq', 'mean', 'count')))
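Now I want to group the rows so that count sums to at least 30 in each group. freq and mean should then be collapsed into a weighted.mean, where the weights are the count values.
Note that the last "bin" reaches a sum of 32 by row 7, but since rows 8:9 only add up to 2, I add them to the last "bin".
Like this:
 freq  mean  count
 5.00  0.90     95
12.00  0.80     31
16.26  0.77     38
45.18  0.61     34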
Simple summarising with dplyr is not a problem, but this I cannot figure out. I do think the solution is hidden somewhere here:
Dynamic Grouping in R | Grouping based on condition on applied function
But how to apply it to my situation escapes me.
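To make the target concrete, one expected row can be checked by hand. The sketch below reproduces the third bin (rows 3 and 4 of the example data) with base R's weighted.mean; the vector names are only illustrative, and `mean_` is an assumed name chosen to avoid clashing with the mean function.

```r
# Rows 3 and 4 of the example: counts 28 + 10 = 38, which is at least 30
freq  <- c(16, 17)
mean_ <- c(0.8, 0.7)   # illustrative name, avoids masking base::mean
count <- c(28, 10)

bin3 <- c(freq  = weighted.mean(freq, count),   # (16*28 + 17*10) / 38
          mean  = weighted.mean(mean_, count),  # (0.8*28 + 0.7*10) / 38
          count = sum(count))
round(bin3, 2)
# freq 16.26, mean 0.77, count 38
```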
Answer 0 (score: 2)
I wish I had a shorter solution, but here is what I came up with.
First we define a custom cumsum function:
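cumsum2 <- function(x) {
  # cumulative sum that restarts once the previous total has reached 30
  # (">= 30" so that a total of exactly 30 also closes a group)
  Reduce(function(.x, .y) {
    x1 <- if (tail(.x, 1) >= 30) 0 else tail(.x, 1)
    c(.x, x1 + .y)
  }, x, 0)[-1]
}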
# cumsum2(1:10)
# [1] 1 3 6 10 15 21 28 36 9 19
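To see what the restarting sum does on the question's data, here is a minimal loop-based sketch of the same idea. `restart_cumsum` and its `min_sum` argument are hypothetical names introduced for illustration, and the sketch assumes a group closes as soon as its total is at least 30.

```r
# Hypothetical loop-based version of the restarting cumulative sum
restart_cumsum <- function(x, min_sum = 30) {
  out <- numeric(length(x))
  total <- 0
  for (i in seq_along(x)) {
    if (total >= min_sum) total <- 0   # previous group is complete, start over
    total <- total + x[i]
    out[i] <- total
  }
  out
}

count <- c(95, 31, 28, 10, 11, 9, 12, 1, 1)   # count column of df1
running <- restart_cumsum(count)
running
# [1] 95 31 28 38 11 20 32  1  2
which(running >= 30)   # rows that close a group
# [1] 1 2 4 7
```

Rows 8 and 9 never reach 30 on their own, which is why they have to be attached to the preceding group afterwards.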
Then we can enjoy a dplyr chain:
library(dplyr)
library(tidyr)
df1 %>%
  as.data.frame() %>%                          # as you started with a matrix
  mutate(id = row_number(),                    # needed to restore the row order at the end
         cumcount = cumsum2(count)) %>%        # running count that restarts after each full group
  `[<-`(.$cumcount < 30, "cumcount", NA) %>%   # set running totals below 30 to NA ...
  fill(cumcount, .direction = "up") %>%        # ... so each row takes its group's closing total
  fill(cumcount, .direction = "down") %>%      # trailing NAs belong to the last group, so fill down too
  group_by(cumcount) %>%                       # these are the new groups used to aggregate
  summarize(id = min(id),
            freq = sum(freq * count) / sum(count),
            mean = sum(mean * count) / sum(count),
            count = sum(count)) %>%            # computed last so the weights above use the original counts
  arrange(id) %>%                              # sort
  select(freq, mean, count)                    # lay out as in the expected output
# # A tibble: 4 x 3
#       freq      mean count
#      <dbl>     <dbl> <dbl>
# 1  5.00000 0.9000000    95
# 2 12.00000 0.8000000    31
# 3 16.26316 0.7736842    38
# 4 45.17647 0.6117647    34
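One caveat of grouping by the closing cumulative total: two separate bins that happen to reach the same total would be merged into one group. A possible way around this, sketched below under stated assumptions, is to assign explicit bin ids and group by those. `bin_id` and `min_sum` are hypothetical names that are not part of dplyr, and the data frame is rebuilt directly rather than converted from the matrix.

```r
library(dplyr)

# Hypothetical helper: consecutive bin ids such that each bin sums to at
# least min_sum; a short remainder at the end joins the last full bin.
bin_id <- function(x, min_sum = 30) {
  id <- integer(length(x))
  current <- 1L
  total <- 0
  for (i in seq_along(x)) {
    id[i] <- current
    total <- total + x[i]
    if (total >= min_sum) {   # bin is full, the next row opens a new one
      current <- current + 1L
      total <- 0
    }
  }
  # rows still carrying `current` belong to a bin that never filled up
  if (current > 1L && any(id == current)) id[id == current] <- current - 1L
  id
}

df1 <- data.frame(
  freq  = c(5, 12, 16, 17, 23, 55, 56, 57, 59),
  mean  = c(0.9, 0.8, 0.8, 0.7, 0.8, 0.6, 0.5, 0.2, 0.4),
  count = c(95, 31, 28, 10, 11, 9, 12, 1, 1)
)

res <- df1 %>%
  group_by(bin = bin_id(count)) %>%
  summarise(freq  = weighted.mean(freq, count),
            mean  = weighted.mean(mean, count),
            count = sum(count)) %>%   # last, so the weights use the original counts
  select(-bin)

res$count
# [1] 95 31 38 34
```

Because the ids increase with row order, no extra sorting step is needed, and the count column is the actual sum of each bin, including the two trailing rows.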