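Calculate the mean by inclusive level - an expanding window based on whole groups

Time: 2021-03-20 10:48:34

Tags: r dplyr tidyverse

I have data with some values ('value') and a grouping variable ('level'):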

# Sample data frame
df <- data.frame(value = c(1,0,1,0,1,1,1,1,0,0),
                 level = c("A2","A3","A1","A2","A1","A3","A2","A1","A3","A2"))

    value level
1      1     A2
2      0     A3
3      1     A1
4      0     A2
5      1     A1
6      1     A3
7      1     A2
8      1     A1
9      0     A3
10     0     A2

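Conceptually, I assume "A1" to be the lowest level and "A3" the highest. I want to calculate the mean by including more and more groups in the calculation, like an "expanding window", but based on whole groups rather than individual rows. So for each group, the calculation should include the values of the focal group and the values of all lower-level groups.

I want the inclusive level for "A2" and "A3". That means that when the focal level is "A2", the calculation includes the values in group "A2" and the values in the lower level "A1". When calculating the mean for "A3", the calculation includes all values in the focal level "A3" and in all lower levels, i.e. groups "A2" and "A1".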

My goal is an output like this:

  MeanValue  level
1      1.00     A1 # mean of level A1
2      0.71     A2 # mean of level A2 and A1
3      0.60     A3 # mean of level A3, A2, and A1
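
To make the target concrete, the three numbers can be reproduced by hand in base R. This is only a check of the expected output with the levels hard-coded, not a general solution; the names m_a1, m_a2 and m_a3 are purely illustrative:

```r
# Sample data from above
df <- data.frame(value = c(1,0,1,0,1,1,1,1,0,0),
                 level = c("A2","A3","A1","A2","A1","A3","A2","A1","A3","A2"),
                 stringsAsFactors = FALSE)

# A1 only: three values, all 1
m_a1 <- mean(df$value[df$level == "A1"])

# A1 and A2 together: 5 ones out of 7 values
m_a2 <- mean(df$value[df$level %in% c("A1", "A2")])

# A1, A2 and A3 together: 6 ones out of all 10 values
m_a3 <- mean(df$value)

round(c(A1 = m_a1, A2 = m_a2, A3 = m_a3), 2)
#   A1   A2   A3
# 1.00 0.71 0.60
```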

Can someone show me how to code this? I would really appreciate any help!

3 answers:

Answer 0 (score: 4)

One option using dplyr and purrr could be:

library(dplyr)
library(purrr)

df %>%
 arrange(level) %>%
 mutate(mean_value = map_dbl(.x = seq_along(level), ~ mean(value[1:.x]))) %>%
 group_by(level) %>%
 summarise(mean_value = last(mean_value))

  level mean_value
  <fct>      <dbl>
1 A1         1    
2 A2         0.714
3 A3         0.6  

Answer 1 (score: 2)

Here are two data.table alternatives. As above, a cumulative mean is calculated, but here with base and data.table functions.

library(data.table)

setDT(df)[order(level), .(level = unique(level),
                          mean_value = (cumsum(value) / .I)[!duplicated(level, fromLast = TRUE)])]

setDT(df)[order(level), .(level = unique(level),
                          mean_value = (cumsum(value) / .I)[rev(rowid(rev(level))) == 1])]

#    level mean_value
# 1:    A1  1.0000000
# 2:    A2  0.7142857
# 3:    A3  0.6000000
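
The cumulative-mean idea behind both calls can be unpacked step by step. The following is a minimal base R sketch of the same logic, not the data.table code itself; `dat`, `cum_mean`, `is_last` and `res` are illustrative names, and it assumes the levels sort into the intended order (here alphabetically):

```r
# Illustrative copy of the sample data (plain data.frame, character levels)
dat <- data.frame(value = c(1,0,1,0,1,1,1,1,0,0),
                  level = c("A2","A3","A1","A2","A1","A3","A2","A1","A3","A2"),
                  stringsAsFactors = FALSE)

# 1. Sort rows so that lower levels come first
dat <- dat[order(dat$level), ]

# 2. Running mean over the sorted values: cumulative sum / row position
cum_mean <- cumsum(dat$value) / seq_along(dat$value)

# 3. The last row of each level holds the mean of that level
#    plus all lower levels
is_last <- !duplicated(dat$level, fromLast = TRUE)

res <- data.frame(level = dat$level[is_last],
                  mean_value = cum_mean[is_last])
res
```

Because the running mean at the last row of a level has already absorbed every row of the lower levels, no per-group subsetting is needed, which is why this approach scales well.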

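Although speed was not an issue for the OP, it is worth noting for future visitors that even on a modest data set (200 groups of 200 rows each), these alternatives are considerably faster than the tidyverse code.

Prepare data for the benchmark, 200 groups with 200 rows each: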

ng = 2e2
n = 2e2

set.seed(1)
df = data.frame(value = rbinom(ng * n, 1, 0.5),
                level = rep(seq(ng), each = n))

df3 = df
df4 = df

Benchmark:

microbenchmark(
  f1 = {
    d1 = df %>%
      arrange(level) %>%
      mutate(mean_value = map_dbl(.x = seq_along(level), ~ mean(value[1:.x]))) %>%
      group_by(level) %>%
      summarise(mean_value = last(mean_value))
  },
  
  f2 = {
    d2 = df %>%
      arrange(level) %>%
      summarise(mean_value = cummean(value)[!duplicated(level, fromLast = TRUE)], 
                level = unique(level)) %>%
      select(level, mean_value)
  },
  
  f3 = {
    d3 = setDT(df3)[order(level), .(level = unique(level), mean_value = (cumsum(value) / .I)[!duplicated(level, fromLast = TRUE)])]
  },
  
  f4 = {
    d4 = setDT(df4)[order(level), data.table(level = level, mean_value = cumsum(value) / .I)[rev(rowid(rev(level))) == 1]]
  },
  times = 10L)

Results:

# Unit: milliseconds
# expr         min          lq        mean      median          uq         max neval
#   f1 2834.601301 2869.658601 3052.233971 3030.448801 3191.671301 3303.532801    10
#   f2  127.474801  130.049101  138.202461  135.293550  141.301601  173.230901    10
#   f3    2.929702    3.401802    3.706061    3.450152    3.838801    5.585200    10
#   f4    3.888700    3.984301    4.372161    4.166951    4.844901    5.198001    10

all.equal(as.data.frame(d1), d2)
all.equal(as.data.frame(d1), as.data.frame(d3))
all.equal(as.data.frame(d1), as.data.frame(d4))
# TRUE

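Another approach is a non-equi join. In this particular case it is a bit slower and cannot compete with the cumulative-mean trick, but it is more general and can handle more complex functions.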

setDT(df)[ , level := factor(level)]
df[.(level = sort(unique(df$level))), on = .(level <= level),
   .(mean_value = mean(value)), by = .EACHI]
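
The generality comes from the fact that each focal level sees the full set of values at or below it, so any summary function can be applied, not only one with a cumulative shortcut. As a rough illustration of that idea, here is a base R sketch that mimics the "level <= focal level" condition with plain subsetting rather than a data.table join. The helper `inclusive_stat()` and the object `dat` are hypothetical names, and the comparison relies on the character levels sorting in the intended order:

```r
# Illustrative copy of the sample data (plain data.frame, character levels)
dat <- data.frame(value = c(1,0,1,0,1,1,1,1,0,0),
                  level = c("A2","A3","A1","A2","A1","A3","A2","A1","A3","A2"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: apply f to the values of each focal level
# together with the values of all lower levels
inclusive_stat <- function(data, f) {
  lv <- sort(unique(data$level))
  sapply(lv, function(l) f(data$value[data$level <= l]))
}

inclusive_stat(dat, mean)    # same numbers as the cumulative mean
inclusive_stat(dat, median)  # a statistic without a simple cumulative form
```

This does one subset per level, so it is the slower but more flexible route, in the same spirit as the trade-off described for the join.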

Answer 2 (score: 1)

One option is to use cummean:

library(dplyr)
df %>%
   arrange(level) %>% 
   summarise(mean_value = cummean(value)[!duplicated(level, 
       fromLast = TRUE)], level = unique(level)) %>% 
   select(level, mean_value)
#  level mean_value
#1    A1  1.0000000
#2    A2  0.7142857
#3    A3  0.6000000