I have data with some values ('value') and a grouping variable ('level'):
# Sample data frame
df <- data.frame(value = c(1,0,1,0,1,1,1,1,0,0),
level = c("A2","A3","A1","A2","A1","A3","A2","A1","A3","A2"))
value level
1 1 A2
2 0 A3
3 1 A1
4 0 A2
5 1 A1
6 1 A3
7 1 A2
8 1 A1
9 0 A3
10 0 A2
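Conceptually, I assume that "A1" is the lowest level and "A3" is the highest. I want to calculate means by including more and more groups in the calculation, like an "expanding window", but based on entire groups rather than individual rows. That is, for each group, the calculation should include the values of the focal group and the values of all lower-level groups.
To illustrate with levels "A2" and "A3": when the focal level is "A2", the calculation includes the values in group "A2" and those in the lower level "A1". When calculating the mean for "A3", the calculation includes all values in the focal level "A3" plus all lower levels, i.e. the values in groups "A2" and "A1".
My goal is output like this: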
MeanValue level
1 1.00 A1 # mean of level A1
2 0.71 A2 # mean of level A2 and A1
3 0.60 A3 # mean of level A3, A2, and A1
Can someone show me how to code this? I would really appreciate any help!
Answer 0 (score: 4)
One option using dplyr and purrr could be:
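library(dplyr)
library(purrr)

df %>%
  arrange(level) %>%
  # running mean over the sorted rows: mean of rows 1..i for each row i
  mutate(mean_value = map_dbl(.x = seq_along(level), ~ mean(value[1:.x]))) %>%
  group_by(level) %>%
  # the running mean at the last row of a level covers that level and all lower ones
  summarise(mean_value = last(mean_value))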
level mean_value
<fct> <dbl>
1 A1 1
2 A2 0.714
3 A3 0.6
Answer 1 (score: 2)
Here are two data.table alternatives. As in the answer above, a cumulative mean is computed, but here using base and data.table functions.
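library(data.table)
# Alternative 1: pick the last row of each level with duplicated()
setDT(df)[order(level), .(level = unique(level),
  mean_value = (cumsum(value) / .I)[!duplicated(level, fromLast = TRUE)])]

# Alternative 2: pick the last row of each level with rowid()
setDT(df)[order(level), .(level = unique(level),
  mean_value = (cumsum(value) / .I)[rev(rowid(rev(level))) == 1])]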
# level mean_value
# 1: A1 1.0000000
# 2: A2 0.7142857
# 3: A3 0.6000000
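To make the cumulative-mean trick easier to follow, here is a minimal base R sketch on the question's sample data. The intermediate names (`ord`, `running_mean`, `prefix_mean`, `is_last`, `res`) are made up for illustration and do not appear in the answers. It assumes that sorting `level` alphabetically gives the intended low-to-high order.

```r
# Sample data from the question
df <- data.frame(value = c(1,0,1,0,1,1,1,1,0,0),
                 level = c("A2","A3","A1","A2","A1","A3","A2","A1","A3","A2"),
                 stringsAsFactors = FALSE)

# Sort rows so that lower levels come first
# (assumption: alphabetical order of `level` is the intended order)
ord <- df[order(df$level), ]

# Running mean over the sorted rows: cumulative sum divided by row position
running_mean <- cumsum(ord$value) / seq_along(ord$value)

# The same quantity computed one prefix at a time, as the purrr version does
prefix_mean <- sapply(seq_along(ord$value), function(i) mean(ord$value[1:i]))
stopifnot(isTRUE(all.equal(running_mean, prefix_mean)))

# The running mean at the last row of a level covers that level and all lower ones
is_last <- !duplicated(ord$level, fromLast = TRUE)
res <- data.frame(level = ord$level[is_last], mean_value = running_mean[is_last])
res
```

The prefix-by-prefix version recomputes a mean over rows 1..i for every row, so its work grows roughly quadratically with the number of rows, whereas `cumsum()` makes a single pass.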
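Although speed was not an issue for the OP, it is worth noting for future visitors that even on a modest data set (200 groups with 200 rows each), these alternatives are considerably faster than the tidyverse code.
Prepare data for the benchmark, 200 groups with 200 rows each: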
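ng = 2e2  # number of groups
n = 2e2   # rows per group
set.seed(1)
df = data.frame(value = rbinom(ng * n, 1, 0.5),
                level = rep(seq(ng), each = n))
# separate copies, because setDT() converts its argument by reference
df3 = df
df4 = df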
Benchmark:
library(microbenchmark)

microbenchmark(
  f1 = {
    d1 = df %>%
      arrange(level) %>%
      mutate(mean_value = map_dbl(.x = seq_along(level), ~ mean(value[1:.x]))) %>%
      group_by(level) %>%
      summarise(mean_value = last(mean_value))
  },
  f2 = {
    d2 = df %>%
      arrange(level) %>%
      summarise(mean_value = cummean(value)[!duplicated(level, fromLast = TRUE)],
                level = unique(level)) %>%
      select(level, mean_value)
  },
  f3 = {
    d3 = setDT(df3)[order(level), .(level = unique(level), mean_value = (cumsum(value) / .I)[!duplicated(level, fromLast = TRUE)])]
  },
  f4 = {
    d4 = setDT(df4)[order(level), data.table(level = level, mean_value = cumsum(value) / .I)[rev(rowid(rev(level))) == 1]]
  },
  times = 10L)
Results:
# Unit: milliseconds
# expr min lq mean median uq max neval
# f1 2834.601301 2869.658601 3052.233971 3030.448801 3191.671301 3303.532801 10
# f2 127.474801 130.049101 138.202461 135.293550 141.301601 173.230901 10
# f3 2.929702 3.401802 3.706061 3.450152 3.838801 5.585200 10
# f4 3.888700 3.984301 4.372161 4.166951 4.844901 5.198001 10
all.equal(as.data.frame(d1), d2)
all.equal(as.data.frame(d1), as.data.frame(d3))
all.equal(as.data.frame(d1), as.data.frame(d4))
# TRUE
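Another approach is a non-equi join. In this particular case it is somewhat slower and cannot compete with the cumulative-mean trick, but it is more general and can handle more complex functions.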
setDT(df)[ , level := factor(level)]
df[.(level = sort(unique(df$level))), on = .(level <= level),
.(mean_value = mean(value)), by = .EACHI]
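The generality mentioned above (any summary function over "all rows at or below the focal level") can also be sketched in plain base R. This is not the non-equi join itself, only an illustration of the same idea. The helper name `expanding_group_stat` and its arguments are hypothetical, and it assumes that sorting the distinct levels gives the intended low-to-high order.

```r
# Hypothetical helper (not from any package): for each focal level, apply `fun`
# to the values of that level and of every lower level.
expanding_group_stat <- function(value, level, fun = mean) {
  lv <- sort(unique(level))   # distinct levels, lowest first (assumed ordering)
  rk <- match(level, lv)      # integer rank of each row's level
  stat <- vapply(seq_along(lv),
                 function(k) fun(value[rk <= k]),  # rows at or below focal level k
                 numeric(1))
  data.frame(level = lv, stat = stat)
}

# Sample data from the question
value <- c(1,0,1,0,1,1,1,1,0,0)
level <- c("A2","A3","A1","A2","A1","A3","A2","A1","A3","A2")

res_mean <- expanding_group_stat(value, level, mean)    # same means as above
res_med  <- expanding_group_stat(value, level, median)  # any other summary works too
res_mean
```

Unlike the cumulative-mean trick, this rescans the data once per level, so it trades speed for the freedom to plug in functions that have no cumulative form.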
Answer 2 (score: 1)
One option is to use cummean:
library(dplyr)
df %>%
arrange(level) %>%
summarise(mean_value = cummean(value)[!duplicated(level,
fromLast = TRUE)], level = unique(level)) %>%
select(level, mean_value)
# level mean_value
#1 A1 1.0000000
#2 A2 0.7142857
#3 A3 0.6000000