I am trying to compute summary statistics (here, the sum) over the N most recent values.
Starting data:
library(data.table)
dt = data.table(id = c('a','a','a','a','b','b','b','b'),
                week = c(1,2,3,4,1,2,3,4),
                value = c(2, 3, 1, 0, 5, 7, 3, 2))
Desired result:
dt = data.table(id = c('a','a','a','a','b','b','b','b'),
                week = c(1,2,3,4,1,2,3,4),
                value = c(2, 3, 1, 0, 5, 7, 3, 2),
                sum_recent2week = c(NA, NA, 5, 4, NA, NA, 12, 10),
                max_recent2week = c(NA, NA, 3, 3, NA, NA, 7, 7))
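For this data, I want each row to carry, within its id, the sum and the max of the 2 (N = 2) most recent values preceding that row. The 4th (sum_recent2week) and 5th (max_recent2week) columns are the ones I want.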
Answer 0 (score: 4)
You can use rollsum and rollmax from the zoo package (load it with library(zoo)).
dt[, `:=`(sum_recent2week =
            shift(rollsum(value, 2, align = 'left', fill = NA), 2),
          max_recent2week =
            shift(rollmax(value, 2, align = 'left', fill = NA), 2))
   , id]
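To see why a left-aligned window combined with a shift of 2 picks up the two previous values, it may help to look at the intermediate vectors for a single id. This is only a small sketch, assuming the zoo and data.table packages are installed; the variable names win and res are illustrative and not part of the answer.

```r
library(zoo)         # rollsum
library(data.table)  # shift

value <- c(2, 3, 1, 0)   # values of id 'a' from the question

# Left-aligned window of width 2: position i holds value[i] + value[i + 1]
win <- rollsum(value, 2, align = "left", fill = NA)
print(win)   # 5 4 1 NA

# Shifting by 2 moves each window sum to the row just after the window ends,
# so row 3 receives the sum of rows 1 and 2
res <- shift(win, 2)
print(res)   # NA NA 5 4
```

A right-aligned window shifted by 1 should give the same result; the left-aligned form simply follows the code above.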
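For the sum, if you are using data.table version >= 1.12, you can use data.table::frollmean instead and multiply the rolling mean by the window size (2) to recover the sum. frollmean defaults to fill = NA, so there is no need to specify it here.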
dt[, `:=`(sum_recent2week =
            shift(frollmean(value, 2, align = 'left')*2, 2),
          max_recent2week =
            shift(rollmax(value, 2, align = 'left', fill = NA), 2))
   , id]
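If the zoo dependency is unwanted, the same two columns can probably be produced with data.table alone. The sketch below is an assumption on my part rather than part of the answer: it presumes an installed data.table version that ships frollsum and frollapply (these arrived in releases after frollmean, so check your version), and it relies on their default right-aligned window, which is why a shift of 1 is enough.

```r
library(data.table)

dt <- data.table(id = c('a','a','a','a','b','b','b','b'),
                 week = c(1,2,3,4,1,2,3,4),
                 value = c(2, 3, 1, 0, 5, 7, 3, 2))

# Right-aligned window of width 2 ends at the current row;
# shifting by 1 makes it end at the previous row instead.
dt[, `:=`(sum_recent2week = shift(frollsum(value, 2), 1),
          max_recent2week = shift(frollapply(value, 2, max), 1)),
   by = id]
print(dt)
```

frollapply accepts an arbitrary function, so other summaries (min, median, and so on) could be swapped in the same way, at some cost in speed compared with the specialised froll functions.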
Answer 1 (score: 1)
I'm sure this can be done more elegantly, but here is one tidyverse possibility:
library(tidyverse)

dt %>%
  group_by(id) %>%
  # sum and max of the two values preceding each row
  mutate(sum_recent2week = lag(value + lead(value), n = 2),
         max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1))) %>%
  rowid_to_column() %>%
  select(-week, -value) %>%
  # keep the last two rows of each group
  top_n(-2) %>%
  # join back to the original data
  right_join(dt %>%
               rowid_to_column(), by = c("rowid" = "rowid",
                                         "id" = "id")) %>%
  select(-rowid)
  id    sum_recent2week max_recent2week  week value
  <chr>           <dbl>           <dbl> <dbl> <dbl>
1 a                  NA              NA    1.    2.
2 a                  NA              NA    2.    3.
3 a                  5.              3.    3.    1.
4 a                  4.              3.    4.    0.
5 b                  NA              NA    1.    5.
6 b                  NA              NA    2.    7.
7 b                 12.              7.    3.    3.
8 b                 10.              7.    4.    2.
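First, it computes sum_recent2week and max_recent2week within each group. Second, it selects the last two rows of each group. Finally, it joins the result back to the original data.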
Or, if you want to compute it for all rows rather than only the last two rows of each group:
dt %>%
  group_by(id) %>%
  mutate(sum_recent2week = lag(value + lead(value), n = 2),
         max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1)))
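The lag/lead arithmetic above is written specifically for N = 2. For other window sizes, one option is a small helper that applies a function to the n values preceding each position. The sketch below uses base R only; recent_stat is a hypothetical name introduced here for illustration (it does not come from any package), and it assumes the rows are already ordered by week within each id.

```r
# Hypothetical helper: apply f to the n values preceding each position of x.
# Positions with fewer than n preceding values get NA.
recent_stat <- function(x, n, f) {
  vapply(seq_along(x), function(i) {
    if (i <= n) return(NA_real_)
    as.numeric(f(x[(i - n):(i - 1)]))
  }, numeric(1))
}

df <- data.frame(id = c('a','a','a','a','b','b','b','b'),
                 week = c(1,2,3,4,1,2,3,4),
                 value = c(2, 3, 1, 0, 5, 7, 3, 2))

# ave() applies the function separately within each id
df$sum_recent2week <- ave(df$value, df$id,
                          FUN = function(v) recent_stat(v, 2, sum))
df$max_recent2week <- ave(df$value, df$id,
                          FUN = function(v) recent_stat(v, 2, max))
print(df)
```

The same helper could be called inside mutate() after group_by(id) in place of the lag/lead expressions. It loops over rows in R, so for large data the rolling functions from the other answer are likely to be faster.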