Summary of the N most recent values

Time: 2019-01-17 20:22:34

Tags: r dplyr tibble

I am trying to compute summary statistics (here, the sum and the maximum) over the N most recent values.

Starting data:

library(data.table)

dt = data.table(id = c('a','a','a','a','b','b','b','b'),
                week = c(1,2,3,4,1,2,3,4),
                value = c(2, 3, 1, 0, 5, 7, 3, 2))

Desired result:

dt = data.table(id = c('a','a','a','a','b','b','b','b'),
                    week = c(1,2,3,4,1,2,3,4),
                    value = c(2, 3, 1, 0, 5, 7,3,2),
                    sum_recent2week = c(NA, NA, 5, 4, NA, NA, 12, 10),
                    max_recent2week = c(NA, NA, 3, 3, NA, NA, 7, 7))

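For each row, grouped by id, I want the sum and the maximum of the 2 (N = 2) most recent values, that is, the values from the two preceding rows, not counting the current one. Rows with fewer than two preceding values should be NA. The 4th (sum_recent2week) and 5th (max_recent2week) columns are the ones I want.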

2 Answers:

Answer 0: (score: 4)

You can use rollsum and rollmax from the zoo package:

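library(zoo)  # provides rollsum() and rollmax()

# Each left-aligned window covers the current row and the next one; shifting
# the result down by 2 places it on the row just after the window ends.
dt[, `:=`(sum_recent2week = 
            shift(rollsum(value, 2, align = 'left', fill = NA), 2),
          max_recent2week = 
            shift(rollmax(value, 2, align = 'left', fill = NA), 2))
   , id]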

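For the sum, if you are using data.table version >= 1.12, you can use data.table::frollmean. The default for frollmean is fill = NA, so there is no need to specify it here. Because frollmean returns the window mean, multiplying it by the window width (2) recovers the sum.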

dt[, `:=`(sum_recent2week = 
            shift(frollmean(value, 2, align = 'left')*2, 2),
          max_recent2week = 
            shift(rollmax(value, 2, align = 'left', fill = NA), 2))
   , id]
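
As a rough sketch of how the same idea could extend to an arbitrary window size: the version below uses a right-aligned window shifted by one row, which should give the same result as the left-aligned window shifted by 2 above. It assumes a data.table release recent enough to provide frollsum and frollapply (both were added after frollmean). The names dt_demo, N, sum_recentN and max_recentN are made up for illustration.

```r
library(data.table)

# Same toy data as in the question, under a separate (hypothetical) name.
dt_demo <- data.table(id = rep(c("a", "b"), each = 4),
                      week = rep(1:4, times = 2),
                      value = c(2, 3, 1, 0, 5, 7, 3, 2))

N <- 2  # window size; an illustrative parameter, not part of the original answer

# frollsum()/frollapply() default to a right-aligned window (current row plus
# the N - 1 rows before it). shift(..., 1) then moves each result down one row
# so that the current row's own value is excluded from its window.
dt_demo[, `:=`(sum_recentN = shift(frollsum(value, N), 1),
               max_recentN = shift(frollapply(value, N, max), 1)),
        by = id]

print(dt_demo)
```

With N = 2 this should reproduce the sum_recent2week and max_recent2week columns from the question; changing N only requires editing one line.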

Answer 1: (score: 1)

I'm sure this can be done more elegantly, but here is one tidyverse possibility:

library(dplyr)   # group_by, mutate, lag, lead, top_n, right_join
library(tibble)  # rowid_to_column

dt %>%
 group_by(id) %>%
 mutate(sum_recent2week = lag(value + lead(value), n = 2),
        max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1))) %>%
 rowid_to_column() %>%
 select(-week, -value) %>%
 top_n(-2) %>%
 right_join(dt %>%
            rowid_to_column(), by = c("rowid" = "rowid",
                                      "id" = "id")) %>%
 select(-rowid)

  id    sum_recent2week max_recent2week  week value
  <chr>           <dbl>           <dbl> <dbl> <dbl>
1 a                 NA              NA     1.    2.
2 a                 NA              NA     2.    3.
3 a                  5.              3.    3.    1.
4 a                  4.              3.    4.    0.
5 b                 NA              NA     1.    5.
6 b                 NA              NA     2.    7.
7 b                 12.              7.    3.    3.
8 b                 10.              7.    4.    2.

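First, it computes "sum_recent2week" and "max_recent2week" within each group. Second, it keeps two rows per group with top_n(-2), which for this data are the last two rows of each group. Finally, it joins the result back to the original data.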

Or, if you want to compute it for all rows rather than only the last two rows of each group:

dt %>%
 group_by(id) %>%
 mutate(sum_recent2week = lag(value + lead(value), n = 2),
        max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1)))
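
The lag/lead expressions above are written out by hand for N = 2. One way this might be generalized, sketched here under the assumption that the zoo package is installed, is to lag the values by one row and then apply a right-aligned rolling window with zoo::rollapplyr. The names df, N, res, sum_recentN and max_recentN are hypothetical and not from the original answer.

```r
library(dplyr)
library(zoo)

# Same toy data as in the question, as a tibble under an illustrative name.
df <- tibble(id = rep(c("a", "b"), each = 4),
             week = rep(1:4, times = 2),
             value = c(2, 3, 1, 0, 5, 7, 3, 2))

N <- 2  # window size (illustrative parameter)

res <- df %>%
  group_by(id) %>%
  # dplyr::lag() drops the current row from the window; rollapplyr() then
  # summarises the N lagged values ending at each row. Windows that contain
  # the leading NA from lag() come out as NA because sum/max propagate NA.
  mutate(sum_recentN = rollapplyr(dplyr::lag(value), N, sum, fill = NA),
         max_recentN = rollapplyr(dplyr::lag(value), N, max, fill = NA)) %>%
  ungroup()

print(res)
```

This keeps every row, so no top_n or join step is needed; it does assume each group has at least N rows.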