Question

I am querying a database that has the following structure:

df <- data.frame(id = c(1, 2, 2, 1, 2),
             type = c("A", "B1", "B2", "A", "B1"),
             date = as.POSIXct(c("2018-07-23 6:00", "2018-07-23 6:12", 
                                 "2018-07-23 6:25", "2018-07-23 10:00", 
                                 "2018-07-23 10:30")),
             value = c(5, 2, 3, 7, 4))

  id type                date value
1  1    A 2018-07-23 06:00:00     5
2  2   B1 2018-07-23 06:12:00     2
3  2   B2 2018-07-23 06:25:00     3
4  1    A 2018-07-23 10:00:00     7
5  2   B1 2018-07-23 10:30:00     4

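The date variable indicates when a change was made to the database. My goal is to reshape the data frame so that, for every row (that is, every new entry), I can identify the current sum of the value entries. When computing the sum, the value of each category in type must be replaced whenever a new entry for that category appears.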

This is the expected output:

  id type                date value combined_value
1  1    A 2018-07-23 06:00:00     5              5
2  2   B1 2018-07-23 06:12:00     2              7
3  2   B2 2018-07-23 06:25:00     3             10
4  1    A 2018-07-23 10:00:00     7             12
5  2   B1 2018-07-23 10:30:00     4             14

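In the first three rows, combined_value is simply the running sum of value. In the fourth row, the value for type == "A" changes from 5 to 7 and must be replaced, while the values for type == "B1" and type == "B2" stay the same. type == "B1" changes in the fifth row, so its old value must be replaced when computing the sum in combined_value.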
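To make the replacement rule concrete, here is a minimal sketch that recomputes the sum for each row from the latest value of every type seen so far. The helper name combined_at is made up for illustration, and the sketch ignores id and assumes the rows are already in chronological order.

```r
type  <- c("A", "B1", "B2", "A", "B1")
value <- c(5, 2, 3, 7, 4)

# Hypothetical helper: sum of the most recent value per type in rows 1..i
combined_at <- function(i) {
  seen <- seq_len(i)
  latest <- tapply(value[seen], type[seen], function(v) v[length(v)])
  sum(latest)
}

combined <- sapply(seq_along(value), combined_at)
combined
# [1]  5  7 10 12 14
```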

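So far I have managed to calculate the sum at predefined points in time using a combination of group_by(), filter() and summarise(). However, I would like a single data frame that tracks all changes made over the past year, stores every entry as a row, and includes the corresponding current sum across the categories in type.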
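For context, a per-time-point calculation of that kind might look roughly like the sketch below. It assumes dplyr is installed, and the cutoff value is a hypothetical time point chosen only for illustration. It is one possible shape of such a pipeline, not necessarily the exact code I used.

```r
library(dplyr)

df <- data.frame(
  id    = c(1, 2, 2, 1, 2),
  type  = c("A", "B1", "B2", "A", "B1"),
  date  = as.POSIXct(c("2018-07-23 06:00", "2018-07-23 06:12",
                       "2018-07-23 06:25", "2018-07-23 10:00",
                       "2018-07-23 10:30")),
  value = c(5, 2, 3, 7, 4)
)

# Hypothetical time point at which the sum is evaluated
cutoff <- as.POSIXct("2018-07-23 10:00")

sum_at_cutoff <- df %>%
  filter(date <= cutoff) %>%        # entries made up to the cutoff
  group_by(type) %>%
  filter(date == max(date)) %>%     # latest entry per type
  ungroup() %>%
  summarise(combined_value = sum(value))

sum_at_cutoff$combined_value
# [1] 12
```

This gives one total per chosen time point, which is why it does not scale to tracking every change as its own row.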

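Edit: The solution provided by @jaySf works for the example data given. However, my actual data set contains a large number of groups for which the current sum must be computed. Here is an updated data frame reflecting that structure, where id is the group index: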

df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
             type = c("A", "B1", "B2", "A", "B1", 
                      "A", "A", "B2", "B3", "A"),
             date = as.POSIXct(c("2018-07-23 6:00", "2018-07-23 6:12", 
                                 "2018-07-23 6:25", "2018-07-23 10:00", 
                                 "2018-07-23 10:30")),
             value = c(5, 2, 3, 7, 4, 3, 5, 1, 2, 7))

Accordingly, the expected output is:

   id type                date value combined_value
1   1    A 2018-07-23 06:00:00     5              5
2   1   B1 2018-07-23 06:12:00     2              7
3   1   B2 2018-07-23 06:25:00     3             10
4   1    A 2018-07-23 10:00:00     7             12
5   1   B1 2018-07-23 10:30:00     4             14
6   2    A 2018-07-23 06:00:00     3              3
7   2    A 2018-07-23 06:12:00     5              5
8   2   B2 2018-07-23 06:25:00     1              6
9   2   B3 2018-07-23 10:00:00     2              8
10  2    A 2018-07-23 10:30:00     7             10

I tried using tapply to account for my groups, but could not get the code to work.

Answer 1

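I can offer a base R solution.

For each row, we can sum up the most recent value of every type, based on date, across the rows seen so far. We then apply this to each id group.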

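# For each row y, sum the latest value of every type seen in rows 1..y.
actualizeIDs <- function(df) sapply(
  lapply(seq_len(nrow(df)), 
         function(y) {
           d <- df[1:y, ]  # rows up to and including y
           sapply(unique(d$type), 
                  function(x) {
                    # value of type x at its most recent date
                    d[d$type == x & d$date == max(d$date[d$type == x]), "value"]
                  }
           )
         }
  ), sum)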

actualizeGroups <- function(df) {
  # Column 4 is value; rows sharing id, type and date count as duplicates.
  dup <- duplicated(df[, -4], fromLast = TRUE)
  if (any(dup)) {
    warning("Duplicated measurements, using latest row-number.")
    df <- df[!dup, ]
  }
  df <- df[order(df$id, df$date), ]
  # unlist(lapply()) also works when the id groups differ in size
  df$combined_value <- unlist(lapply(unique(df$id), 
                                     function(x) {
                                       actualizeIDs(df[df$id == x, ])
                                     }))
  return(df)
}

This yields:

> actualizeGroups(df)
   id type                date value combined_value
1   1    A 2018-07-23 06:00:00     5              5
2   1   B1 2018-07-23 06:12:00     2              7
3   1   B2 2018-07-23 06:25:00     3             10
4   1    A 2018-07-23 10:00:00     7             12
5   1   B1 2018-07-23 10:30:00     4             14
6   2    A 2018-07-23 06:00:00     3              3
7   2    A 2018-07-23 06:12:00     5              5
8   2   B2 2018-07-23 06:25:00     1              6
9   2   B3 2018-07-23 10:00:00     2              8
11  2    A 2018-07-23 10:30:00     8             11
Warning message:
In actualizeGroups(df) : Duplicated measurements, using latest row-number.

Data

df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
                 type = c("A", "B1", "B2", "A", "B1", 
                          "A", "A", "B2", "B3", "A", "A"),
                 date = as.POSIXct(c("2018-07-23 6:00", "2018-07-23 6:12", 
                                     "2018-07-23 6:25", "2018-07-23 10:00", 
                                     "2018-07-23 10:30", "2018-07-23 6:00", 
                                     "2018-07-23 6:12", "2018-07-23 6:25", 
                                     "2018-07-23 10:00", "2018-07-23 10:30", 
                                     "2018-07-23 10:30")),
                 value = c(5, 2, 3, 7, 4, 3, 5, 1, 2, 7, 8))
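
If the data set has many groups, re-scanning all earlier rows for every row may become slow. As an alternative sketch (a different technique from the functions above, not a drop-in replacement), the same running total can be computed from differences: each row contributes its value minus the previous value of the same type within the same id, and the cumulative sum of those differences per id gives combined_value. The function name running_type_sum is made up for illustration. The sketch does no deduplication and assumes rows with identical id, type and date should be applied in input order.

```r
# Sketch: running sum of the latest value per type within each id.
running_type_sum <- function(df) {
  df <- df[order(df$id, df$date), ]
  # one key per id/type combination that actually occurs
  key <- interaction(df$id, df$type, drop = TRUE)
  # previous value of the same type in the same id (0 for a first entry)
  prev <- ave(df$value, key, FUN = function(v) c(0, v[-length(v)]))
  # accumulate the changes within each id
  df$combined_value <- ave(df$value - prev, df$id, FUN = cumsum)
  df
}

times <- as.POSIXct(c("2018-07-23 06:00", "2018-07-23 06:12",
                      "2018-07-23 06:25", "2018-07-23 10:00",
                      "2018-07-23 10:30"))
df10 <- data.frame(id = rep(c(1, 2), each = 5),
                   type = c("A", "B1", "B2", "A", "B1",
                            "A", "A", "B2", "B3", "A"),
                   date = rep(times, 2),
                   value = c(5, 2, 3, 7, 4, 3, 5, 1, 2, 7))

res <- running_type_sum(df10)
res$combined_value
# [1]  5  7 10 12 14  3  5  6  8 10
```

On the ten-row example from the question this reproduces the expected combined_value column. With duplicated measurements, both rows are kept and each gets its own running total, which differs from the warning-and-drop behaviour of actualizeGroups().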
