Get the maximum by group with an additional constraint

Time: 2020-06-14 14:41:57

Tags: r tibble

I have a data.frame with 4 variables: day (Date, format: "YYYY-MM-DD"), hour (POSIXct, format: "YYYY-MM-DD hh:mm:ss"), department (chr), and amount (numeric).

          day                hour department amount max_cond
1  2019-08-08 2019-08-08 11:45:00       DPT1      2        3
2  2019-08-08 2019-08-08 12:00:00       DPT1      3        3
3  2019-08-08 2019-08-08 12:15:00       DPT1      3        3
4  2019-08-08 2019-08-08 12:30:00       DPT1      2        2
5  2019-08-08 2019-08-08 12:45:00       DPT1      0        2
6  2019-08-08 2019-08-08 13:00:00       DPT1      0        2
7  2019-08-08 2019-08-08 13:15:00       DPT1      1        2
8  2019-08-08 2019-08-08 13:30:00       DPT1      2        2
9  2019-08-08 2019-08-08 13:45:00       DPT1      1        1
10 2019-08-08 2019-08-08 11:45:00       DPT2      3        3
11 2019-08-08 2019-08-08 12:00:00       DPT2      3        3
12 2019-08-08 2019-08-08 12:15:00       DPT2      3        3
13 2019-08-08 2019-08-08 12:30:00       DPT2      2        3
14 2019-08-08 2019-08-08 12:45:00       DPT2      2        3
15 2019-08-08 2019-08-08 13:00:00       DPT2      3        3
16 2019-08-08 2019-08-08 13:15:00       DPT2      0        0
17 2019-08-08 2019-08-08 13:30:00       DPT2      0        0
18 2019-08-08 2019-08-08 13:45:00       DPT2      0        0

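For each row of the data.frame, I want the maximum of amount within the same department, but only over the hours of the same day that are greater than or equal to that row's hour.

In other words, for each observation [day_i, hour_i, department_i], I want: max(amount | day == day_i & department == department_i & hour >= hour_i).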
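The definition above can be written out literally as a brute-force check. This is only a reference sketch, not one of the posted answers: the sample data is rebuilt from the question's table, and fixing the time zone to UTC is an assumption made for reproducibility.

```r
# Sample data rebuilt from the question's table (tz = "UTC" is an assumption)
hours <- seq(as.POSIXct("2019-08-08 11:45:00", tz = "UTC"),
             by = "15 min", length.out = 9)
df <- data.frame(
  day        = as.Date("2019-08-08"),
  hour       = rep(hours, times = 2),
  department = rep(c("DPT1", "DPT2"), each = 9),
  amount     = c(2, 3, 3, 2, 0, 0, 1, 2, 1, 3, 3, 3, 2, 2, 3, 0, 0, 0)
)

# For row i, take the max of amount over the rows that share its day and
# department and whose hour is at or after hour_i.
# This is O(n^2), so it is meant as a correctness check, not for large data.
df$max_cond <- vapply(seq_len(nrow(df)), function(i) {
  keep <- df$day == df$day[i] &
    df$department == df$department[i] &
    df$hour >= df$hour[i]
  max(df$amount[keep])
}, numeric(1))

df$max_cond
```

Because it follows the condition term by term, a sketch like this is handy for validating faster approaches on a small sample.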

For the example above, the expected result is the max_cond column shown in the table.


2 answers:

Answer 0: (score: 2)

Very similar, but you can use data.table:

library(data.table)

df <- structure(list(
  day = structure(c(18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116), class = "Date"), 
  hour = structure(c(1565275500, 1565276400, 1565277300, 1565278200, 1565279100, 1565280000, 1565280900, 1565281800, 1565282700, 1565275500, 1565276400, 1565277300, 1565278200, 1565279100, 1565280000, 1565280900, 1565281800, 1565282700), class = c("POSIXct", "POSIXt"), tzone = ""), 
  department = c("DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2"), 
  amount = c(2, 3, 3, 2, 0, 0, 1, 2, 1, 3, 3, 3, 2, 2, 3, 0, 0, 0), max_cond = c(3, 3, 3, 2, 2, 2, 2, 2, 1, 3, 3, 3, 3, 3, 3, 0, 0, 0)), row.names = c(NA, -18L), class = "data.frame")

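dt <- data.table(df)
# Sort from the latest hour to the earliest so that cummax() runs backwards in time
setorder(dt, -hour)
# Running maximum of amount within each day/department group
dt[, max_cond_new := cummax(amount), by = .(day, department)]
# Restore the original ordering
setorder(dt, department, hour)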

Hope this helps!

Answer 1: (score: 0)

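A base R approach: you can use cummax() (the cumulative maximum) to solve this. Note that I assume your data frame is sorted by hour, which is the case in your example.

The idea: first split() the data frame into components with distinct day and department. Then, within each component:

  • reverse the relevant vector ($amount, taken from the latest hour to the earliest)
  • build the $max_cond variable from it with cummax() (it is still in reversed order)
  • flip the $max_cond variable back into the correct order

Then glue all the components back together with do.call() and rbind().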
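The three per-component steps can be followed on a single vector. As a small illustration, this uses the DPT1 amounts from the question, assumed to be in ascending hour order; the intermediate variable names are made up for the example.

```r
# DPT1 amounts from the question, in ascending hour order
amount <- c(2, 3, 3, 2, 0, 0, 1, 2, 1)

reversed    <- rev(amount)       # step 1: latest hour first
running_max <- cummax(reversed)  # step 2: running maximum, still reversed
max_cond    <- rev(running_max)  # step 3: back to ascending hour order

max_cond
# [1] 3 3 3 2 2 2 2 2 1
```

Each element of the result is the maximum of that element and everything after it, which is exactly the "hour >= hour_i" condition once the rows are ordered by hour.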

Using your example:

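# One data frame per department/day combination
df2 <- split(df, list(df$department, df$day))
df2 <- lapply(df2, function(x) {
  # Running maximum from the latest hour backwards, then flipped back.
  # Assumes the rows of x are already sorted by ascending hour.
  x$max_cond <- rev(cummax(x$amount[order(x$hour, decreasing = TRUE)]))
  x
})

# Bind the pieces back into a single data frame and drop the group row names
df2 <- do.call(rbind, df2)
row.names(df2) <- NULL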

df2
##           day                hour department amount max_cond
## 1  2019-08-08 2019-08-08 10:45:00       DPT1      2        3
## 2  2019-08-08 2019-08-08 11:00:00       DPT1      3        3
## 3  2019-08-08 2019-08-08 11:15:00       DPT1      3        3
## 4  2019-08-08 2019-08-08 11:30:00       DPT1      2        2
## 5  2019-08-08 2019-08-08 11:45:00       DPT1      0        2
## 6  2019-08-08 2019-08-08 12:00:00       DPT1      0        2
## 7  2019-08-08 2019-08-08 12:15:00       DPT1      1        2
## 8  2019-08-08 2019-08-08 12:30:00       DPT1      2        2
## 9  2019-08-08 2019-08-08 12:45:00       DPT1      1        1
## 10 2019-08-08 2019-08-08 10:45:00       DPT2      3        3
## 11 2019-08-08 2019-08-08 11:00:00       DPT2      3        3
## 12 2019-08-08 2019-08-08 11:15:00       DPT2      3        3
## 13 2019-08-08 2019-08-08 11:30:00       DPT2      2        3
## 14 2019-08-08 2019-08-08 11:45:00       DPT2      2        3
## 15 2019-08-08 2019-08-08 12:00:00       DPT2      3        3
## 16 2019-08-08 2019-08-08 12:15:00       DPT2      0        0
## 17 2019-08-08 2019-08-08 12:30:00       DPT2      0        0
## 18 2019-08-08 2019-08-08 12:45:00       DPT2      0        0
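This answer relies on the rows already being sorted by hour. If that cannot be guaranteed, one possible variant writes each value back to its original row position. The helper name suffix_max is hypothetical and not part of either answer. The sketch also assumes hours are unique within each day/department group, and rebuilds the sample data with a UTC time zone.

```r
# Sample data rebuilt from the question's table (tz = "UTC" is an assumption)
hours <- seq(as.POSIXct("2019-08-08 11:45:00", tz = "UTC"),
             by = "15 min", length.out = 9)
df <- data.frame(
  day        = as.Date("2019-08-08"),
  hour       = rep(hours, times = 2),
  department = rep(c("DPT1", "DPT2"), each = 9),
  amount     = c(2, 3, 3, 2, 0, 0, 1, 2, 1, 3, 3, 3, 2, 2, 3, 0, 0, 0)
)
# Deliberately reverse the row order so the data is no longer sorted by hour
df <- df[rev(seq_len(nrow(df))), ]
row.names(df) <- NULL

# Hypothetical helper: maximum of amount over the current and all later hours,
# returned in the original element order (assumes no duplicate hours).
suffix_max <- function(amount, hour) {
  o <- order(hour)                       # positions in ascending hour order
  res <- numeric(length(amount))
  res[o] <- rev(cummax(rev(amount[o])))  # backwards running max, flipped back
  res
}

# Row indices of each department/day group; drop = TRUE skips empty combinations
groups <- split(seq_len(nrow(df)), list(df$department, format(df$day)), drop = TRUE)
df$max_cond <- NA_real_
for (rows in groups) {
  df$max_cond[rows] <- suffix_max(df$amount[rows], df$hour[rows])
}

df
```

Working on row indices rather than on split data frames keeps the original row order, so no final rbind() or re-sorting is needed.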