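Find the difference between the max of a group and the current row with R

Time: 2018-07-28 06:20:19

Tags: r dplyr data.table tidyr

Similar to this question here, I am trying to find the difference between the maximum value of a group and the value of the current row.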

For example, if I have the following dataset:

ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
group <- data.frame(Subject=ID, pt=Value)

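How would I create a new column named "diff" that holds the difference between the current row's value and the maximum value in its group?

Thanks for your help!

3 answers:

Answer 0: (score: 6)

The OP has tried a data.table solution. Here, we benefit from grouping and updating by reference at the same time.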

library(data.table)
setDT(group)[, diff := max(pt) - pt, by = Subject][]
   Subject pt diff
1:       1  2    3
2:       1  3    2
3:       1  5    0
4:       2  2   15
5:       2  5   12
6:       2  8    9
7:       2 17    0
8:       3  3    2
9:       3  5    0

Data

ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
group <- data.frame(Subject=ID, pt=Value)
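To illustrate what "updating by reference" means in the answer above, here is a minimal sketch. The names dt, alias and snapshot are illustrative and not part of the original answer. It relies on data.table's documented behaviour that := modifies the table in place, so a second name bound to the same object sees the new column, while a copy() taken beforehand does not.

```r
library(data.table)

# Same example data as above, built directly as a data.table
dt <- data.table(Subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                 pt      = c(2, 3, 5, 2, 5, 8, 17, 3, 5))

alias    <- dt        # a second name for the same object, no copy is made
snapshot <- copy(dt)  # an independent deep copy

# Grouped update by reference: no assignment with <- is needed
dt[, diff := max(pt) - pt, by = Subject]

names(alias)     # includes "diff", because alias and dt are the same table
names(snapshot)  # still only "Subject" and "pt"
```

This is also why the benchmark below works on a fresh copy of the input in every run.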

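Benchmark

At the time of writing, 5 answers had been posted, including Frank's comment on the efficiency of the data.table approach. So I wondered which of the five solutions is the fastest.

  1. r2evans
  2. mine
  3. Frank
  4. harelhan
  5. JonMinton

Some solutions modify the data.frame in place. In addition, the OP asked for a new column named "diff". To ensure a fair comparison, all results should return group with three columns. Some answers were modified accordingly. harelhan's answer required substantial modification to remove errors.

Because group is modified, every benchmark run starts from a fresh copy of group that has two columns.

The benchmark is parameterized over the number of rows and the share of groups, i.e., the number of groups scales with the problem size.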
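As a small illustration of this parameterization, the sketch below tabulates the parameter combinations used in the benchmark code that follows, together with the resulting n_grp = grp_share * n_row. The name grid is illustrative. Note that n_grp is only an upper bound: the group ids are drawn with replacement, so the number of distinct groups that actually occur can be smaller.

```r
# All combinations of row count and group share used in the benchmark
grid <- expand.grid(n_row     = c(1e2, 1e4, 1e5),
                    grp_share = c(0.01, 0.1, 0.5, 0.9))

# Number of candidate group ids for each combination
grid$n_grp <- grid$grp_share * grid$n_row

grid
```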

library(data.table)
library(dplyr)
library(bench)
bm <- press(
  # n_row = c(1E2, 1E4, 1E5, 1E6),
  n_row = c(1E2, 1E4, 1E5),
  grp_share = c(0.01, 0.1, 0.5, 0.9),
  {
    n_grp <- grp_share * n_row
    set.seed(1)
    group0 <- data.frame(
      Subject = sample(n_grp, n_row, TRUE),
      pt = as.numeric(rpois(n_row, 100)))
    mark(
      r2Evans = {
        group <- copy(group0)
        group <- group %>% 
          group_by(Subject) %>% 
          mutate(diff = max(pt) - pt)
        group
      },
      Uwe = {
        group <- copy(group0)
        setDT(group)[, diff := max(pt) - pt, by = Subject]
        group
      },
      Frank = {
        group <- copy(group0)
        setDT(group)[, mx := max(pt), by=Subject][, diff := mx - pt][, mx := NULL]
        group
      },
      harelhan = {
        group <- copy(group0)
        max_group <- group %>% group_by(Subject) %>% summarize(max_val = max(pt))
        group <- left_join(group, max_group[, c("Subject", "max_val")], by = "Subject")
        group$diff <- group$max_val - group$pt
        group <- group %>% select(-max_val)
        group
      },
      JonMinton = {
        group <- copy(group0)
        group <- group %>% 
          group_by(Subject) %>% 
          mutate(max_group_val = max(pt)) %>% 
          ungroup() %>% 
          mutate(diff = max_group_val - pt) %>% 
          select(-max_group_val)
        group
      }
    )
  }
)
ggplot2::autoplot(bm)

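Before reading anything into timings, it can be worth confirming that the approaches agree on the result. The following is a minimal sketch of such a check on the small example data, comparing one dplyr variant with the data.table variant. The names group0, res_dplyr and res_dt are illustrative, and this check is an addition for illustration, not part of the original benchmark.

```r
library(data.table)
library(dplyr)

group0 <- data.frame(Subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                     pt      = c(2, 3, 5, 2, 5, 8, 17, 3, 5))

# dplyr variant; ungroup() and as.data.frame() drop grouping and tibble class
res_dplyr <- group0 %>%
  group_by(Subject) %>%
  mutate(diff = max(pt) - pt) %>%
  ungroup() %>%
  as.data.frame()

# data.table variant; as.data.table() copies, so group0 itself is unchanged
res_dt <- as.data.table(group0)[, diff := max(pt) - pt, by = Subject]
res_dt <- as.data.frame(res_dt)

# Compare the computed column across both approaches
all.equal(res_dplyr$diff, res_dt$diff)
```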

Answer 1: (score: 1)

Using the example data and breaking the logic into smaller steps:

library(dplyr)
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
group <- data.frame(Subject=ID, pt=Value)
# Per-group maximum; the data.frame columns are Subject and pt, not ID and Value
max_group <- group %>% group_by(Subject) %>% summarize(max_val = max(pt))
# Attach each group's maximum to its rows
group <- left_join(group, max_group[, c("Subject", "max_val")], by = "Subject")
group$diff <- group$max_val - group$pt

Hope this solves the problem.

Answer 2: (score: 1)

Based on harelhan's answer, but using pipes:
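The code for this answer is not preserved here. As a hedged sketch only, a piped variant consistent with the JonMinton entry in the benchmark code above could look like the following. It may differ from what was originally posted, and the helper column name max_group_val is taken from that benchmark entry.

```r
library(dplyr)

group <- data.frame(Subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                    pt      = c(2, 3, 5, 2, 5, 8, 17, 3, 5))

group <- group %>%
  group_by(Subject) %>%                      # one group per Subject
  mutate(max_group_val = max(pt)) %>%        # per-group maximum on every row
  ungroup() %>%
  mutate(diff = max_group_val - pt)          # distance of each row to its group max

group
```

Unlike the join-based version, this keeps everything in one pipeline and leaves the helper column max_group_val in the result unless it is dropped with select(-max_group_val).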