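Similar to this question here, I am trying to find the difference between the maximum value of a group and the current row's value.
For example, if I have the following dataset: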
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
group <- data.frame(Subject=ID, pt=Value)
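How would I create a new column called "diff" that holds the difference between the current row's value and the maximum value in that group?
Thanks for your help!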
Answer 0 (score: 6)
The OP has tried a data.table solution. Here, we benefit from grouping and updating by reference at the same time.
library(data.table)
setDT(group)[, diff := max(pt) - pt, by = Subject][]
   Subject pt diff
1:       1  2    3
2:       1  3    2
3:       1  5    0
4:       2  2   15
5:       2  5   12
6:       2  8    9
7:       2 17    0
8:       3  3    2
9:       3  5    0
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
group <- data.frame(Subject=ID, pt=Value)
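For readers who want to avoid extra packages, the same column can also be computed in base R with ave(), which returns the per-group maximum aligned to each row. This is a minimal sketch using the example data above; it is not part of the original answer and was not included in the benchmark.

```r
# Example data from the question
group <- data.frame(Subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                    pt      = c(2, 3, 5, 2, 5, 8, 17, 3, 5))

# ave() applies FUN within each Subject and returns a vector as long as pt,
# so the group maximum lines up with every row of its group
group$diff <- ave(group$pt, group$Subject, FUN = max) - group$pt

group$diff
# 3 2 0 15 12 9 0 2 0
```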
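At the time of writing, 5 answers had been posted, along with Frank's comment on the efficiency of the data.table approach. So I wondered which of the five solutions is the fastest.
Some solutions modify the data.frame in place. To ensure a fair comparison, and because the OP asked for a new column named "diff", all solutions should return group with three columns. Some answers were modified accordingly. harelhan's answer required substantial modifications to remove errors.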
Because group is modified, every benchmark run starts with a fresh copy of group that has two columns.
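The reason a fresh copy matters is that setDT() and := change the object they are given instead of returning a modified copy. A small sketch of the pattern, assuming a toy data frame (the names df0 and work are made up for illustration):

```r
library(data.table)

# Toy input, standing in for the two-column benchmark data
df0 <- data.frame(Subject = c(1, 1, 2), pt = c(2, 5, 8))

# copy() makes a deep copy, so the in-place changes below cannot reach df0
work <- copy(df0)
setDT(work)[, diff := max(pt) - pt, by = Subject]

ncol(df0)   # still two columns
ncol(work)  # three columns, including diff
```

Without the copy(), later benchmark iterations could start from data that already carries the diff column or has already been converted to a data.table.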
The benchmark is parameterized over the number of rows and the share of groups, i.e., the number of groups scales with the problem size.
library(data.table)
library(dplyr)
library(bench)
bm <- press(
  # n_row = c(1E2, 1E4, 1E5, 1E6),
  n_row = c(1E2, 1E4, 1E5),
  grp_share = c(0.01, 0.1, 0.5, 0.9),
  {
    # number of groups grows with the number of rows
    n_grp <- grp_share * n_row
    set.seed(1)
    group0 <- data.frame(
      Subject = sample(n_grp, n_row, TRUE),
      pt = as.numeric(rpois(n_row, 100)))
    mark(
      r2Evans = {
        group <- copy(group0)
        group <- group %>%
          group_by(Subject) %>%
          mutate(diff = max(pt) - pt)
        group
      },
      Uwe = {
        group <- copy(group0)
        setDT(group)[, diff := max(pt) - pt, by = Subject]
        group
      },
      Frank = {
        group <- copy(group0)
        setDT(group)[, mx := max(pt), by=Subject][, diff := mx - pt][, mx := NULL]
        group
      },
      harelhan = {
        group <- copy(group0)
        max_group <- group %>% group_by(Subject) %>% summarize(max_val = max(pt))
        group <- left_join(group, max_group[, c("Subject", "max_val")], by = "Subject")
        group$diff <- group$max_val - group$pt
        group <- group %>% select(-max_val)
        group
      },
      JonMinton = {
        group <- copy(group0)
        group <- group %>%
          group_by(Subject) %>%
          mutate(max_group_val = max(pt)) %>%
          ungroup() %>%
          mutate(diff = max_group_val - pt) %>%
          select(-max_group_val)
        group
      },
      # the results differ in class (grouped tibble, data.table, data.frame),
      # so bench's default equality check between expressions is switched off
      check = FALSE
    )
  }
)
ggplot2::autoplot(bm)
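To make the parameterization concrete: press() runs the benchmark once for every combination of n_row and grp_share. The sketch below rebuilds that grid with base R's expand.grid() (used here only for illustration, not a claim about how bench works internally) and derives the number of groups the same way the benchmark code does.

```r
# All combinations of the two benchmark parameters
params <- expand.grid(n_row     = c(1E2, 1E4, 1E5),
                      grp_share = c(0.01, 0.1, 0.5, 0.9))

# Number of groups per run, as computed inside the benchmark
params$n_grp <- params$grp_share * params$n_row

nrow(params)  # 12 benchmark runs
params
```

The smallest case has a single group (1% of 100 rows), while the largest has 90,000 groups (90% of 100,000 rows), so the benchmark covers both few large groups and many tiny ones.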
Answer 1 (score: 1)
Using the example data and splitting the logic into smaller steps:
library(dplyr)
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
group <- data.frame(Subject=ID, pt=Value)
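# maximum of pt within each Subject
max_group <- group %>% group_by(Subject) %>% summarize(max_val = max(pt))
# attach each group's maximum to its rows (column names must be quoted)
group <- left_join(group, max_group[, c("Subject", "max_val")], by = "Subject")
# difference between the group maximum and the current row's value
group$diff <- group$max_val - group$pt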
Hope this solves the problem.
Answer 2 (score: 1)
Based on harelhan's answer, but with pipes:
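The code for this answer did not survive in this copy. The JonMinton entry in the benchmark of the first answer appears to be a piped variant of the same idea, so here is a sketch along those lines, adapted to the example data. It is not necessarily the author's original code.

```r
library(dplyr)

# Example data from the question
group <- data.frame(Subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                    pt      = c(2, 3, 5, 2, 5, 8, 17, 3, 5))

group <- group %>%
  group_by(Subject) %>%
  mutate(max_group_val = max(pt)) %>%   # group maximum, repeated on each row
  ungroup() %>%
  mutate(diff = max_group_val - pt) %>% # distance of each row from its maximum
  select(-max_group_val)                # drop the helper column

group
```

Compared with the join-based version, this keeps everything in one pipeline and never builds a separate summary table.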