Question

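I have a dataset describing the results of applying 3 algorithms to a number of cases. For each combination of algorithm and case, there is one result.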

df = data.frame(
  c("case1", "case1", "case1", "case2", "case2", "case2"),
  c("algo1", "algo2", "algo3", "algo1", "algo2", "algo3"),
  c(10, 11, 12, 22, 23, 20)
  );
names(df) <- c("case", "algorithm", "result");
df

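The algorithms aim to minimize the result value. So for each algorithm and case, I want to compute the gap to the lowest result achieved by any algorithm for the same case.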

gap <- function(caseId, result) {
  filtered = subset(df, case==caseId)
  return (result - min(filtered[,'result']));
}

When I apply the function manually, I get the expected results.

gap("case1", 10)  # prints 0, since 10 is the best value for case1
gap("case1", 11)  # prints 1, since 11-10=1
gap("case1", 12)  # prints 2, since 12-10=2

gap("case2", 22)  # prints 2, since 22-20=2
gap("case2", 23)  # prints 3, since 23-20=3
gap("case2", 20)  # prints 0, since 20 is the best value for case2

However, when I try to compute a new column over the whole dataset, I get bogus results for case2.

df$gap <- gap(df$case, df$result)
df

This produces

   case algorithm result gap
1 case1     algo1     10   0
2 case1     algo2     11   1
3 case1     algo3     12   2
4 case2     algo1     22  12
5 case2     algo2     23  13
6 case2     algo3     20  10

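It now looks as if the gap function is working against the overall minimum of the entire data frame, whereas it should only consider rows with the same case. Maybe the subset filtering in the gap function is not working properly?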

Answer 1

Use ave to get the minimum of each group and subtract it from result:

df$result - ave(df$result, df$case, FUN = min)
#[1] 0 1 2 2 3 0
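
As a minimal sketch of why the original call misbehaves and how the ave approach plugs into the question's data frame: when gap is called with the whole df$case column, the test case==caseId inside subset compares the column with itself element by element, so every row is kept and min sees the global minimum. The variable name keeps_all_rows is only illustrative and does not appear in the question.

```r
# Rebuild the data frame from the question
df <- data.frame(
  case      = c("case1", "case1", "case1", "case2", "case2", "case2"),
  algorithm = c("algo1", "algo2", "algo3", "algo1", "algo2", "algo3"),
  result    = c(10, 11, 12, 22, 23, 20)
)

# With caseId = df$case, the filter compares the column with itself,
# so it is TRUE for every row and subset() returns the whole data frame
keeps_all_rows <- all(df$case == df$case)
keeps_all_rows    # TRUE
min(df$result)    # 10, the global minimum that was subtracted from every row

# ave() returns a vector as long as its input, holding each row's group minimum
df$gap <- df$result - ave(df$result, df$case, FUN = min)
df$gap
# [1] 0 1 2 2 3 0
```

Assigning the result to df$gap keeps the original result column intact, which matches the new column the question asks for.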

Answer 2

We can use dplyr:

library(dplyr)
df %>%
  group_by(case) %>% 
  mutate(result = result - min(result))
# A tibble: 6 x 3
# Groups:   case [2]
#    case algorithm result
#   <fctr>    <fctr>  <dbl>
#1  case1     algo1      0
#2  case1     algo2      1
#3  case1     algo3      2
#4  case2     algo1      2
#5  case2     algo2      3
#6  case2     algo3      0
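
Note that the pipeline above overwrites result. If a separate gap column is wanted, as in the question, a small variation could look like the following sketch. It assumes dplyr is installed; the name out is only illustrative.

```r
library(dplyr)

# Rebuild the data frame from the question
df <- data.frame(
  case      = c("case1", "case1", "case1", "case2", "case2", "case2"),
  algorithm = c("algo1", "algo2", "algo3", "algo1", "algo2", "algo3"),
  result    = c(10, 11, 12, 22, 23, 20)
)

out <- df %>%
  group_by(case) %>%                         # min() is now evaluated per case
  mutate(gap = result - min(result)) %>%     # add gap, leave result untouched
  ungroup()                                  # drop the grouping for later steps

out$gap
# [1] 0 1 2 2 3 0
```

Calling ungroup() at the end is optional, but it avoids surprises if further summaries are computed on out.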
