Question

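Some time ago, dplyr introduced case_when, a nice SQL-like alternative to ifelse.

Is there an equivalent in data.table that lets you specify several conditions within a single [] statement, without loading additional packages?

Example: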

library(dplyr)

df <- data.frame(a = c("a", "b", "a"), b = c("b", "a", "a"))

df <- df %>% mutate(
    new = case_when(
    a == "a" & b == "b" ~ "c",
    a == "b" & b == "a" ~ "d",
    TRUE ~ "e")
    )

  a b new
1 a b   c
2 b a   d
3 a a   e

This would certainly be very helpful and would make code more readable (it is one of the reasons I keep using dplyr in these cases).

Answer 1

1) If the conditions are mutually exclusive, with a default for when all conditions are false, then this works:

library(data.table)
DT <- as.data.table(df) # df is from question

DT[, new := c("e", "c", "d")[1 +
                             1 * (a == "a" & b == "b") + 
                             2 * (a == "b" & b == "a")]
]

giving:

> DT
   a b new
1: a b   c
2: b a   d
3: a a   e
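The index arithmetic in 1) can be unpacked step by step. The following is a minimal base R sketch, not part of the original answer; the vectors `a` and `b` mirror the columns from the question, and the names `cond1`, `cond2`, `idx`, and `labels` are introduced here purely for illustration.

```r
# Plain vectors standing in for the columns of DT (illustrative only)
a <- c("a", "b", "a")
b <- c("b", "a", "a")

cond1 <- a == "a" & b == "b"   # first condition
cond2 <- a == "b" & b == "a"   # second condition

# Logicals are coerced to 0/1, so the index is
# 1 when no condition holds, 2 when cond1 holds, 3 when cond2 holds.
idx <- 1 + 1 * cond1 + 2 * cond2

# The default label comes first, followed by one label per condition.
labels <- c("e", "c", "d")
new <- labels[idx]
print(new)
```

Because the conditions are mutually exclusive, at most one term is added to the base index of 1, so `idx` never exceeds the length of `labels`.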

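2) If the results of the conditions are numeric, it is even simpler. For example, suppose we wanted 10 and 17 instead of c and d, with a default of 3. Then: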

library(data.table)
DT <- as.data.table(df) # df is from question

DT[, new := 3 + 
            (10 - 3) * (a == "a" & b == "b") + 
            (17 - 3) * (a == "b" & b == "a")]
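To see what the numeric form in 2) produces, the same arithmetic can be run on plain vectors. This is a small base R sketch under the assumption that `a` and `b` hold the values from the question; `default` is a name introduced here for illustration.

```r
a <- c("a", "b", "a")
b <- c("b", "a", "a")

default <- 3

# Each term adds the difference between the target value and the default,
# but only on rows where its condition is TRUE (TRUE counts as 1, FALSE as 0).
new <- default +
  (10 - default) * (a == "a" & b == "b") +
  (17 - default) * (a == "b" & b == "a")

print(new)
```

As in 1), this relies on the conditions being mutually exclusive; if two conditions held on the same row, both offsets would be added.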

3) Note that a one-liner is enough to implement this. It assumes that every row has at least one TRUE leg.

when <- function(...) names(match.call()[-1])[apply(cbind(...), 1, which.max)]

# test
DT[, new := when(c = a == 'a' & b == 'b', 
                 d = a == 'b' & b == 'a', 
                 e = TRUE)]
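The assumption that every row has at least one TRUE leg matters because which.max returns 1 for a row of all FALSE values. The sketch below reuses the `when` one-liner on plain vectors (outside data.table, as an illustration only) to show both the intended use and what happens when the catch-all leg is left out.

```r
when <- function(...) names(match.call()[-1])[apply(cbind(...), 1, which.max)]

a <- c("a", "b", "a")
b <- c("b", "a", "a")

# With a catch-all leg, every row contains a TRUE and the first match wins.
with_default <- when(c = a == "a" & b == "b",
                     d = a == "b" & b == "a",
                     e = TRUE)
print(with_default)

# Without the catch-all, row 3 has no TRUE leg. which.max then returns 1,
# so that row silently receives the first label instead of NA.
no_default <- when(c = a == "a" & b == "b",
                   d = a == "b" & b == "a")
print(no_default)
```

Ending the call with a leg such as `e = TRUE` is therefore the safe pattern.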

Answer 2

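This is not really an answer, but it is too long for a comment. If it is deemed inappropriate, I am happy to delete the post.

There is an interesting post on RStudio Community that discusses options for using dplyr::case_when without the usual tidyverse dependencies.

To summarise, there seem to be three alternatives:

Stefan Fleck isolated case_when from dplyr and built a new package, lest, that depends only on base.
yonicd developed noplyr, which "provides basic dplyr and tidyr functionality without the tidyverse dependencies".
Bob Rudis (hrbrmstr) is the creator of freebase, a "'usethis'-like package for Base R pseudo-equivalents of 'tidyverse' code", which may be worth checking out as well.

If it is only case_when you are after, I guess lest combined with data.table might be an attractive and minimal option.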

Answer 3

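FYI, here is a more recent answer for those who come across this post after 2019. The latest development version of data.table has an fcase function that provides exactly this. Implementation: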

# Lazy evaluation
x = 1:10

# case_when evaluates every right-hand side, so stop() is triggered
dplyr::case_when(
    x < 5L ~ 1L,
    x >= 5L ~ 3L,
    x == 5L ~ stop("provided value is an unexpected one!")
)
# Error in eval_tidy(pair$rhs, env = default_env) :
#  provided value is an unexpected one!

# fcase is lazy: the third value is never needed, so stop() never runs
data.table::fcase(
    x < 5L, 1L,
    x >= 5L, 3L,
    x == 5L, stop("provided value is an unexpected one!")
)
# [1] 1 1 1 1 3 3 3 3 3 3

# Benchmark
x = sample(1:100, 3e7, replace = TRUE) # 114 MB
microbenchmark::microbenchmark(
dplyr::case_when(
  x < 10L ~ 0L,
  x < 20L ~ 10L,
  x < 30L ~ 20L,
  x < 40L ~ 30L,
  x < 50L ~ 40L,
  x < 60L ~ 50L,
  x > 60L ~ 60L
),
data.table::fcase(
  x < 10L, 0L,
  x < 20L, 10L,
  x < 30L, 20L,
  x < 40L, 30L,
  x < 50L, 40L,
  x < 60L, 50L,
  x > 60L, 60L
),
times = 5L,
unit = "s")
# Unit: seconds
#               expr   min    lq  mean   median    uq    max neval
# dplyr::case_when   11.57 11.71 12.22    11.82 12.00  14.02     5
# data.table::fcase   1.49  1.55  1.67     1.71  1.73   1.86     5

Source, under "data.table v1.12.9 (in development)". It should be released soon, probably in January 2020.
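Applied to the data from the question, fcase can be used directly inside `:=`. This is a sketch that assumes an installed data.table version that exports fcase; it uses the `default` argument for the fallback value instead of a trailing TRUE condition.

```r
library(data.table)  # assumes a version that exports fcase()

DT <- data.table(a = c("a", "b", "a"), b = c("b", "a", "a"))

# Conditions and values alternate; rows matching no condition get `default`.
DT[, new := fcase(
  a == "a" & b == "b", "c",
  a == "b" & b == "a", "d",
  default = "e"
)]

print(DT)
```

Conditions are checked in order and the first match wins, so this also works when conditions overlap.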

Answer 4

Here is a variation on @g-grothendieck's answer that works for non-exclusive conditions:

DT[, new := c("c", "d", "e")[
  apply(cbind(
    a == "a" & b == "b", 
    a == "b" & b == "a",
    TRUE), 1, which.max)]
  ]

DT
#    a b new
# 1: a b   c
# 2: b a   d
# 3: a a   e
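The difference only shows up when conditions overlap. The following base R sketch uses a hypothetical numeric vector `x` (not from the question) to compare the additive index from Answer 1 with the first-match index used here.

```r
x <- c(1, 5, 9)

cond_high <- x > 4   # FALSE TRUE TRUE
cond_low  <- x > 0   # TRUE  TRUE TRUE  (overlaps with cond_high)

# Additive index: overlapping TRUEs add up (1 + 1 + 2 = 4),
# which points past the end of the three labels and yields NA.
additive_idx <- 1 + 1 * cond_high + 2 * cond_low
additive <- c("default", "high", "low")[additive_idx]
print(additive)

# First-match index: which.max picks the first TRUE in each row,
# so earlier conditions take priority, as in case_when.
first_idx <- apply(cbind(cond_high, cond_low, TRUE), 1, which.max)
first_match <- c("high", "low", "default")[first_idx]
print(first_match)
```

The order of the columns passed to cbind determines the priority, so the most specific condition should come first.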

data.table alternative for dplyr case_when
