Question

I wanted to use identical() within mutate() and I get "weird" results. Am I missing something here, or is this a bug?

Consider the following example:

library(dplyr)

dat <- data.frame(x = 1:4, y = c(1, 2, 10, NA))

I want to check whether y differs from x:

mutate(dat, diff = x != y)
# x  y  diff
# 1 1  1 FALSE
# 2 2  2 FALSE
# 3 3 10  TRUE
# 4 4 NA    NA

There is a "problem" with the NA, so I turned to identical():

mutate(dat, diff = !identical(x, y))
# x  y diff
# 1 1  1 TRUE
# 2 2  2 TRUE
# 3 3 10 TRUE
# 4 4 NA TRUE

Hmm, that is a bit odd. I investigated and found that it has to do with the different data types:

class(dat$x)
# [1] "integer"
class(dat$y)
# [1] "numeric"
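
As a minimal sketch of the type mismatch (the vectors here are illustrative, not the data frame above): identical() compares type as well as value, so an integer vector and a double vector are never identical, even when == reports every element as equal.

```r
# Illustrative vectors: same values, different storage types.
x <- 1:4                     # integer
y <- c(1, 2, 3, 4)           # double (class "numeric")

x == y                       # element-wise comparison after coercion
identical(x, y)              # FALSE: integer vs. double
identical(as.numeric(x), y)  # TRUE once both are double
```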

So, let's harmonize them:

dat$x <- as.numeric(dat$x)
dat$y <- as.numeric(dat$y)

Now, I would intuitively expect mutate() to give me the same result as:

sapply(1:nrow(dat), function(ii) {
  !identical(dat[ii, "x"], dat[ii, "y"])
})
# [1]  FALSE FALSE TRUE TRUE

But it still gives me this:

mutate(dat, diff = !identical(x, y))
# x  y diff
# 1 1  1 TRUE
# 2 2  2 TRUE
# 3 3 10 TRUE
# 4 4 NA TRUE

While I was expecting this:

# x  y diff
# 1 1  1 FALSE
# 2 2  2 FALSE
# 3 3 10 TRUE
# 4 4 NA TRUE

What is the reason for this, and/or how can I work around it so that I can still use mutate() (which I really like)?

Update

Wow, what a difference in speed!

identicalVectorized <- function(x, y) {
  (x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y))
}

identicalVectorized2 <- function(x, y) {
  sapply(1:length(x), function(ii) {
    !identical(x[ii], y[ii])
  })
}

dat <- data.frame(x = as.numeric(c(1:4,NA, NA)), 
    y = as.numeric(c(1, 2, 10, NA, 15, NA)))

microbenchmark::microbenchmark(
  mutate(dat, diff = identicalVectorized(x, y)),
  mutate(dat, diff = identicalVectorized2(x, y))
)

Results

Unit: microseconds
                                           expr    min     lq     mean median      uq     max neval
  mutate(dat, diff = identicalVectorized(x, y)) 31.672 34.164 38.79999 35.777 37.6825 120.526   100
 mutate(dat, diff = identicalVectorized2(x, y)) 58.064 60.703 66.66150 62.462 72.7260 117.593   100

Answer 1

This is probably your best option:

dat <- data.frame(x = c(1:4,NA), y = c(1, 2, 10, NA, 15))
mutate(dat, diff = x != y | is.na(x) | is.na(y))

If you want NA == NA to be treated as TRUE (which it isn't in R, where it evaluates to NA), use:

mutate(dat, diff = (x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y)))
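
To check that expression outside of mutate(), here is a small base-R sketch. The helper name differs() is hypothetical and only wraps the same logic; the vectors mirror the six-row data from the question's update.

```r
# Hypothetical helper (illustrative name): TRUE where x and y differ,
# treating NA vs. NA as equal and NA vs. a value as different.
differs <- function(x, y) {
  (x != y | is.na(x) | is.na(y)) & !(is.na(x) & is.na(y))
}

x <- c(1, 2, 3, 4, NA, NA)
y <- c(1, 2, 10, NA, 15, NA)

differs(x, y)
# [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE

# With dplyr loaded, the same helper could be used as:
# mutate(dat, diff = differs(x, y))
```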

Edit: If you want to invert TRUE/FALSE, you can do this:

Wrap the whole expression in parentheses and put a ! in front, like so:

mutate(dat, diff = !((x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y))))

Or you can rework the logic:

mutate(dat, diff = (x == y & !(is.na(x) & !is.na(y)) & !(!is.na(x) & is.na(y)) | (is.na(x) & is.na(y))))
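
As for why !identical(x, y) produced TRUE in every row: identical() is not vectorized. It compares the two columns as whole objects and returns a single TRUE or FALSE, and mutate() recycles that one value to all rows. The sketch below shows this in base R, along with one possible element-wise variant using mapply() (an alternative for illustration; it is slower than the vectorized expressions above).

```r
x <- c(1, 2, 3, 4)
y <- c(1, 2, 10, NA)

# identical() compares the two vectors as whole objects: one answer.
whole <- !identical(x, y)
length(whole)  # 1
whole          # TRUE, which is then recycled to every row of the column

# Element-wise variant: call identical() once per pair of elements.
per_row <- !mapply(identical, x, y)
per_row
# [1] FALSE FALSE  TRUE  TRUE
```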
