I want to use identical() inside mutate(), and I get "strange" results. Am I missing something here, or is this a bug?
Consider the following example:
library(dplyr)
dat <- data.frame(x = 1:4, y = c(1, 2, 10, NA))
I want to check whether y differs from x:
mutate(dat, diff = x != y)
# x y diff
# 1 1 1 FALSE
# 2 2 2 FALSE
# 3 3 10 TRUE
# 4 4 NA NA
There is a "problem" with the NA, so I turned to identical():
mutate(dat, diff = !identical(x, y))
# x y diff
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 3 10 TRUE
# 4 4 NA TRUE
Hmm, that is a bit odd. I investigated and found that it has to do with the different data types:
class(dat$x)
# [1] "integer"
class(dat$y)
# [1] "numeric"
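A minimal base R sketch of what that class difference means for a single value (the variable names `int_one` and `dbl_one` are illustrative only): identical() compares the type as well as the value, whereas == coerces both sides before comparing.

```r
int_one <- 1L   # integer, like dat$x
dbl_one <- 1    # double ("numeric"), like dat$y

identical(int_one, dbl_one)              # FALSE: same value, different type
int_one == dbl_one                       # TRUE: == coerces integer to double
identical(as.numeric(int_one), dbl_one)  # TRUE once the types match
```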
So let's harmonize the types:
dat$x <- as.numeric(dat$x)
dat$y <- as.numeric(dat$y)
Now, I would intuitively expect mutate() to give me the same result as this:
sapply(seq_len(nrow(dat)), function(ii) {
  !identical(dat[ii, "x"], dat[ii, "y"])
})
# [1] FALSE FALSE TRUE TRUE
But it still gives me this:
mutate(dat, diff = !identical(x, y))
# x y diff
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 3 10 TRUE
# 4 4 NA TRUE
whereas I expected this:
# x y diff
# 1 1 1 FALSE
# 2 2 2 FALSE
# 3 3 10 TRUE
# 4 4 NA TRUE
What is the reason for this, and/or how can I work around it so that I can still use mutate() (which I really like)?
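# Returns TRUE where x and y differ, treating two NAs as equal
identicalVectorized <- function(x, y) {
  (x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y))
}
# Same result, but calls identical() once per element
identicalVectorized2 <- function(x, y) {
  sapply(seq_along(x), function(ii) {
    !identical(x[ii], y[ii])
  })
}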
dat <- data.frame(x = as.numeric(c(1:4,NA, NA)),
y = as.numeric(c(1, 2, 10, NA, 15, NA)))
microbenchmark::microbenchmark(
mutate(dat, diff = identicalVectorized(x, y)),
mutate(dat, diff = identicalVectorized2(x, y))
)
Results:
Unit: microseconds
expr min lq mean median uq max neval
mutate(dat, diff = identicalVectorized(x, y)) 31.672 34.164 38.79999 35.777 37.6825 120.526 100
mutate(dat, diff = identicalVectorized2(x, y)) 58.064 60.703 66.66150 62.462 72.7260 117.593 100
Answer 0 (score: 1)
This is probably your best option:
dat <- data.frame(x = c(1:4,NA), y = c(1, 2, 10, NA, 15))
mutate(dat, diff = x != y | is.na(x) | is.na(y))
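To make the reasoning explicit, here is a small base R sketch (plain vectors standing in for the columns, with illustrative variable names) of why identical() behaves differently from the expression above. identical() compares its two arguments as whole objects and returns a single TRUE or FALSE, which mutate() then recycles to every row; the comparison operators work element by element.

```r
x <- c(1, 2, 3, 4)
y <- c(1, 2, 10, NA)

# identical() looks at the two vectors as whole objects: one value comes back
whole <- identical(x, y)
length(whole)                # 1

# Recycling the negated single value over four rows gives an all-TRUE column
rep(!whole, length(x))       # TRUE TRUE TRUE TRUE

# The element-wise expression yields one value per row instead
elementwise <- x != y | is.na(x) | is.na(y)
elementwise                  # FALSE FALSE TRUE TRUE
```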
If you want two NAs to count as equal (NA == NA is not TRUE in R), use:
mutate(dat, diff = (x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y)))
Edit: if you want to invert TRUE/FALSE, you can wrap the whole thing in parentheses and put a ! in front, like so:
mutate(dat, diff = !((x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y))))
Or you can rework the logic:
mutate(dat, diff = (x == y & !(is.na(x) & !is.na(y)) & !(!is.na(x) & is.na(y)) | (is.na(x) & is.na(y))))
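As a quick sanity check, the two formulations can be compared in base R on a vector pair that covers every NA combination. The helper names `same_negated` and `same_reworked` are hypothetical wrappers introduced here only for the comparison; they are not part of dplyr.

```r
# Negated form: wrap the "differs" expression in !( ... )
same_negated <- function(x, y) {
  !((x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y)))
}

# Reworked form: equal values, or both NA
same_reworked <- function(x, y) {
  (x == y & !(is.na(x) & !is.na(y)) & !(!is.na(x) & is.na(y)) | (is.na(x) & is.na(y)))
}

# Covers: equal, different, NA on the right, NA on the left, NA on both sides
x <- c(1, 2, 3, 4, NA, NA)
y <- c(1, 2, 10, NA, 15, NA)

same_negated(x, y)    # TRUE TRUE FALSE FALSE FALSE TRUE
same_reworked(x, y)   # TRUE TRUE FALSE FALSE FALSE TRUE
```

Both return a plain logical vector with no NA in it, so either can be used directly inside mutate().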