Problem using identical() inside dplyr::mutate()

Date: 2016-08-25 17:47:03

Tags: r types dplyr vectorization

I want to use identical() inside mutate(), and I am getting "weird" results. Am I missing something here, or is this a bug?

Consider the following example:

dat <- data.frame(x = 1:4, y = c(1, 2, 10, NA))

I want to check whether y differs from x:

mutate(dat, diff = x != y)
# x  y  diff
# 1 1  1 FALSE
# 2 2  2 FALSE
# 3 3 10  TRUE
# 4 4 NA    NA

There is a "problem" with the NA, so I turned to identical():

mutate(dat, diff = !identical(x, y))
# x  y diff
# 1 1  1 TRUE
# 2 2  2 TRUE
# 3 3 10 TRUE
# 4 4 NA TRUE

Hmm, that is a bit odd. I investigated and found that it has to do with the different data types:

class(dat$x)
# [1] "integer"
class(dat$y)
# [1] "numeric"

So let's make the types match:

dat$x <- as.numeric(dat$x)
dat$y <- as.numeric(dat$y)

Now, I would intuitively expect mutate() to give me the same result as this:

sapply(1:nrow(dat), function(ii) {
  !identical(dat[ii, "x"], dat[ii, "y"])
})
# [1]  FALSE FALSE TRUE TRUE

But it still gives me this:

mutate(dat, diff = !identical(x, y))
# x  y diff
# 1 1  1 TRUE
# 2 2  2 TRUE
# 3 3 10 TRUE
# 4 4 NA TRUE

whereas I expected this:

# x  y diff
# 1 1  1 FALSE
# 2 2  2 FALSE
# 3 3 10 TRUE
# 4 4 NA TRUE

What is the reason for this, and/or how can I work around it so that I can still use mutate() (which I really like)?

Update

Wow, what a difference in speed!

identicalVectorized <- function(x, y) {
  (x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y))
}

identicalVectorized2 <- function(x, y) {
  sapply(1:length(x), function(ii) {
    !identical(x[ii], y[ii])
  })
}

dat <- data.frame(x = as.numeric(c(1:4,NA, NA)), 
    y = as.numeric(c(1, 2, 10, NA, 15, NA)))

microbenchmark::microbenchmark(
  mutate(dat, diff = identicalVectorized(x, y)),
  mutate(dat, diff = identicalVectorized2(x, y))
)

Results

Unit: microseconds
                                           expr    min     lq     mean median      uq     max neval
  mutate(dat, diff = identicalVectorized(x, y)) 31.672 34.164 38.79999 35.777 37.6825 120.526   100
 mutate(dat, diff = identicalVectorized2(x, y)) 58.064 60.703 66.66150 62.462 72.7260 117.593   100

1 Answer:

Answer 0 (score: 1)

This is probably your best bet:

dat <- data.frame(x = c(1:4,NA), y = c(1, 2, 10, NA, 15))
mutate(dat, diff = x != y | is.na(x) | is.na(y))
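As background on why the identical() version returns TRUE in every row, here is a minimal base R sketch. identical() compares two objects as a whole and returns a single logical, and mutate() recycles a length-one result across all rows. The vectors and the name `whole` are made up for illustration.

```r
x <- c(1, 2, 3, 4)
y <- c(1, 2, 10, NA)

# identical() is not element-wise: it compares the two vectors as whole
# objects and returns exactly one logical value.
whole <- identical(x, y)
length(whole)  # 1
!whole         # TRUE, because the vectors differ somewhere

# A length-one value is recycled to fill a column, which is why every row
# ends up with the same answer.
rep(!whole, length(x))

# Storage type matters as well: an integer vector is never identical to a
# double vector, even when the values match.
identical(1:4, c(1, 2, 3, 4))              # FALSE
identical(as.numeric(1:4), c(1, 2, 3, 4))  # TRUE
```

So harmonizing the types is not enough on its own; the comparison also has to be element-wise.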

If you want two NAs to count as equal (NA == NA is not TRUE in R), use:

mutate(dat, diff = (x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y)))
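The same logic can be spelled out case by case, which may be easier to read. This is only a sketch in base R: the helper name `differs` and the intermediate names are hypothetical, and the test vectors mirror the ones used in the question's update.

```r
# Hypothetical helper: TRUE where x and y differ, treating two NAs as equal.
differs <- function(x, y) {
  both_na <- is.na(x) & is.na(y)        # NA on both sides: counts as "same"
  one_na  <- xor(is.na(x), is.na(y))    # NA on exactly one side: "different"
  # x != y is NA whenever either side is NA; the !is.na() guards turn those
  # cases into FALSE so that one_na and both_na decide them instead.
  both_present_and_unequal <- !is.na(x) & !is.na(y) & x != y
  (one_na | both_present_and_unequal) & !both_na
}

x <- c(1, 2, 3, 4, NA, NA)
y <- c(1, 2, 10, NA, 15, NA)
differs(x, y)
# [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE
```

With dplyr loaded, it should then be usable as mutate(dat, diff = differs(x, y)).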

Edit: If you want to invert TRUE/FALSE, you can do either of the following:

  • Wrap the whole thing in parentheses and put a ! in front, like so:

    mutate(dat, diff = !((x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y))))

  • Or you can rework the logic:

    mutate(dat, diff = (x == y & !(is.na(x) & !is.na(y)) & !(!is.na(x) & is.na(y)) | (is.na(x) & is.na(y))))
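As a quick sanity check, the two inverted forms can be compared on plain vectors in base R, without mutate(). The names `negated` and `reworked` and the test vectors are made up for illustration.

```r
x <- c(1, 2, 3, 4, NA, NA)
y <- c(1, 2, 10, NA, 15, NA)

# First option: negate the whole "differs" expression.
negated <- !((x != y | (is.na(x) | is.na(y))) & !(is.na(x) & is.na(y)))

# Second option: the reworked logic, written directly as "is the same".
reworked <- (x == y & !(is.na(x) & !is.na(y)) & !(!is.na(x) & is.na(y)) |
               (is.na(x) & is.na(y)))

# Both give TRUE where the values match (including NA with NA).
negated
# [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE
identical(negated, reworked)
# [1] TRUE
```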