Question

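I have code that replaces impossible values in a dataset with NA.

I am trying to convert that code to data.table. For example, I replace a height of 0 with NA.

(Dummy) data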

 DT <- data.table(id = 1:5e6, 
                  height = sample(c(0, 100:240), 5e6, replace = TRUE))

My current solution is slow, and at least as verbose as my data.frame version. I think I am doing something wrong...

DT[height == 0, height := NA]

While researching this, I found another solution that is faster (but uglier).

set(DT, which(DT[["height"]] == 0), "height", value = NA)

All suggestions are appreciated.

Answer 1

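Since v1.9.4, data.table by default automatically creates an index on the column used in subsets of the form x == val and x %in% val in [.data.table calls. This makes subsequent subsets very fast, at a slightly higher price for the first subset (only slightly, because data.table's radix ordering is very fast). The first subset can be slower because it has to:

1. Create the index.
2. Then subset.

To illustrate this (using @akrun's data):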

require(data.table)
getOption("datatable.auto.index") # [1] TRUE ===> enabled

set.seed(24)
DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))

system.time(DT[height == 0L])
#   0.396   0.059   0.452 ## first run
#   0.003   0.000   0.004 ## second run is very fast
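As a rough way to see the index itself, the sketch below calls indices(), a data.table helper that the answer above does not use, before and after the first subset. The smaller 1e6-row table is an assumption made only to keep the example quick, and the exact output may depend on the installed data.table version.

```r
library(data.table)

# Make sure auto-indexing is on (it is the default).
options(datatable.auto.index = TRUE)

set.seed(24)
DT <- data.table(id = 1:1e6,
                 height = sample(c(0, 100:240), 1e6, replace = TRUE))

indices(DT)             # no index has been created yet
res <- DT[height == 0]  # first subset; this is where the index would be built
indices(DT)             # with auto-indexing on, expected to list "height"
```

If the second call lists height, later subsets on that column can reuse the stored index instead of scanning the whole column.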

Now, if we disable auto-indexing:

require(data.table)
options(datatable.auto.index = FALSE)
getOption("datatable.auto.index") # [1] FALSE

set.seed(24)
DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))

system.time(DT[height == 0L])
#   0.037   0.007   0.042 ## first run
#   0.039   0.010   0.045 ## second run (~ 10x slower than 2nd run above)

options(datatable.auto.index = TRUE) # restore auto indexing if necessary

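But your case is special, because you update the same column that you subset on. Essentially, this is what happens:

1. The i expression is recognised as one that can be optimised with auto-indexing. An index is created and saved to make later subsets fast.
2. The j expression is evaluated, and the column is updated.
3. The column that the index was built on has been updated, so the index is removed.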
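The last step can be observed directly. The following is a minimal sketch, assuming a small 1e5-row table and an index created explicitly with setindex() rather than by auto-indexing, so that the starting state is unambiguous; indices() is used here only to inspect the result.

```r
library(data.table)

set.seed(24)
DT <- data.table(id = 1:1e5,
                 height = sample(c(0, 100:240), 1e5, replace = TRUE))

# Build the index by hand so we know it exists before the update.
setindex(DT, height)
before <- indices(DT)

# Update the indexed column by reference, as in the question.
DT[height == 0, height := NA]
after <- indices(DT)

before  # the index on height is present
after   # the index on height should be gone after the update
```

The work spent building the index is wasted whenever the same call then modifies the indexed column.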

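The auto-indexing logic should detect this case and skip creating the index altogether when any rows evaluate to TRUE, because the index it creates is immediately dropped and is therefore basically useless.

Could you please file an issue on the project issues page? Just linking to this SO Q would suffice.

To answer your question: disable auto-indexing and run the subset. It should then take more or less the same time as using set().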
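A minimal sketch of that suggestion follows, assuming a 1e6-row table for brevity. It switches the option off only around the update and then restores whatever value was set before; the variable names old and n_zero are illustrative.

```r
library(data.table)

set.seed(24)
DT <- data.table(id = 1:1e6,
                 height = sample(c(0, 100:240), 1e6, replace = TRUE))
n_zero <- sum(DT$height == 0)

# options() returns the previous values, so they can be restored afterwards.
old <- options(datatable.auto.index = FALSE)
system.time(DT[height == 0, height := NA])
options(old)
```

Restoring the option matters if other code in the same session relies on auto-indexing for repeated subsets.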

The base R solution is not fast here, because it copies the entire column just to update those entries. But that is because base R chooses to do so.

Answer 2

We can try

system.time(DT[, height:= NA^(!height)*height])
#  user  system elapsed 
#  0.03    0.05    0.08
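The expression above relies on two facts about R arithmetic: NA^0 is 1, while NA^1 is NA. A small sketch on a hypothetical four-element vector x shows how the pieces combine:

```r
x <- c(0, 150, 0, 200)

NA^0   # 1
NA^1   # NA
!x     # TRUE exactly where x is 0

# Where x is 0:     NA^TRUE  is NA^1 = NA, and NA * 0 is NA.
# Where x is not 0: NA^FALSE is NA^0 = 1,  and 1 * x is x.
res <- NA^(!x) * x
res
```

Because the whole column is recomputed in a single vectorised pass, no subset and no index are involved.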

OP's code

system.time(DT[height == 0, height := NA])
#   user  system elapsed 
#   0.42    0.04    0.49

The base R option should be faster than the OP's code.

system.time(DT$height[DT$height == 0] <- NA)
#   user  system elapsed 
#  0.19    0.05    0.23

And the is.na route

system.time(is.na(DT$height) <- DT$height == 0)
#  user  system elapsed 
#   0.22    0.06    0.28

@DavidArenburg's suggestion

system.time(set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA))
#   user  system elapsed 
#   0.06    0.00    0.06

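Note: every benchmark here was run on a freshly created dataset, so that the timings are not biased. I could have used microbenchmark, but with repeated runs every run after the first would be biased, because the assignment has already happened in the first run.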
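One way to follow that approach is sketched below. The helpers make_dt() and time_fresh() are hypothetical names introduced only for this illustration, and the 1e6-row size is an assumption to keep it quick.

```r
library(data.table)

# Hypothetical helper: rebuild the same dataset from a fixed seed.
make_dt <- function(n = 1e6) {
  set.seed(24)
  data.table(id = seq_len(n),
             height = sample(c(0, 100:240), n, replace = TRUE))
}

# Hypothetical helper: time f on a fresh table, so that no candidate
# ever sees zeros that an earlier run already replaced.
time_fresh <- function(f, n = 1e6) {
  DT <- make_dt(n)
  system.time(f(DT))
}

t_op  <- time_fresh(function(DT) DT[height == 0, height := NA])
t_set <- time_fresh(function(DT) set(DT, i = which(DT[["height"]] == 0),
                                     j = "height", value = NA))
t_op
t_set
```

Building the table outside system.time() keeps the data creation cost out of the measurement.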

Using a larger dataset

set.seed(24)
DT <- data.table(id = 1:1e8, 
             height = sample(c(0, 100:240), 1e8, replace = TRUE))

system.time(DT[, height:= NA^(!height)*height])
#  user  system elapsed 
#  0.58    0.24    0.81 

system.time(set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA))
#   user  system elapsed 
#   0.49    0.12    0.61

Data

set.seed(24)
DT <- data.table(id = 1:1e7, 
             height = sample(c(0, 100:240), 1e7, replace = TRUE))

Answer 3

Speed test with a single evaluation on 100 million rows:

library(data.table)
DT <- data.table(id = 1:1e8, 
                 height = sample(c(0, 100:240), 1e8, replace = TRUE))
DT2 <- copy(DT);DT3 <- copy(DT); DT4 <- copy(DT); DT5 <- copy(DT); DT6 <- copy(DT);DT7 <- copy(DT)
library(microbenchmark)
microbenchmark(
  David    = set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA),
  OP       = DT2[height == 0, height := NA],
  akrun    = setkey(DT3, "height")[.(0), height := NA],
  isna     = {is.na(DT4$height) <- DT4$height == 0},
  assignNA = {DT5$height[DT5$height == 0] <- NA},
  indexset = {setindex(DT6, height); DT6[height==0, height := NA_real_]},
  exponent = DT7[, height:= NA^(!height)*height],
  times=1L
)
# Unit: milliseconds
# expr            min         lq       mean     median         uq        max neval
# David      585.9044   585.9044   585.9044   585.9044   585.9044   585.9044     1
# OP       10421.3323 10421.3323 10421.3323 10421.3323 10421.3323 10421.3323     1
# akrun    11922.5951 11922.5951 11922.5951 11922.5951 11922.5951 11922.5951     1
# isna      4843.3623  4843.3623  4843.3623  4843.3623  4843.3623  4843.3623     1
# assignNA  4797.0191  4797.0191  4797.0191  4797.0191  4797.0191  4797.0191     1
# indexset  6307.4564  6307.4564  6307.4564  6307.4564  6307.4564  6307.4564     1
# exponent  1054.6013  1054.6013  1054.6013  1054.6013  1054.6013  1054.6013     1

Replace impossible values with NA using R's data.table
