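I would like to know whether there is a dedicated function for checking whether the result set returned by a data.table query is empty, i.e. has zero rows.
I tried to check which of the available methods is faster and, surprisingly, the stock function nrow() appears to be faster than using .N inside data.table. Is this due to the size of the data.table used in my example, or is it true in general?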
library(data.table)
library(microbenchmark)
dt <- structure(list(Random.Number = c(8135L, 1961L, 18307L, 4353L, 2270L, 7905L, 2600L, 2406L, 2286L, 2464L)
, Random.Integer = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L))
, row.names = c(NA, -10L), class = c("data.table", "data.frame"))
microbenchmark(
a = nrow(dt[Random.Number < 1 ,]) == 0
,
b = dt[Random.Number < 1 ,.N] == 0
, times = 1000
)
Unit: microseconds
expr min lq mean median uq max neval
a 253.261 281.4845 306.5689 292.8045 309.407 3569.189 1000
b 311.520 334.7630 354.3346 346.8375 361.931 3342.492 1000
Edit: the same benchmark on a larger table (100,000 rows):
dt <- data.table(Random.Number = rnorm(100000))
microbenchmark(
a = nrow(dt[Random.Number < 1 ,]) == 0
,
b = dt[Random.Number < 1 ,.N] == 0
, times = 1000
)
Unit: milliseconds
expr min lq mean median uq max neval
a 1.203515 1.287130 1.795557 1.331504 1.438513 85.51352 1000
b 1.021796 1.093638 1.607488 1.128352 1.191289 90.10088 1000
Answer 0 (score: 3)
library(data.table)
library(microbenchmark)
n = 1e7
n_extracol = 20
dt <- data.table(Random.Number = rnorm(n))
dt[, sprintf("%02d", 1:n_extracol) := 22 ]
# sanity check: every approach must return the same count
n_expected = nrow(dt[Random.Number < 1])
all(c(
dt[Random.Number < 1, .N],
dt[, sum(Random.Number < 1)],
length(dt[Random.Number < 1, which = TRUE]),
dt[.(v = 1), on = .(Random.Number < v), .N],
data.table(v = 1)[dt, on = .(v = Random.Number), roll=-Inf, .N, nomatch=0]
) == n_expected) # TRUE
# benchmark
microbenchmark(times = 10,
nrow = nrow(dt[Random.Number < 1])
,
.N = dt[Random.Number < 1, .N]
,
sum = dt[, sum(Random.Number < 1)]
,
len = length(dt[Random.Number < 1, which = TRUE])
,
join = dt[.(v = 1), on = .(Random.Number < v), .N]
,
roll = data.table(v = 1)[dt, on = .(v = Random.Number), roll=-Inf, .N, nomatch=0]
)
Results on my machine:
Unit: milliseconds
expr min lq mean median uq max neval cld
nrow 811.37666 929.83352 963.42572 985.02599 1016.31359 1046.15549 10 b
.N 73.84544 74.26404 79.61228 75.33567 75.71378 120.97063 10 a
sum 44.11742 44.37590 44.64419 44.54316 44.68093 45.53861 10 a
len 69.37396 70.19565 93.39528 70.99561 72.46614 251.16317 10 a
join 856.37441 861.35975 898.08747 871.39156 900.91571 1099.40732 10 b
roll 1469.73950 1478.51737 1513.49030 1487.32068 1499.74617 1699.44766 10 c
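It is clear that nrow will do worse than .N, since it has to build the intermediate table (with all columns) before counting its rows. I don't know why roll and join are so slow, but I guess they may see further optimization later.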
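To make the point about the intermediate table concrete, here is a minimal sketch. The table dt_small and its payload columns x1, x2, x3 are made up for illustration and are not part of the benchmark above.

```r
library(data.table)

# Hypothetical small table: one filter column plus a few payload columns.
set.seed(1)
dt_small <- data.table(Random.Number = rnorm(1000))
dt_small[, c("x1", "x2", "x3") := 22]

# nrow() route: the filtered table is built first, payload columns included.
intermediate <- dt_small[Random.Number < 1]
dim(intermediate)  # matching rows by 4 columns

# .N route: only a single integer comes back from the query.
n_match <- dt_small[Random.Number < 1, .N]

# Both routes agree on the count; they differ in what gets materialized.
stopifnot(nrow(intermediate) == n_match, ncol(intermediate) == 4L)
```

With 20 extra columns and 1e7 rows, as in the benchmark, that intermediate table is presumably where most of the nrow timing goes.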
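sum (@akrun's idea) is even faster than .N, but it is not the style I prefer. Also, the .N approach may well come out ahead once auto-indexing with inequalities is implemented. In any case, I always test the way the OP does, usually like this: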
DT[query, stopifnot(.N == 0L)]
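A small sketch of how that assertion behaves, using a made-up table DT with columns id and value (both hypothetical). The tryCatch wrappers are only there so that both outcomes can be shown without stopping the script.

```r
library(data.table)

# Hypothetical data: no value exceeds 1, two values exceed 0.5.
DT <- data.table(id = 1:5, value = c(0.2, 0.5, 0.9, 0.1, 0.7))

# Empty query: .N is 0 inside j, so stopifnot() passes silently.
empty_case <- tryCatch({
  DT[value > 1, stopifnot(.N == 0L)]
  "passed"
}, error = function(e) "failed")

# Non-empty query: .N is 2, so stopifnot() raises an error.
nonempty_case <- tryCatch({
  DT[value > 0.5, stopifnot(.N == 0L)]
  "passed"
}, error = function(e) "failed")

print(c(empty_case, nonempty_case))
```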
If you don't care about the row count at all beyond comparing it with zero, you can use any instead of sum, but in this example it is only a little faster.
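For reference, a sketch of the three emptiness checks side by side on a made-up table (dt_demo is hypothetical, and the threshold -10 is chosen only so that no row matches):

```r
library(data.table)

set.seed(1)
dt_demo <- data.table(Random.Number = rnorm(1000))

# Count-based checks: compute the number of matches, then compare with zero.
empty_by_N   <- dt_demo[Random.Number < -10, .N] == 0L
empty_by_sum <- dt_demo[, sum(Random.Number < -10)] == 0L

# any()-based check: only asks whether at least one row matches.
empty_by_any <- dt_demo[, !any(Random.Number < -10)]

# All three agree that the query is empty.
stopifnot(empty_by_N, empty_by_sum, empty_by_any)

# And a query that does match rows is reported as non-empty.
nonempty_by_any <- dt_demo[, any(Random.Number < 1)]
stopifnot(nonempty_by_any)
```

One caveat: subsetting in i treats NA as FALSE, whereas sum and any propagate NA, so if the column can contain missing values the last two forms need na.rm = TRUE to give the same answer.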