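With data.table, fread by default reads dates in as character values. With this default, I noticed a large difference in execution time when filtering a date range in i with logical comparisons versus the %in% operator: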
library(data.table)
CharDateRange <- function(start.date, end.date) {
  sapply(seq(as.Date(start.date), as.Date(end.date), by="days"),
         function (x) format(x, "%Y-%m-%d"))
}
# define a range of dates, represented by a character vector
range.dates <- CharDateRange("2015-01-01", "2015-01-31")
# create example data table
nrows <- 1e7
DT <- data.table(date = sample(range.dates, nrows, replace=TRUE),
                 value = runif(nrows))
The %in% operation is much faster than the logical comparison:
print(system.time(DT[date %in% CharDateRange("2015-01-10", "2015-01-17")]))
#    user  system elapsed
#   0.238   0.017   0.254
versus
print(system.time(DT[date >= "2015-01-10" & date <= "2015-01-17"]))
#    user  system elapsed
#   6.693   0.018   6.711
Can you explain why this is the case?
Answer 0 (score: 2)
This is expected and has nothing to do with data.table or with dates:
myvec <- rep(c("111111","999999"),1e7)
mycompvec <- as.character(111111:999999)
system.time(myvec%in%mycompvec)
# user system elapsed
# 1.39 0.08 1.49
system.time(myvec<="999999"&myvec>="111111")
# user system elapsed
# 9.92 0.03 10.03
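A likely explanation, based on R's documented behaviour rather than anything measured in this thread: %in% is defined as match(x, table, nomatch = 0L) > 0L, and match() looks values up by hashing, whereas >= and <= on character vectors collate each string according to the current locale. The sketch below uses a small made-up sample (x and wanted are illustrative names, and the sample size is arbitrary) and only checks that the approaches select the same elements. The last part is an optional experiment with the "C" collation locale, where comparison should reduce to plain byte order; no timing result is claimed.

```r
# Small illustrative sample of ISO-formatted date strings
set.seed(1)
dates  <- format(seq(as.Date("2015-01-01"), as.Date("2015-01-31"), by = "day"))
x      <- sample(dates, 1e5, replace = TRUE)
wanted <- format(seq(as.Date("2015-01-10"), as.Date("2015-01-17"), by = "day"))

# %in% is a thin wrapper around match()
by_in    <- x %in% wanted
by_match <- match(x, wanted, nomatch = 0L) > 0L

# Relational operators on character vectors go through string collation
by_cmp <- x >= "2015-01-10" & x <= "2015-01-17"

# All three approaches pick the same elements
stopifnot(identical(by_in, by_match), identical(by_in, by_cmp))

# Optional experiment: time the comparison under the "C" collation locale,
# then restore the previous setting
old <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
t_c <- system.time(x >= "2015-01-10" & x <= "2015-01-17")
invisible(Sys.setlocale("LC_COLLATE", old))
print(t_c)
```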
Answer 1 (score: 0)
It is also worth pointing out that using a key is faster still (roughly a 17% improvement, not as dramatic as I expected):
library(microbenchmark)
# same data as in the question, but keyed on date
DT <- data.table(date = sample(range.dates, nrows, replace=TRUE),
                 value = runif(nrows), key="date")
microbenchmark(times=10,
               DT[date %in% CharDateRange("2015-01-10", "2015-01-17")],
               DT[date >= "2015-01-10" & date <= "2015-01-17"],
               DT[.(CharDateRange("2015-01-10", "2015-01-17"))])
Unit: milliseconds
                                                   expr        min         lq       mean     median         uq        max neval cld
DT[date %in% CharDateRange("2015-01-10", "2015-01-17")]   30.17786   30.90273   33.29402   31.71152   31.99111   42.29018    10   a
        DT[date >= "2015-01-10" & date <= "2015-01-17"] 4825.18913 4842.19703 4855.27402 4846.98401 4861.02841 4926.22591    10   b
       DT[.(CharDateRange("2015-01-10", "2015-01-17"))]   26.15394   26.77365   30.34439   28.14887   34.97858   35.95498    10   a
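For readers unfamiliar with the DT[.(...)] form: it is a join on the table's key, which data.table resolves by binary search on the sorted key column. A minimal sketch on a small made-up table (small and wanted are illustrative names, not from the benchmark above), checking only that the keyed join and the %in% filter return the same rows:

```r
library(data.table)

set.seed(1)
dates <- format(seq(as.Date("2015-01-01"), as.Date("2015-01-31"), by = "day"))
small <- data.table(date  = sample(dates, 1e4, replace = TRUE),
                    value = runif(1e4),
                    key   = "date")          # sorts the table by date
wanted <- format(seq(as.Date("2015-01-10"), as.Date("2015-01-17"), by = "day"))

# Filter with %in% on the date column
by_in <- small[date %in% wanted]

# Keyed join: each value in wanted is looked up in the key;
# nomatch = NULL drops values that have no matching row
by_join <- small[.(wanted), nomatch = NULL]

stopifnot(nrow(by_in) == nrow(by_join),
          identical(by_in$value, by_join$value))
```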
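I found that a bigger improvement comes from using Date objects directly (especially for the inequality comparisons, although those are still much slower, for the reasons @Frank pointed out):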
DT2 <- data.table(date=sample(seq(from=as.Date("2015-01-01"),
                                  to=as.Date("2015-01-31"),by="day"),
                              nrows,replace=TRUE),value=runif(nrows),key="date")
microbenchmark(times=10,
               DT[date %in% CharDateRange("2015-01-10", "2015-01-17")],
               DT[date >= "2015-01-10" & date <= "2015-01-17"],
               DT[.(CharDateRange("2015-01-10", "2015-01-17"))],
               DT2[date %in% seq(from=as.Date("2015-01-10"),to=as.Date("2015-01-17"),by="day")],
               DT2[date>="2015-01-10"&date<="2015-01-17"],
               DT2[.(seq(from=as.Date("2015-01-10"),to=as.Date("2015-01-17"),by="day"))])
Unit: milliseconds
                                                                                    expr        min         lq       mean     median         uq        max neval
                                 DT[date %in% CharDateRange("2015-01-10", "2015-01-17")]   30.22378   31.17341   32.56766   32.11701   33.53306   37.03804    10
                                         DT[date >= "2015-01-10" & date <= "2015-01-17"] 4856.15109 4877.55814 4952.64332 4910.17639 4952.12055 5337.04256    10
                                        DT[.(CharDateRange("2015-01-10", "2015-01-17"))]   27.32360   27.82355   28.69142   28.74196   29.27730   30.31997    10
DT2[date %in% seq(from = as.Date("2015-01-10"), to = as.Date("2015-01-17"), by = "day")]   23.32938   24.44665   26.11454   25.05308   26.34364   36.58792    10
                                        DT2[date >= "2015-01-10" & date <= "2015-01-17"]  264.96633  272.44326  276.98355  277.07129  279.22478  291.16967    10
       DT2[.(seq(from = as.Date("2015-01-10"), to = as.Date("2015-01-17"), by = "day"))]   18.89304   20.83852   20.85754   20.89787   21.05545   21.76082    10
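If the dates arrive as character from fread, one option along the same lines is to convert the column once and filter on the converted values. The sketch below uses data.table's integer-backed IDate class and its between() helper on a small made-up table (small, idate, lo and hi are illustrative names). It was not part of the benchmarks above, so it only checks that the range filter returns the same rows as the character comparison and makes no speed claim.

```r
library(data.table)

set.seed(1)
dates <- format(seq(as.Date("2015-01-01"), as.Date("2015-01-31"), by = "day"))
small <- data.table(date  = sample(dates, 1e4, replace = TRUE),
                    value = runif(1e4))

# Convert the character column once; IDate stores dates as integers
small[, idate := as.IDate(date)]

lo <- as.IDate("2015-01-10")
hi <- as.IDate("2015-01-17")

# Range filter on the character column (as in the question)
by_chr <- small[date >= "2015-01-10" & date <= "2015-01-17"]

# Range filter on the IDate column; between() is inclusive by default
by_idate <- small[between(idate, lo, hi)]

stopifnot(identical(by_chr$value, by_idate$value))
```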