R - Filtering character dates in data.table

Date: 2015-05-17 15:40:32

Tags: r data.table

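By default, data.table's fread reads dates in as character values. With this default, I noticed a large difference in execution time when filtering a date range in i, depending on whether I use logical comparisons or the %in% operator: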

library(data.table)

CharDateRange <- function(start.date, end.date) {
    sapply(seq(as.Date(start.date), as.Date(end.date), by="days"),
           function (x) format(x, "%Y-%m-%d"))
}

# define a range of dates, represented by a character vector
range.dates <- CharDateRange("2015-01-01", "2015-01-31")

# create example data table
nrows <- 1e7
DT <- data.table(date = sample(range.dates, nrows, replace=T),
                 value = runif(nrows))

The %in% operation is much faster than the logical comparison:

print(system.time(DT[date %in% CharDateRange("2015-01-10", "2015-01-17")]))
#  user  system elapsed 
# 0.238   0.017   0.254 

print(system.time(DT[date >= "2015-01-10" & date <= "2015-01-17"]))
#  user  system elapsed 
# 6.693   0.018   6.711

Can you explain why this happens?

2 answers:

Answer 0 (score: 2)

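This is expected, and it has nothing to do with data.table or with dates. The same gap shows up with plain character vectors: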

myvec <- rep(c("111111", "999999"), 1e7)
mycompvec <- as.character(111111:999999)

system.time(myvec %in% mycompvec)
#   user  system elapsed 
#   1.39    0.08    1.49 
system.time(myvec <= "999999" & myvec >= "111111")
#   user  system elapsed 
#   9.92    0.03   10.03 
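
A likely explanation, offered as a sketch rather than a profiled result: in base R, `%in%` is a thin wrapper around `match()`, which looks each element up in a table, whereas `<=` and `>=` on character vectors compare strings lexicographically using the locale's collating sequence, and the range filter does that twice per element. The small example below only shows that the two filters select the same elements for ISO-formatted date strings. The names `x`, `wanted`, `by_match` and `by_compare` are hypothetical and do not appear in the question.

```r
# Hypothetical small vector of ISO-formatted date strings (illustrative only)
x <- c("2015-01-05", "2015-01-10", "2015-01-17", "2015-01-20")

# Every day in the closed range, as character strings
wanted <- format(seq(as.Date("2015-01-10"), as.Date("2015-01-17"), by = "day"),
                 "%Y-%m-%d")

# What %in% does internally: a table lookup via match()
by_match <- match(x, wanted, nomatch = 0L) > 0L

# The range filter: two string comparisons per element
by_compare <- x >= "2015-01-10" & x <= "2015-01-17"

print(by_match)
print(by_compare)
```

The equivalence relies on the strings being zero-padded in year-month-day order, so that lexicographic order matches chronological order.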

Answer 1 (score: 0)

It is also worth pointing out that using a key is faster still (an improvement of about 17%, less dramatic than I expected):

DT <- data.table(date = sample(range.dates, nrows, replace=T),
                 value = runif(nrows),key="date")

library(microbenchmark)

microbenchmark(times=10,
               DT[date %in% CharDateRange("2015-01-10", "2015-01-17")],
               DT[date >= "2015-01-10" & date <= "2015-01-17"],
               DT[.(CharDateRange("2015-01-10", "2015-01-17"))])
Unit: milliseconds
                                                    expr        min         lq       mean     median         uq        max neval cld
 DT[date %in% CharDateRange("2015-01-10", "2015-01-17")]   30.17786   30.90273   33.29402   31.71152   31.99111   42.29018    10  a 
         DT[date >= "2015-01-10" & date <= "2015-01-17"] 4825.18913 4842.19703 4855.27402 4846.98401 4861.02841 4926.22591    10   b
        DT[.(CharDateRange("2015-01-10", "2015-01-17"))]   26.15394   26.77365   30.34439   28.14887   34.97858   35.95498    10  a 
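
For readers less familiar with the `DT[.(...)]` form, here is a minimal sketch of what the keyed lookup does, using a hypothetical four-row table (`small`, `wanted`, `by_in` and `by_key` are illustrative names, not part of the benchmark). One caveat it illustrates: a keyed join returns an NA row for every requested value that is absent from the table unless `nomatch = 0L` is given. This makes no difference in the benchmark above, where every day in the range is present.

```r
library(data.table)

# Hypothetical small table, keyed (and therefore sorted) by date
small <- data.table(date  = c("2015-01-20", "2015-01-10", "2015-01-17", "2015-01-05"),
                    value = 1:4,
                    key   = "date")

# "2015-01-11" is deliberately absent from the table
wanted <- c("2015-01-10", "2015-01-11", "2015-01-17")

# Vector scan: keep rows whose date is in the wanted set
by_in <- small[date %in% wanted]

# Keyed join on the key column; nomatch = 0L drops requested values with no match
by_key <- small[.(wanted), nomatch = 0L]

print(by_in)
print(by_key)
```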

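The bigger improvement, I found, comes from working with Date values directly. This helps most for the inequality comparisons, although they are still much slower than the other two approaches, for the reason @Frank pointed out: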

DT2 <- data.table(date=sample(seq(from=as.Date("2015-01-01"),
                                  to=as.Date("2015-01-31"),by="day"),
                              nrows,replace=T),value=runif(nrows),key="date")
microbenchmark(times=10,
               DT[date %in% CharDateRange("2015-01-10", "2015-01-17")],
               DT[date >= "2015-01-10" & date <= "2015-01-17"],
               DT[.(CharDateRange("2015-01-10", "2015-01-17"))],
               DT2[date %in% seq(from=as.Date("2015-01-10"),to=as.Date("2015-01-17"),by="day")],
               DT2[date>="2015-01-10"&date<="2015-01-17"],
               DT2[.(seq(from=as.Date("2015-01-10"),to=as.Date("2015-01-17"),by="day"))])
Unit: milliseconds
                                                                                          expr        min         lq       mean     median         uq        max neval
                                       DT[date %in% CharDateRange("2015-01-10", "2015-01-17")]   30.22378   31.17341   32.56766   32.11701   33.53306   37.03804    10
                                               DT[date >= "2015-01-10" & date <= "2015-01-17"] 4856.15109 4877.55814 4952.64332 4910.17639 4952.12055 5337.04256    10
                                              DT[.(CharDateRange("2015-01-10", "2015-01-17"))]   27.32360   27.82355   28.69142   28.74196   29.27730   30.31997    10
 DT2[date %in% seq(from = as.Date("2015-01-10"), to = as.Date("2015-01-17"),      by = "day")]   23.32938   24.44665   26.11454   25.05308   26.34364   36.58792    10
                                              DT2[date >= "2015-01-10" & date <= "2015-01-17"]  264.96633  272.44326  276.98355  277.07129  279.22478  291.16967    10
        DT2[.(seq(from = as.Date("2015-01-10"), to = as.Date("2015-01-17"),      by = "day"))]   18.89304   20.83852   20.85754   20.89787   21.05545   21.76082    10
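
If the dates arrive as character strings, one option along the same lines is to convert the column once and then filter on the converted values. The sketch below is an assumption about how that could look, not something benchmarked in this thread: it uses data.table's integer-backed `IDate` class and the `between()` helper on a hypothetical table `DT3`.

```r
library(data.table)

# Hypothetical small table with character dates (illustrative only)
DT3 <- data.table(date  = c("2015-01-05", "2015-01-10", "2015-01-17", "2015-01-20"),
                  value = 1:4)

# Convert the character column to IDate once, by reference
DT3[, date := as.IDate(date)]

# Closed-interval range filter on the converted column
res <- DT3[between(date, as.IDate("2015-01-10"), as.IDate("2015-01-17"))]

print(res)
```

Whether this beats the keyed lookup on 1e7 rows would need to be measured with the same microbenchmark setup as above.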