Question

I have data like this:

library(data.table)
NN = 10000000
set.seed(32040)
DT <- data.table(
  col = 1:10000000,
  timestamp = 1521872652 + sample(7000001, NN, replace = TRUE)
)

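I am trying to get the unique year and week as a code so that I can sort out duplicates (the real data table has a userID and much more). I have a working solution (below), but the part that pastes the week and year from the date column together is slow. Creating the date with the anytime package and extracting the week and year with lubridate are both still very quick. Can someone help me speed this up? Thanks!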

My slow code (it works, but I would like to speed it up):

library(anytime)
library(lubridate)
tz<-"Africa/Addis_Ababa"
DT$localtime<-  anytime(DT$timestamp, tz=tz) ###Lightning fast
DT$weekuni <- paste(year(DT$localtime),week(DT$localtime),sep="") ###super slow

My tests show that paste is what is killing me:

The anytime conversion to a date is very fast:

system.time(DT$localtime<-  anytime(DT$timestamp, tz=tz)) ###Lightning fast
       user  system elapsed 
      0.264   0.417   0.933

Starting from the date, the lubridate week and year conversions are fast, while paste is the slow step:

> system.time(DT$weekuni1 <- week(DT$localtime)) ### fast
   user  system elapsed 
  1.203   0.188   1.400 
> system.time(DT$weekuni2 <- year(DT$localtime))
   user  system elapsed 
  1.229   0.189   1.427 
> system.time(DT$weekuni <- paste0(DT$weekuni1,DT$weekuni2)) ### super slow
   user  system elapsed 
 14.652   0.344  15.483

Answer 1

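I made your code run about 50% faster by using format instead of paste.

First, for your use case I am not sure what the point of anytime is, since we can put the timestamp into a POSIXct structure almost instantly: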

DT[ , localtime := .POSIXct(timestamp, tz = tz)]

Next, I searched ?strptime for the ISO-week-based format codes and came up with:

DT[ , weekuni := format(localtime, format = '%G%V')]

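I am not 100% sure this will always be the same as paste(year, week), but it is for your test data; if there is any difference between them, you should ask whether it really matters.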
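To see where the two encodings can part ways, here is a minimal sketch on three hand-picked dates. The dates and the names `iso` and `cal` are illustrative and not taken from the question; the only assumption is that lubridate is installed, as in the question.

```r
library(lubridate)

# Three illustrative dates: a year start, a date inside the test range,
# and a January 1 that ISO 8601 assigns to the previous year's last week.
d <- as.Date(c("2018-01-01", "2018-03-24", "2021-01-01"))

iso <- format(d, "%G%V")          # ISO year + zero-padded ISO week
cal <- paste0(year(d), week(d))   # calendar year + lubridate week, no padding

# Side-by-side view; `same` flags the rows where both codes agree.
data.frame(date = d, iso = iso, cal = cal, same = iso == cal)
```

In a year that starts on a Monday, such as 2018, the two week numbers line up, so two-digit weeks produce identical codes. Single-digit weeks differ by the zero padding, and dates near a year boundary can differ in both year and week.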

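The only thing I can think of that might be faster is integer arithmetic on the timestamps themselves. That is only straightforward if the Africa/Addis_Ababa time zone makes no adjustment to its UTC offset within your sample time range. Current time zone data lists Africa/Addis_Ababa at a fixed UTC+3 with no daylight saving time, but it is worth confirming for your own range; any offset change inside the range would make the integer-arithmetic approach considerably harder.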
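A minimal sketch of that integer-arithmetic idea, under the loudly stated assumption that the zone keeps a single fixed UTC offset (taken as +3 hours here) over the whole range. The names `utc_offset` and `week_id` are hypothetical. The result is a running Monday-based week counter rather than a year-plus-week string, which is enough to group rows by week.

```r
# ASSUMPTION: one fixed UTC offset for the whole time range (+3 hours here).
# Verify this for your own data before relying on it.
utc_offset <- 3 * 3600

week_id <- function(timestamp, offset = utc_offset) {
  # Whole days since 1970-01-01 in local time.
  local_day <- (timestamp + offset) %/% 86400
  # 1970-01-01 was a Thursday, so shifting by 3 days makes every
  # 7-day block start on a Monday, like ISO weeks do.
  (local_day + 3) %/% 7
}
```

With the question's table this could be used as `DT[, weekuni := week_id(timestamp)]`. Because the counter never resets at a year boundary, it identifies each week uniquely without needing the year.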

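For the record, data.table::year and data.table::week run at about the same speed as the approach used here, but they use a different definition of "year" and "week" from the ISO year/week that %G%V produces above (lubridate::week is likewise not ISO-based; the ISO variants in lubridate are isoyear and isoweek).

data.table does not yet have an implementation of isoyear, and data.table::isoweek is much slower than lubridate::week.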

Answer 2

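If you only want to define the week of the year based on the date, you can get a solution that is about 20 times faster than the original: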

library(data.table)
NN = 10000000
# NN = 1e4
set.seed(32040)
DT <- data.table(
  col = seq_len(NN),
  timestamp = 1521872652 + sample(7000001, NN, replace = TRUE)
)
DT1 <- copy(DT)

DT2 <- copy(DT)
tz <- "Africa/Addis_Ababa"

old <- function(DT) {
  DT$localtime<-  anytime::anytime(DT$timestamp, tz=tz) ###Lightning fast
  DT$weekuni <- paste(lubridate::year(DT$localtime), lubridate::week(DT$localtime), sep="")
  DT[, timestamp := NULL]
  DT[, .(col, localtime, weekuni)]
}

new <- function(DT) {
  DT[ , localtime := anytime::anytime(timestamp, tz = tz)]
  DT[, Date := as.Date(localtime)]
  DT[, weekuni := paste0(lubridate::year(.BY[[1L]]), lubridate::week(.BY[[1L]])),
     keyby = "Date"]
  DT[, Date := NULL]
  # DT[, timestamp := NULL]
  DT[order(col), .(col, localtime, weekuni)]
}

bench::mark(old(DT1), new(DT2), check = FALSE, filter_gc = FALSE)
#> # A tibble: 2 x 10
#>   expression     min    mean median    max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:t> <bch:t> <bch:> <bch:>     <dbl> <bch:byt> <dbl> <int>
#> 1 old(DT1)    22.39s  22.39s 22.39s 22.39s    0.0447    2.28GB     5     1
#> 2 new(DT2)     1.13s   1.13s  1.13s  1.13s    0.888   878.12MB     1     1
#> # ... with 1 more variable: total_time <bch:tm>

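Created on 2018-06-23 by the reprex package (v0.2.0).

Even if you don't, you can still get roughly an 8x speedup (22.2s versus 2.8s in the benchmark below) by calling paste only once per date: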

library(data.table)
NN = 1e7
# NN = 1e4
set.seed(32040)
DT <- data.table(
  col = seq_len(NN),
  timestamp = 1521872652 + sample(7000001, NN, replace = TRUE)
)
DT1 <- copy(DT)

DT2 <- copy(DT)
DT3 <- copy(DT)
tz <- "Africa/Addis_Ababa"

old <- function(DT) {
  DT$localtime<-  anytime::anytime(DT$timestamp, tz=tz) ###Lightning fast
  DT$weekuni <- paste(lubridate::year(DT$localtime), lubridate::week(DT$localtime), sep="")
  DT[, timestamp := NULL]
  DT[, .(col, weekuni)]
}

new <- function(DT) {
  DT[ , Date := anytime::anydate(timestamp, tz = tz)]
  DT[, weekuni := paste0(lubridate::year(.BY[[1L]]), lubridate::week(.BY[[1L]])),
     keyby = "Date"]
  DT[, Date := NULL]
  # DT[, timestamp := NULL]
  setorderv(DT[, .(col, weekuni)], "col")
}


bench::mark(old(DT1), new(DT2), check = TRUE, filter_gc = FALSE)
#> # A tibble: 2 x 10
#>   expression     min    mean median    max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:t> <bch:t> <bch:> <bch:>     <dbl> <bch:byt> <dbl> <int>
#> 1 old(DT1)     22.2s   22.2s  22.2s  22.2s    0.0450    2.21GB     4     1
#> 2 new(DT2)      2.8s    2.8s   2.8s   2.8s    0.357     1.42GB     3     1
#> # ... with 1 more variable: total_time <bch:tm>
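The idea behind both versions above, building the string once per distinct date instead of once per row, can also be sketched in base R with unique and match. This is an illustration under assumptions rather than a benchmarked replacement: the helper name `week_code_by_date` is hypothetical, and it uses the ISO codes from format(..., "%G%V") instead of lubridate's year and week.

```r
# Hypothetical helper: compute the week code once per distinct local date,
# then map the codes back onto every row.
week_code_by_date <- function(timestamp, tz) {
  # Local calendar date of each timestamp (tz is passed explicitly).
  d <- as.Date(.POSIXct(timestamp, tz = tz), tz = tz)
  u <- unique(d)               # a few dozen dates instead of millions of rows
  code <- format(u, "%G%V")    # the expensive string step, once per date
  code[match(d, u)]            # look each row's date up in the small table
}
```

For example, `DT[, weekuni := week_code_by_date(timestamp, tz)]` would fill the column in place without reordering the table.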

Speeding up paste of two columns in a data.table in R (reproducible)
