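I have a data.table in which several IDs are pasted together into a single character column, separated by underscores. I'm trying to split the IDs into separate columns, but my best approach is really slow on my large dataset (~250M rows). Interestingly, the operation does not seem to take O(N) time, which is what I would expect. In other words, it is quite fast up to about 50M rows and then becomes very slow.
Make some data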
require(data.table)
set.seed(2016)
sim_rows <- 40000000
dt <- data.table(
LineId = rep("L0123", times=sim_rows),
StationId = rep("S0123", times=sim_rows),
TimeId = rep("T0123", times=sim_rows)
)
dt[, InfoId := paste(LineId, StationId, TimeId, sep="_")]
dt[, c("LineId", "StationId", "TimeId") := NULL]
gc(reset=TRUE) # free up about 1.5 GB of memory
dt
InfoId
1: L0123_S0123_T0123
2: L0123_S0123_T0123
3: L0123_S0123_T0123
4: L0123_S0123_T0123
5: L0123_S0123_T0123
---
39999996: L0123_S0123_T0123
39999997: L0123_S0123_T0123
39999998: L0123_S0123_T0123
39999999: L0123_S0123_T0123
40000000: L0123_S0123_T0123
Check timings
system.time( dt[1:10000000, c("LineId", "StationId", "TimeId") :=
tstrsplit(InfoId, split="_", fixed=TRUE)] )
user system elapsed
5.179 0.634 3.867
system.time( dt[1:20000000, c("LineId", "StationId", "TimeId") :=
tstrsplit(InfoId, split="_", fixed=TRUE)] )
user system elapsed
7.805 0.958 7.703
system.time( dt[1:30000000, c("LineId", "StationId", "TimeId") :=
tstrsplit(InfoId, split="_", fixed=TRUE)] )
user system elapsed
12.556 1.782 12.349
system.time( dt[1:40000000, c("LineId", "StationId", "TimeId") :=
tstrsplit(InfoId, split="_", fixed=TRUE)] )
user system elapsed
29.260 2.822 29.895
Check results
dt
InfoId LineId StationId TimeId
1: L0123_S0123_T0123 L0123 S0123 T0123
2: L0123_S0123_T0123 L0123 S0123 T0123
3: L0123_S0123_T0123 L0123 S0123 T0123
4: L0123_S0123_T0123 L0123 S0123 T0123
5: L0123_S0123_T0123 L0123 S0123 T0123
---
39999996: L0123_S0123_T0123 L0123 S0123 T0123
39999997: L0123_S0123_T0123 L0123 S0123 T0123
39999998: L0123_S0123_T0123 L0123 S0123 T0123
39999999: L0123_S0123_T0123 L0123 S0123 T0123
40000000: L0123_S0123_T0123 L0123 S0123 T0123
How can I speed this up?
Answer 0 (score: 3)
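stringr is newer, is built internally on stringi, and is generally even faster.
In addition, stringi, and to a lesser extent stringr, provide multiple variants of each string operation (fixed/coll/regex/words/boundaries/charclass), each optimized for the type of operand.
Try stri_split_fixed(..., '_'); it should be very fast.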
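require(stringi)
# stri_split_fixed() returns one vector per row, so transpose() it into one vector per column before assigning
system.time( dt[1:1e6, c("LineId", "StationId", "TimeId") := transpose(stri_split_fixed(InfoId, "_"))] )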
user system elapsed
2.635 0.497 3.379 # on my old machine; please tell us your numbers?
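As a rough sketch of why the shape of the split result matters here: stri_split_fixed() returns a list with one character vector per row, while `:=` with three column names expects a list with one vector per column. The tiny table and its ID values below are made up for illustration; only the column names come from the question.

```r
library(data.table)
library(stringi)

# Hypothetical two-row table with the same layout as InfoId in the question
small <- data.table(InfoId = c("L0001_S0002_T0003", "L0004_S0005_T0006"))

# One character vector per row: a list of length nrow(small)
parts <- stri_split_fixed(small$InfoId, "_")

# data.table::transpose() flips that into one vector per output column
cols <- transpose(parts)

# Now the right-hand side has three elements, matching the three new columns
small[, c("LineId", "StationId", "TimeId") := cols]
print(small)
```

If every string is known to have exactly three parts, stri_split_fixed(..., simplify = TRUE) is another way to get a rectangular result (a character matrix), though it would then need converting to columns before assignment.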
Answer 1 (score: 1)
One option is stri_split from stringi:
library(stringi)
dt1 <- copy(dt)
system.time( dt[1:40000000, c("LineId", "StationId", "TimeId") :=
tstrsplit(InfoId, split="_", fixed=TRUE)] )
# user system elapsed
# 41.20 1.03 42.39
system.time( dt1[1:40000000, c("LineId", "StationId", "TimeId") :=
transpose(stri_split(InfoId, fixed = "_"))] )
# user system elapsed
# 28.78 0.98 29.74
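Before swapping one splitter for the other on the full table, it may be worth confirming on a small sample that both produce the same columns. This is a minimal sketch under that assumption; the sample size of 1000 and the variable names are arbitrary choices, not part of the answer above.

```r
library(data.table)
library(stringi)

# Small sample using the same ID pattern as the simulated data in the question
ids <- rep("L0123_S0123_T0123", times = 1000)

# data.table's built-in split-and-transpose helper
via_tstrsplit <- tstrsplit(ids, split = "_", fixed = TRUE)

# stringi split, transposed into one vector per column
via_stringi <- transpose(stri_split(ids, fixed = "_"))

# Compare the two results element by element
same <- identical(via_tstrsplit, via_stringi)
print(same)
```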