I have a large data.table object that looks like this:
> head(trade_hist,12)
timestamp price takeFrom volume
1: 1448905691000 2440.21 ask 0.346
2: 1448905691000 2440.11 bid 0.016
3: 1448905691000 2439.78 ask 0.019
4: 1448905691000 2440.16 ask 0.470
5: 1448905691000 2439.59 bid 0.029
6: 1448905691000 2440.16 bid 0.006
7: 1448905691000 2439.75 ask 0.045
8: 1448905691000 2440.12 ask 0.042
9: 1448905691000 2439.62 bid 0.168
10: 1448905692000 2439.49 ask 0.016
11: 1448905692000 2439.46 ask 0.013
12: 1448905692000 2439.43 bid 0.394
I want to split it by timestamp into a list of data.tables. I can do that with:
> trade_hist_list <- split(trade_hist,f=trade_hist$timestamp)
> trade_hist_list[[1]]
timestamp price takeFrom volume
1: 1448905691000 2440.21 ask 0.346
2: 1448905691000 2440.11 bid 0.016
3: 1448905691000 2439.78 ask 0.019
4: 1448905691000 2440.16 ask 0.470
5: 1448905691000 2439.59 bid 0.029
6: 1448905691000 2440.16 bid 0.006
7: 1448905691000 2439.75 ask 0.045
8: 1448905691000 2440.12 ask 0.042
9: 1448905691000 2439.62 bid 0.168
However, this is very slow. Since the timestamps are already sorted, it should be much faster. Any suggestions? Thanks!
Answer 0: (score: 3)
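I'm not sure about performance, but you can try this. It is the currently proposed split.data.table method. It may look complicated, but only because it also handles nested sublists, which makes no difference in your case.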
split.data.table <- function(x, f, drop = FALSE, by, flatten = FALSE, ...){
  # allow the base split() argument name `f` as an alias for `by`
  if(missing(by) && !missing(f)) by = f
  stopifnot(!missing(by), is.character(by), is.logical(drop), is.logical(flatten), !".ll" %in% names(x), by %in% names(x))
  if(!flatten){
    # nested result: split on the first column, then recurse on the remaining ones
    .by = by[1L]
    tmp = x[, list(.ll=list(.SD)), by = .by, .SDcols = if(drop) setdiff(names(x), .by) else names(x)]
    # name each list element after its group value
    setattr(ll <- tmp$.ll, "names", tmp[[.by]])
    if(length(by) > 1L) return(lapply(ll, split.data.table, drop = drop, by = by[-1L])) else return(ll)
  } else {
    # flat result: one element per combination of `by` values, named "v1.v2..."
    tmp = x[, list(.ll=list(.SD)), by=by, .SDcols = if(drop) setdiff(names(x), by) else names(x)]
    setattr(ll <- tmp$.ll, 'names', tmp[, .(nm = paste(.SD, collapse = ".")), by = by, .SDcols = by]$nm)
    return(ll)
  }
}
If you provide a reproducible example, I can show the function call on it.
OK, my guess would be:
split.data.table(trade_hist, by="timestamp")
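For a self-contained version of that call, here is a minimal sketch. The sample rows are made up to mirror the columns in the question, and it assumes data.table 1.9.8 or later, where a split method with a `by` argument ships with the package, so the function above does not have to be defined by hand.

```r
library(data.table)

# Made-up sample with the same columns as trade_hist in the question
trade_hist <- data.table(
  timestamp = c(rep(1448905691000, 3), rep(1448905692000, 2)),
  price     = c(2440.21, 2440.11, 2439.78, 2439.49, 2439.46),
  takeFrom  = c("ask", "bid", "ask", "ask", "ask"),
  volume    = c(0.346, 0.016, 0.019, 0.016, 0.013)
)

# Dispatches to data.table's split method; `by` names the grouping column
trade_hist_list <- split(trade_hist, by = "timestamp")

length(trade_hist_list)      # one element per distinct timestamp
nrow(trade_hist_list[[1]])   # rows belonging to the first timestamp
```

Whether this beats `split(trade_hist, f = trade_hist$timestamp)` on your data is something to measure, for example with `system.time()`.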
If you are grouping on POSIXct, you may also need to use setNumericRounding(0), but I'm not sure about that.
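In case it helps, here is a rough sketch of the POSIXct variant. It assumes the timestamps are epoch milliseconds and that data.table 1.9.8 or later is installed; the `time` column and the sample rows are made up for illustration. Recent data.table versions already default to a numeric rounding of 0, so the explicit call may be redundant.

```r
library(data.table)

# Made-up sample; timestamp is assumed to be milliseconds since the epoch
dt <- data.table(
  timestamp = c(1448905691000, 1448905691000, 1448905692000),
  price     = c(2440.21, 2440.11, 2439.49)
)

# Hypothetical helper column: convert milliseconds to POSIXct
dt[, time := as.POSIXct(timestamp / 1000, origin = "1970-01-01", tz = "UTC")]

old_rounding <- getNumericRounding()  # remember the current setting
setNumericRounding(0)                 # group on the full double precision
by_time <- split(dt, by = "time")
setNumericRounding(old_rounding)      # restore the previous setting

length(by_time)  # one element per distinct time
```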