Fast way to split a data.table by the values of a column

Time: 2016-01-20 20:33:27

Tags: r data.table

I have a large data.table object that looks like this:

> head(trade_hist,12)
        timestamp   price takeFrom volume
 1: 1448905691000 2440.21      ask  0.346
 2: 1448905691000 2440.11      bid  0.016
 3: 1448905691000 2439.78      ask  0.019
 4: 1448905691000 2440.16      ask  0.470
 5: 1448905691000 2439.59      bid  0.029
 6: 1448905691000 2440.16      bid  0.006
 7: 1448905691000 2439.75      ask  0.045
 8: 1448905691000 2440.12      ask  0.042
 9: 1448905691000 2439.62      bid  0.168
10: 1448905692000 2439.49      ask  0.016
11: 1448905692000 2439.46      ask  0.013
12: 1448905692000 2439.43      bid  0.394

I want to split it by timestamp into a list of data.tables. I can do that with:

> trade_hist_list <- split(trade_hist,f=trade_hist$timestamp)
> trade_hist_list[[1]]
       timestamp   price takeFrom volume
1: 1448905691000 2440.21      ask  0.346
2: 1448905691000 2440.11      bid  0.016
3: 1448905691000 2439.78      ask  0.019
4: 1448905691000 2440.16      ask  0.470
5: 1448905691000 2439.59      bid  0.029
6: 1448905691000 2440.16      bid  0.006
7: 1448905691000 2439.75      ask  0.045
8: 1448905691000 2440.12      ask  0.042
9: 1448905691000 2439.62      bid  0.168

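However, this is very slow. Since the timestamps are already sorted, it should be possible to do this much faster. Any suggestions? Thanks!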

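1 answer:

Answer 0 (score: 3)

I am not sure about the performance, but you can try this. It is the currently proposed split.data.table method. It may look complicated, but only because it also handles nested sub-lists, which makes no difference in your case.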

split.data.table <- function(x, f, drop = FALSE, by, flatten = FALSE, ...){
    # the grouping column names can be passed through either `f` or `by`
    if(missing(by) && !missing(f)) by = f
    stopifnot(!missing(by), is.character(by), is.logical(drop), is.logical(flatten), !".ll" %in% names(x), by %in% names(x))
    if(!flatten){
        # nested result: split on the first `by` column, then recurse on the remaining ones
        .by = by[1L]
        tmp = x[, list(.ll=list(.SD)), by = .by, .SDcols = if(drop) setdiff(names(x), .by) else names(x)]
        setattr(ll <- tmp$.ll, "names", tmp[[.by]])
        if(length(by) > 1L) return(lapply(ll, split.data.table, drop = drop, by = by[-1L])) else return(ll)
    } else {
        # flat result: one element per combination of `by` values, named by pasting the values with "."
        tmp = x[, list(.ll=list(.SD)), by=by, .SDcols = if(drop) setdiff(names(x), by) else names(x)]
        setattr(ll <- tmp$.ll, 'names', tmp[, .(nm = paste(.SD, collapse = ".")), by = by, .SDcols = by]$nm)
        return(ll)
    }
}
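
Stripped of the nested and flatten handling, the core step of the function above can be sketched as follows. This is only an illustration on a five-row toy subset of the question's data. The wrapping of .SD in copy() is a precaution added here, not part of the function above, and the names tmp and ll are reused only for readability.

```r
library(data.table)

# toy subset of the table from the question (illustrative data only)
trade_hist <- data.table(
  timestamp = c(1448905691000, 1448905691000, 1448905691000,
                1448905692000, 1448905692000),
  price     = c(2440.21, 2440.11, 2439.78, 2439.49, 2439.46),
  takeFrom  = c("ask", "bid", "ask", "ask", "ask"),
  volume    = c(0.346, 0.016, 0.019, 0.016, 0.013)
)

# one row per timestamp; the list column .ll holds that group's sub-table
# copy() is a precaution so each element owns its data
tmp <- trade_hist[, list(.ll = list(copy(.SD))),
                  by = "timestamp", .SDcols = names(trade_hist)]

# pull the list column out and name the elements after the timestamps
ll <- setNames(tmp$.ll, tmp$timestamp)
```

Each element of ll is then a data.table holding the rows of one timestamp.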

If you provide a reproducible example, I can show the function call on it.

OK, I guess...

split.data.table(trade_hist, by="timestamp")
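
For a reproducible version of that call, here is a minimal sketch that rebuilds the twelve rows shown in the question. It assumes a data.table release that ships its own split method accepting `by` (1.9.8 or later, as far as I know), so the custom function does not have to be defined first. With an older release, define the function above before running the last line.

```r
library(data.table)

# the twelve rows printed in the question
trade_hist <- data.table(
  timestamp = c(rep(1448905691000, 9), rep(1448905692000, 3)),
  price     = c(2440.21, 2440.11, 2439.78, 2440.16, 2439.59, 2440.16,
                2439.75, 2440.12, 2439.62, 2439.49, 2439.46, 2439.43),
  takeFrom  = c("ask", "bid", "ask", "ask", "bid", "bid",
                "ask", "ask", "bid", "ask", "ask", "bid"),
  volume    = c(0.346, 0.016, 0.019, 0.470, 0.029, 0.006,
                0.045, 0.042, 0.168, 0.016, 0.013, 0.394)
)

# split into a list of data.tables, one element per distinct timestamp
trade_hist_list <- split(trade_hist, by = "timestamp")

# first element: the nine rows sharing the first timestamp
first_group <- trade_hist_list[[1]]
```

Whether this is actually faster than split with f = trade_hist$timestamp on the full table would need to be measured on the real data.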

If you are grouping on POSIXct, you may also need setNumericRounding(0), but I am not sure about that.
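
A small sketch of what that would look like, in case it turns out to be needed. The table dt and its columns are made up for illustration. Whether the setting changes anything depends on the data.table version and its default rounding, which I have not checked for every release.

```r
library(data.table)

# remember the current setting so it can be restored afterwards
old_rounding <- getNumericRounding()

# turn off rounding of numeric values when grouping
setNumericRounding(0L)

# hypothetical table with a POSIXct column (second resolution)
dt <- data.table(
  ts = as.POSIXct(c(1448905691, 1448905691, 1448905692),
                  origin = "1970-01-01", tz = "UTC"),
  v  = 1:3
)

# number of rows per distinct POSIXct value
counts <- dt[, .N, by = ts]

# restore the previous setting
setNumericRounding(old_rounding)
```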