Question

So, I have many data.tables that I want to combine into a single data.table with no duplicate rows. The "naive" way to do this is to wrap an rbind call in unique:

unique(do.call(rbind, list.of.tables))

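This certainly works, but it is slow. In my real case, the tables have two columns: a hash string and a size. At this point in the code they are unkeyed. I tried keying by the hash first, but the gain in combining is offset by the time spent keying.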

Here is how I benchmarked the two options:

library(data.table)
library(microbenchmark)

# Build `numberOfHashes` random 16-character strings drawn from 0-9 and a-z
makeHash <- function(numberOfHashes) {
  hashspace <- c(0:9, sapply(97:122, function(x) rawToChar(as.raw(x))))
  replicate(numberOfHashes, paste(sample(hashspace, 16), collapse=""))
}

# Combine two unkeyed tables that share part of their rows
mergeNoKey <- function(tableLength, modCount=tableLength/2) {
  A <- B <- data.table(hash=makeHash(tableLength), size=sample(1:(1024^2), tableLength))

  # Overwrite the first modCount rows of A so that only part of A duplicates B
  A[1:modCount] <- data.table(hash=makeHash(modCount), size=sample(1:(1024^2), modCount))

  C <- unique(rbind(A,B))
}

# Same as above, but key both tables on hash before combining
mergeWithKey <- function(tableLength, modCount=tableLength/2) {
  A <- B <- data.table(hash=makeHash(tableLength), size=sample(1:(1024^2), tableLength))

  A[1:modCount] <- data.table(hash=makeHash(modCount), size=sample(1:(1024^2), modCount))

  setkey(A, hash)
  setkey(B, hash)

  C <- unique(rbind(A,B))
}

m <- microbenchmark(mergeNoKey(1000), mergeWithKey(1000), times=10)
plot(m)

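I have played around with tableLength and times, and there is not much difference in performance between the two. I feel like there is a better, more data.table-ish way to do this.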

In practice I need to do this with many data.tables, not just two, so scalability is very important; I only wanted to keep the code above simple.

Thanks in advance!

Answer 1

I think you want to use rbindlist and unique.data.table ...

C <- unique( rbindlist( list( A , B ) ) )

This replaces unique(rbind()) when working with data.tables.
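Since the question mentions many tables rather than two, here is a minimal sketch of the same idea applied to a whole list. The contents of list.of.tables are made up for illustration (the name is borrowed from the question); only rbindlist and unique come from data.table itself.

```r
library(data.table)

# Hypothetical input: small two-column tables with overlapping rows
list.of.tables <- list(
  data.table(hash = c("a1", "b2"), size = c(10L, 20L)),
  data.table(hash = c("b2", "c3"), size = c(20L, 30L)),
  data.table(hash = c("c3", "d4"), size = c(30L, 40L))
)

# Stack every table in the list in one call, then drop exact duplicate rows
combined <- unique(rbindlist(list.of.tables))
print(combined)
```

Because rbindlist takes the list directly, no do.call is needed. If rows should count as duplicates whenever the hash alone matches, unique also accepts a by argument, for example unique(combined, by = "hash"); whether that is appropriate depends on whether one hash can legitimately appear with different sizes.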
