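I'm wondering whether anyone can find a faster way to compute the combinations of elements within a vector. My approach works, but it is slow with roughly 6 million elements in the vector.
Test vector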
test.vector <- c("335261 344015 537633","22404 132858","254654 355860 488288","219943 373817","331839 404477")
My approach
lapply(strsplit(test.vector, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))
Expected output
[[1]]
[1] "335261344015" "335261537633" "344015537633"
[[2]]
[1] "22404132858"
[[3]]
[1] "254654355860" "254654488288" "355860488288"
[[4]]
[1] "219943373817"
[[5]]
[1] "331839404477"
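Answer 0 (score: 3)
This answer is about 25x faster than the OP's solution on large test cases. It does not rely on paste; instead, it exploits properties of numbers and vectorized operations. To generate the combinations of a vector we also use comboGeneral from the RcppAlgos package (I am the author), which is much faster than combn and the combnPrim function from the linked answer. First, we show the efficiency gain of comboGeneral over the other functions: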
## library(gRbase)
library(RcppAlgos)
library(microbenchmark)
microbenchmark(gRbase::combnPrim(300, 2), combn(300, 2),
comboGeneral(300, 2), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
gRbase::combnPrim(300, 2) 5.145654 5.192439 4.83561 7.167839 4.320497 3.98992 100
combn(300, 2) 204.866624 192.559119 143.75540 174.079339 102.733367 539.12325 100
comboGeneral(300, 2) 1.000000 1.000000 1.00000 1.000000 1.000000 1.00000 100
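When comparing these generators, note that the layout of the result matters as much as the speed. The code later in this answer indexes the result as combs[, 1] and combs[, 2], i.e. it expects one combination per row, whereas base R's combn returns one combination per column. The following is only a base R sketch of that layout difference (the variable names byColumn and byRow are illustrative and not part of the original answer):

```r
# combn() returns one combination per column: a 2 x choose(n, 2) matrix
byColumn <- combn(4, 2)
dim(byColumn)

# Transposing gives one combination per row, the layout assumed by
# code that indexes combs[, 1] and combs[, 2]
byRow <- t(byColumn)
dim(byRow)

stopifnot(nrow(byRow) == choose(4, 2), ncol(byRow) == 2)
```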
Now we create a function that generates reproducible random data to pass to our test functions:
makeTestSet <- function(vectorSize, elementSize, mySeed = 42, withRep = FALSE) {
    set.seed(mySeed)
    sapply(1:vectorSize, function(x) {
        paste(sample(10^6, s1 <- sample(2:elementSize, 1), replace = withRep), collapse = " ")
    })
}
makeTestSet(5, 3)
[1] "937076 286140 830446" "519096 736588 134667" "705065 457742 719111"
[4] "255429 462293 940013" "117488 474997 560332"
Looks good. Now let's see whether setting fixed = TRUE gains us anything (as @MichaelChirico noted above):
bigVec <- makeTestSet(10, 100000)
microbenchmark(standard = strsplit(bigVec, " "),
withFixed = strsplit(bigVec, " ", fixed = TRUE),
times = 15, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
standard 4.447413 4.296662 4.133797 4.339537 4.084019 3.415639 15
withFixed 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 15
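The speedup comes without changing the result: with fixed = TRUE, strsplit matches the split string literally instead of treating it as a regular expression, and for a single space both interpretations split in the same places. A minimal base R check, assuming input shaped like the OP's test vector (the names viaRegex and viaFixed are illustrative):

```r
x <- c("335261 344015 537633", "22404 132858")

# Default: the split argument is interpreted as a regular expression
viaRegex <- strsplit(x, " ")

# fixed = TRUE: the split argument is matched as a literal string
viaFixed <- strsplit(x, " ", fixed = TRUE)

# Both calls produce the same list of character vectors
stopifnot(identical(viaRegex, viaFixed))
```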
@MichaelChirico is spot on. Putting it all together, we get:
combPairFast <- function(testVec) {
    lapply(strsplit(testVec, " ", fixed = TRUE), function(x) {
        combs <- RcppAlgos::comboGeneral(as.numeric(x), 2)
        unique(combs[,1] * (10)^(as.integer(log10(combs[,2])) + 1L) + combs[,2])
    })
}
## test.vector defined above by OP
combPairFast(test.vector)
[[1]]
[1] 335261344015 335261537633 344015537633
[[2]]
[1] 22404132858
[[3]]
[1] 254654355860 254654488288 355860488288
[[4]]
[1] 219943373817
[[5]]
[1] 331839404477
## OP original code
combPairOP <- function(testVec) {
lapply(strsplit(testVec, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))
}
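As stated in the comments on the OP's question, the largest number is less than one million (600000, to be exact). This means that after we multiply one of the numbers by at most 10^6 and add it to another 6-digit number (which is equivalent to simply concatenating the two digit strings), we are guaranteed to stay within the range of integers that base R's numeric type represents exactly (i.e. up to 2^53 - 1). This is good, because arithmetic on numbers is much more efficient than operations on strings.
All that remains is the benchmark: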
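To make the arithmetic concrete, here is a small base R sketch of the concatenation trick on two pairs taken from the OP's test vector. The helper name concatNumeric is hypothetical and only wraps the expression already used inside combPairFast; it assumes positive whole numbers as input:

```r
# Hypothetical helper (not part of the original answer): join two positive
# whole numbers by arithmetic instead of paste0()
concatNumeric <- function(a, b) {
  # Number of decimal digits in b, e.g. 132858 -> 6
  nDigits <- as.integer(log10(b)) + 1L
  # Shift a left by that many decimal places, then add b
  a * 10^nDigits + b
}

a <- c(335261, 22404)
b <- c(344015, 132858)

viaArithmetic <- concatNumeric(a, b)
viaPaste <- as.numeric(paste0(a, b))

# Both routes agree, and the results stay below 2^53, so they are exact
stopifnot(identical(viaArithmetic, viaPaste))
stopifnot(all(viaArithmetic < 2^53))

format(viaArithmetic, scientific = FALSE)
```

The check against 2^53 is the important part: beyond that bound, doubles can no longer represent every integer, and distinct pairs could collapse to the same value.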
test.vector <- makeTestSet(100, 50)
microbenchmark(combPairOP(test.vector),
combPairFast(test.vector),
times = 20, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
combPairOP(test.vector) 22.33991 22.4264 21.67291 22.11017 21.729 25.23342 20
combPairFast(test.vector) 1.00000 1.0000 1.00000 1.00000 1.000 1.00000 20
On a larger vector:
bigTest.vector <- makeTestSet(1000, 100, mySeed = 22, withRep = TRUE)
## Duplicate values exist
any(sapply(strsplit(bigTest.vector, " ", fixed = TRUE), function(x) {
    any(duplicated(x))
}))
[1] TRUE
system.time(t1 <- combPairFast(bigTest.vector))
user system elapsed
0.303 0.011 0.314
system.time(t2 <- combPairOP(bigTest.vector))
user system elapsed
8.820 0.081 8.902 ### 8.902 / 0.314 ~= 28x faster
## results are the same
all.equal(t1, lapply(t2, as.numeric))
[1] TRUE