Question

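I'm wondering whether anyone can find a faster way to compute the combinations of elements within each entry of a vector. My approach works, but it is slow when the vector has about 6 million elements.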

Test vector

test.vector <- c("335261 344015 537633","22404 132858","254654 355860 488288","219943 373817","331839 404477")

My approach

lapply(strsplit(test.vector, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))

Expected output

[[1]]
[1] "335261344015" "335261537633" "344015537633"

[[2]]
[1] "22404132858"

[[3]]
[1] "254654355860" "254654488288" "355860488288"

[[4]]
[1] "219943373817"

[[5]]
[1] "331839404477"

Answer 1

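This answer is roughly 25x faster than the OP's solution on large test cases. It does not rely on paste; instead it exploits properties of numbers and vectorized operations. To generate the combinations of a vector we also use comboGeneral from the RcppAlgos package (I am the author), which is much faster than combn and than combnPrim from the linked answer. First, we show the efficiency gain of comboGeneral over the other functions: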

## library(gRbase)
library(RcppAlgos)
library(microbenchmark)

microbenchmark(gRbase::combnPrim(300, 2), combn(300, 2), 
               comboGeneral(300, 2), unit = "relative")
Unit: relative
                     expr        min         lq      mean     median         uq       max neval
gRbase::combnPrim(300, 2)   5.145654   5.192439   4.83561   7.167839   4.320497   3.98992   100
            combn(300, 2) 204.866624 192.559119 143.75540 174.079339 102.733367 539.12325   100
     comboGeneral(300, 2)   1.000000   1.000000   1.00000   1.000000   1.000000   1.00000   100

Now, we create a function that generates reproducible random data to pass to our test functions:

makeTestSet <- function(vectorSize, elementSize, mySeed = 42, withRep = FALSE) {
    set.seed(mySeed)
    sapply(1:vectorSize, function(x) {
        paste(sample(10^6, s1 <- sample(2:elementSize, 1), replace = withRep), collapse = " ")
    })
}

makeTestSet(5, 3)
[1] "937076 286140 830446" "519096 736588 134667" "705065 457742 719111" 
[4] "255429 462293 940013" "117488 474997 560332"

Looks good. Now, let's see whether setting fixed = TRUE gains us anything (as mentioned above by @MichaelChirico):

bigVec <- makeTestSet(10, 100000)

microbenchmark(standard = strsplit(bigVec, " "), 
               withFixed = strsplit(bigVec, " ", fixed = TRUE), 
               times = 15, unit = "relative")
Unit: relative
     expr      min       lq     mean   median       uq      max neval
 standard 4.447413 4.296662 4.133797 4.339537 4.084019 3.415639    15
withFixed 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    15
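As a small sanity check on that switch, the sketch below (base R only, with made-up inputs) confirms that a literal single-space split returns the same pieces with and without fixed = TRUE, and shows a separator where the two modes differ because the pattern would otherwise be read as a regular expression.

```r
x <- c("335261 344015 537633", "22404 132858")

# A single space has no special regex meaning, so both calls agree here.
sameSplit <- identical(strsplit(x, " "), strsplit(x, " ", fixed = TRUE))
sameSplit

# "." is a regex metacharacter, so only the fixed version splits on the
# literal dot; the regex version does not give the intended pieces.
fixedDot <- strsplit("a.b", ".", fixed = TRUE)[[1]]
regexDot <- strsplit("a.b", ".")[[1]]
fixedDot
```

So fixed = TRUE is a safe substitution for this data, since the separator is a plain space.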

@MichaelChirico is spot on. Putting it all together, we get:

combPairFast <- function(testVec) {
    lapply(strsplit(testVec, " ", fixed = TRUE), function(x) {
        combs <- RcppAlgos::comboGeneral(as.numeric(x), 2)
        unique(combs[,1] * (10)^(as.integer(log10(combs[,2])) + 1L) + combs[,2])
    })
}

## test.vector defined above by OP
combPairFast(test.vector)
[[1]]
[1] 335261344015 335261537633 344015537633

[[2]]
[1] 22404132858

[[3]]
[1] 254654355860 254654488288 355860488288

[[4]]
[1] 219943373817

[[5]]
[1] 331839404477

## OP original code
combPairOP <- function(testVec) {
    lapply(strsplit(testVec, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))
}
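For reference, the per-element steps of the OP's version can be unpacked as in the sketch below. The row-wise paste0 line is only an illustrative equivalent in base R, not part of the original answer; the variable names are made up for this example.

```r
# Unpack the OP's per-element steps for a single input string.
x <- strsplit("335261 344015 537633", " ", fixed = TRUE)[[1]]

# combn() returns a 2 x choose(n, 2) character matrix: one pair per column.
pairs <- combn(x, 2)

# The OP's approach: paste each column together, one function call per pair.
viaApply <- apply(pairs, 2, function(y) paste0(y, collapse = ""))

# Illustrative alternative: paste0() is vectorized, so the two rows can be
# joined in a single call.
viaRows <- paste0(pairs[1, ], pairs[2, ])

identical(viaApply, viaRows)
```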

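As the OP stated in the comments, the largest number is below one million (600000 to be exact). This means that after multiplying one number by at most 10^6 and adding another number of at most 6 digits (which is equivalent to simply concatenating the two digit strings), the result stays below 10^12, well within the range in which base R represents integers exactly as doubles (i.e. up to 2^53 - 1). This is good because arithmetic on numbers is much more efficient than string operations.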
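The arithmetic described above can be checked in isolation. The following is a minimal sketch in base R; numConcat is a hypothetical helper name introduced here for illustration, and it assumes both inputs are positive whole numbers (log10 is not usable for a zero).

```r
# Hypothetical helper: concatenate two positive whole numbers arithmetically,
# mirroring the expression used inside combPairFast.
numConcat <- function(a, b) {
    digitsB <- as.integer(log10(b)) + 1L   # number of decimal digits in b
    a * 10^digitsB + b                     # shift a left by that many digits
}

a <- c(335261, 22404, 219943)
b <- c(344015, 132858, 373817)

viaArithmetic <- numConcat(a, b)
viaPaste <- as.numeric(paste0(a, b))

# Both routes give the same values, and every value is far below 2^53.
identical(viaArithmetic, viaPaste)
max(viaArithmetic) < 2^53
```

If the inputs could exceed roughly seven or eight digits each, the concatenated value might pass 2^53 and lose exactness, in which case a string-based approach would be the safer choice.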

All that remains is to benchmark:

test.vector <- makeTestSet(100, 50)

microbenchmark(combPairOP(test.vector), 
               combPairFast(test.vector),
               times = 20, unit = "relative")
Unit: relative
                     expr      min      lq     mean   median     uq      max neval
  combPairOP(test.vector) 22.33991 22.4264 21.67291 22.11017 21.729 25.23342    20
combPairFast(test.vector)  1.00000  1.0000  1.00000  1.00000  1.000  1.00000    20

On a larger vector:

bigTest.vector <- makeTestSet(1000, 100, mySeed = 22, withRep = TRUE)

## Duplicate values exist
any(sapply(strsplit(bigTest.vector, " ", fixed = TRUE), function(x) {
    any(duplicated(x))
}))
[1] TRUE

system.time(t1 <- combPairFast(bigTest.vector))
 user  system elapsed 
0.303   0.011   0.314 

system.time(t2 <- combPairOP(bigTest.vector))
 user  system elapsed 
8.820   0.081   8.902    ### 8.902 / 0.314 ~= 28x faster

## results are the same
all.equal(t1, lapply(t2, as.numeric))
[1] TRUE

Optimizing the computation of combinations into a list - large data set
