提高R功能的效率和速度

时间:2013-10-02 17:45:38

标签: r for-loop

使用R时,我始终牢记:“尽可能避免使用循环”。但是,我现在陷入困境,我无法找到一种奇妙的方式来编码我需要的东西。

为了记录,经过几次评论,我上面的陈述不是正确的陈述,没有必要避免循环来提高效率。

我有两个字符串向量作为输入,让我们称之为ab - 它们只能包含字母"M""I"和{{1} }。

"D"

我想要的输出是:

a = c("M","I","D","D","M","M","M","M","M","M")
b = c("M","M","M","M","M","M","D","M","M")

以下函数为我提供了这样的输出:

d = c("M","I","D","D","M","M","M","M","I","M","M")

函数逻辑如下,只要my.function <- function(a, b) { nrow.df = length(a) + length(which(b=="D")) my.df = data.frame(a = rep(NA, nrow.df), b = rep(NA, nrow.df), d = rep(NA, nrow.df)) my.df$a[1:length(a)] = a my.df$b[1:length(b)] = b for (i in 1:nrow.df) { if(my.df$a[i] == "D") { my.df$d[i] = "D" my.df$b[(i+1):nrow.df] = my.df$b[i:(nrow.df-1)] } else if (my.df$b[i] == "D") { my.df$d[i] = "I" my.df$a[(i+1):nrow.df] = my.df$a[i:(nrow.df-1)] } else if (my.df$a[i] == "I") { my.df$d[i] = "I" } else if (my.df$b[i] == "I") { my.df$d[i] = "D" } else { my.df$d[i] = my.df$a[i] } } return(my.df$d) } > d = my.function(a,b) > d [1] "M" "I" "D" "D" "M" "M" "M" "M" "I" "M" "M" 中有"D",它就会a放入"D"并将向量d移位1,反之亦然,只要b中有"D",它就会b放入"I"并将d换成1.

接下来,如果a中有"I",而a中只有"D",则b放入"I",反之亦然,只要a中有"I"b中不是"D"a就会"D"。否则,d

这不是一个复杂的功能,但我正在努力使其高效率。我使用mclapply将这个函数应用了数百万次,因此快速实现这样的函数可以节省很多时间。

您推荐使用Rcpp吗?会更快吗?与Cpp数百万时间沟通R是否有任何减速,或者它只是与Rcpp一起自动?

4 个答案:

答案 0 :(得分:4)

我没有看到任何简单的方法来避免循环。但是,仍然有一种更有效的方法。问题是,每次遇到角色a时,您实际上正在移动bD,并且移动这样的向量是O(n)操作,因此正在运行这个循环的时间实际上是O(n^2)

您可以简化代码并获得更好的性能,如下所示:

f<-function(a,b){
 aSkipped<-0
 bSkipped<-0
 d<-rep(0,length(a)+sum(b=="D"))

 for(i in 1:length(d)){

    if(a[i-aSkipped] == "D") {
      d[i] = "D"
      bSkipped<-bSkipped+1
    } else if (b[i-bSkipped] == "D") {
      d[i] = "I"
      aSkipped<-aSkipped+1
    } else if (a[i-aSkipped] == "I") {
      d[i] = "I"
    } else if (b[i-bSkipped] == "I") {
      d[i] = "D"
    } else {
      d[i] = a[i-aSkipped]
    }
  }
  d
}

编辑。当输入变大时,您将真正看到大的性能改进。对于小字符串,而不是太多的“D”这个和Ananda Mahto的解决方案大约在同一时间运行:

> set.seed(123)
> a<-c(sample(c("M","I"),500,T))
> b<-c(sample(c("M","I"),500,T))
> a[sample(500,50)]<-"D"
> b[sample(500,50)]<-"D"
> microbenchmark(f(a,b),my.function.v(a,b))
Unit: milliseconds
                expr      min       lq   median       uq      max neval
             f(a, b) 4.259970 4.324046 4.368018 4.463925 9.694951   100
 my.function.v(a, b) 4.442873 4.497172 4.533196 4.639543 9.901044   100

但对于长度为50000且长度为5000“D”的字符串,差异很大:

> set.seed(123)
> a<-c(sample(c("M","I"),50000,T))
> b<-c(sample(c("M","I"),50000,T))
> a[sample(50000,5000)]<-"D"
> b[sample(50000,5000)]<-"D"
> system.time(f(a,b))
   user  system elapsed 
  0.460   0.000   0.463 
> system.time(my.function.v(a,b))
   user  system elapsed 
  7.056   0.008   7.077 

答案 1 :(得分:4)

根据我的评论,如果速度是一个问题,第1步是不要不必要地使用data.frame。这个答案没有解决循环问题(正如其他人已经说过的那样,如果正确完成,在R中使用循环没有任何问题。)

以下是您的函数的 非常轻微 修改版本,使用vector而不是data.frame来存储数据。

my.function.v <- function(a, b) {
  nrow.df = length(a) + length(which(b=="D"))
  A <- B <- D <- vector(length = nrow.df)
  A[1:length(a)] = a
  B[1:length(b)] = b
  for (i in 1:nrow.df)
  {
    if(A[i] == "D") {
      D[i] = "D"
      B[(i+1):nrow.df] = B[i:(nrow.df-1)]
    } else if (B[i] == "D") {
      D[i] = "I"
      A[(i+1):nrow.df] = A[i:(nrow.df-1)]
    } else if (A[i] == "I") {
      D[i] = "I"
    } else if (B[i] == "I") {
      D[i] = "D"
    } else {
      D[i] = A[i]
    }
  }
  return(D)
}

请注意以下速度的相对差异:

library(microbenchmark)
microbenchmark(my.function(a, b), my.function.v(a, b), f(a, b))
# Unit: microseconds
#                 expr      min        lq    median        uq      max neval
#    my.function(a, b) 1448.416 1490.8780 1511.3435 1547.3880 6674.332   100
#  my.function.v(a, b)  157.248  165.8725  171.6475  179.1865  324.722   100
#              f(a, b)  168.874  177.5455  184.8775  193.3455  416.551   100

可以看出,@ mrip的功能也比原来的功能好得多。

答案 2 :(得分:2)

好的,这是Rcpp解决方案,正如预期的那样,它大大超过了R解决方案:

rcppFun<-"
CharacterVector fcpp(CharacterVector a,CharacterVector b,int size){
int aSkipped = 0;
int bSkipped = 0;
int asize = a.size();
Rcpp::CharacterVector d(size);
for(int i=0; i<size; i++){
    if(i-aSkipped<asize && a[i-aSkipped][0] == 'D') {
      d[i] = \"D\";
      bSkipped++;
    } else if (b[i-bSkipped][0] == 'D') {
      d[i] = \"I\";
      aSkipped++;
    } else if (a[i-aSkipped][0] == 'I') {
      d[i] = \"I\";
    } else if (b[i-bSkipped][0] == 'I') {
      d[i] = \"D\";
    } else {
      d[i] = a[i-aSkipped];
    }
}
 return d;
}"
require("Rcpp")
fcpp<-cppFunction(rcppFun)

f3<-function(a,b){
  fcpp(a,b,as.integer(length(a)+sum(b=="D")))
}

警告:该功能根本不进行参数检查,因此如果您输入错误数据,则很容易出现seg错误。

如果你要打电话给我,Rcpp肯定是要走的路:

> with(ab(10),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
                expr     min       lq   median       uq     max neval
             f(a, b) 103.993 107.5155 108.6815 109.7455 178.801   100
            f3(a, b)   7.354   8.1305   8.5575   9.1220  18.014   100
            f2(a, b)  87.081  90.4150  92.2730  94.2585 146.502   100
 my.function.v(a, b)  84.389  86.5140  87.6090  88.8340 109.106   100
> with(ab(100),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
                expr     min        lq    median        uq      max neval
             f(a, b) 992.082 1018.9850 1032.0180 1071.0690 2784.710   100
            f3(a, b)  12.873   14.3605   14.7370   15.5095   35.582   100
            f2(a, b) 119.396  125.4405  129.3015  134.9915 1909.930   100
 my.function.v(a, b) 769.618  786.7865  802.2920  824.0820  905.737   100
> with(ab(1000),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
                expr      min        lq     median        uq       max neval
             f(a, b) 9816.295 10065.065 10233.1350 10392.696 12383.373   100
            f3(a, b)   66.057    67.869    83.9075    87.231  1167.086   100
            f2(a, b) 1637.972  1760.258  2667.6985  3138.229 47610.317   100
 my.function.v(a, b) 9692.885 10272.425 10997.2595 11402.602 54315.922   100
> with(ab(10000),microbenchmark(f(a,b),f3(a,b),f2(a,b)))
Unit: microseconds
     expr        min         lq      median          uq        max neval
  f(a, b) 101644.922 103311.678 105185.5955 108342.4960 144620.777   100
 f3(a, b)    607.702    610.039    669.8515    678.1845    785.415   100
 f2(a, b) 221305.641 247952.345 254478.1580 341195.5510 656408.378   100
> 

答案 3 :(得分:1)

只是为了展示它是如何完成的,它可以在R中没有循环的情况下完成;这是一种方式。当长度大约为1000或更小时更快但在更大时更慢。一个要点是,你肯定可以在Rcpp中加快速度。

f2 <- function(a,b) {
  da <- which(a=="D")
  db <- which(b=="D")
  dif <- outer(da, db, `<`) 
  da <- da + rowSums(!dif)
  db <- db + colSums(dif)
  ia <- which(a=="I")  
  ia <- ia + colSums(outer(db, ia, `<`))
  ib <- which(b=="I")
  ib <- ib + colSums(outer(da, ib, `<`))
  out <- rep("M", length(a) + length(db))
  out[da] <- "D"
  out[db] <- "I"
  out[ia] <- "I"
  out[ib] <- "D"
  out
}

用于生成数据

ab <- function(N) {
  set.seed(123)
  a<-c(sample(c("M","I"),N,TRUE))
  b<-c(sample(c("M","I"),N,TRUE))
  a[sample(N,N/10)]<-"D"
  b[sample(N,N/10)]<-"D"
  list(a=a,b=b)
}

时序:

> library(microbenchmark)
> with(ab(10), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: microseconds
                expr    min       lq   median       uq     max neval
 my.function.v(a, b) 79.102  86.9005  89.3680  93.2410 279.761   100
             f(a, b) 84.334  91.1055  94.1790  98.2645 215.579   100
            f2(a, b) 94.807 101.5405 105.1625 108.9745 226.149   100

> with(ab(100), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: microseconds
                expr     min       lq  median       uq      max neval
 my.function.v(a, b) 732.849 750.4480 762.906 845.0835 1953.371   100
             f(a, b) 789.380 805.8905 819.022 902.5865 1921.064   100
            f2(a, b) 124.442 129.1450 134.543 137.5910  237.498   100

> with(ab(1000), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: milliseconds
                expr       min        lq    median        uq      max neval
 my.function.v(a, b) 10.146865 10.387144 10.695895 11.123164 13.08263   100
             f(a, b)  7.776286  7.973918  8.266882  8.633563  9.98204   100
            f2(a, b)  1.322295  1.355601  1.385302  1.465469  1.85349   100

> with(ab(10000), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b), times=10))
Unit: milliseconds
                expr      min        lq    median        uq       max neval
 my.function.v(a, b) 429.4030 435.00373 439.06706 442.51650 465.00124    10
             f(a, b)  80.7709  83.71715  85.14887  88.02067  89.00047    10
            f2(a, b) 164.7807 170.37608 175.94281 247.78353 251.14653    10