Question

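I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to the data.frame and accumulates something (a simple operation). The data.frame has roughly 850K rows. My PC is still working (about 10 hours now) and I have no idea what the total runtime will be.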

dayloop2 <- function(temp){
    for (i in 1:nrow(temp)){    
        temp[i,10] <- i
        if (i > 1) {             
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                temp[i,10] <- temp[i,9] + temp[i-1,10]                    
            } else {
                temp[i,10] <- temp[i,9]                                    
            }
        } else {
            temp[i,10] <- temp[i,9]
        }
    }
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}

Any ideas how to speed up this operation?

Answer 1

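The biggest problem, and the root of the inefficiency, is indexing the data.frame: I mean all the lines where you use temp[,].
Try to avoid this as much as possible. I took your function, changed the indexing, and here is version_A: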

dayloop2_A <- function(temp){
    res <- numeric(nrow(temp))
    for (i in 1:nrow(temp)){    
        res[i] <- i
        if (i > 1) {             
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                res[i] <- temp[i,9] + res[i-1]                   
            } else {
                res[i] <- temp[i,9]                                    
            }
        } else {
            res[i] <- temp[i,9]
        }
    }
    temp$`Kumm.` <- res
    return(temp)
}

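As you can see, I create a vector res that gathers the results. At the end I add it to the data.frame, so I don't need to mess with names. So how much better is it?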

I ran each function on a data.frame with nrow from 1,000 to 10,000 in steps of 1,000 and measured the time with system.time:

X <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
system.time(dayloop2(X))

The result is:

[Figure: performance]

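You can see that the running time of your version appears to grow exponentially with nrow(X). The modified version has a linear relation, and a simple lm model predicts that the computation for 850,000 rows takes 6 minutes and 10 seconds.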

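The power of vectorization

As Shane and Calimo state in their answers, vectorization is the key to better performance. From your code you can move the following out of the loop:

the conditioning
the initialization of the results (which are temp[i,9])

This leads to this code: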

dayloop2_B <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in 1:nrow(temp)) {
        if (cond[i]) res[i] <- temp[i,9] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}

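Comparing the results for these functions, this time for nrow from 10,000 to 100,000 in steps of 10,000:

[Figure: performance]

Tuning the tuned

Another tweak is to change the indexing inside the loop from temp[i,9] to res[i] (they are exactly the same in the i-th iteration). Again, this is the difference between indexing a vector and indexing a data.frame. Second, when you look at the loop you can see that there is no need to loop over all i, only over the ones that satisfy the condition. So here we go: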

dayloop2_D <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in (1:nrow(temp))[cond]) {
        res[i] <- res[i] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}

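How much performance you gain depends heavily on the data structure, or more precisely on the percentage of TRUE values in the condition. For my simulated data, the computation for 850,000 rows takes less than one second.

[Figure: performance]

If you want, you can go further. I see at least two things that can be done:

write C code to do the conditional cumsum

if you know that the longest sequence in your data is not large, you can change the loop to a vectorized while, something like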

n <- length(cond)   # number of rows
while (any(cond)) {
    # rows where cond is TRUE and the previous row is already final
    indx <- c(FALSE, cond[-1] & !cond[-n])
    res[indx] <- res[indx] + res[which(indx)-1]
    cond[indx] <- FALSE
}
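
A minimal sketch of how the while loop above might be wrapped into a complete function and checked against a plain loop. The names dayloop2_E and ref_loop and the test data are hypothetical, and n is assumed to mean the number of rows.

```r
# Hypothetical wrapper (dayloop2_E is not part of the original answer).
dayloop2_E <- function(temp) {
    n <- nrow(temp)                      # assumed meaning of `n`
    cond <- c(FALSE, (temp[-n, 6] == temp[-1, 6]) & (temp[-n, 3] == temp[-1, 3]))
    res <- temp[, 9]
    while (any(cond)) {
        # rows whose condition holds but whose predecessor is already final
        indx <- c(FALSE, cond[-1] & !cond[-n])
        res[indx] <- res[indx] + res[which(indx) - 1]
        cond[indx] <- FALSE
    }
    temp$`Kumm.` <- res
    temp
}

# Reference implementation: row-by-row loop, used only to check the result
ref_loop <- function(temp) {
    res <- temp[, 9]
    for (i in seq_len(nrow(temp))[-1]) {
        if (temp[i, 6] == temp[i - 1, 6] && temp[i, 3] == temp[i - 1, 3])
            res[i] <- res[i] + res[i - 1]
    }
    res
}

set.seed(1)
X <- as.data.frame(matrix(sample(1:3, 200 * 9, TRUE), 200, 9))
all.equal(dayloop2_E(X)$`Kumm.`, ref_loop(X))
```

The number of passes through the while loop equals the length of the longest run of consecutive TRUE values in cond, which is why this approach only pays off when runs are short.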

The code used for the simulations and figures is available on GitHub.

Answer 2

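General strategies for speeding up R code

First, figure out where the slow part really is. There is no need to optimize code that isn't running slowly. For small amounts of code, simply thinking it through can work. If that fails, Rprof and similar profiling tools can be helpful.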
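
As a rough illustration of the profiling step, here is a small sketch using Rprof() and summaryRprof() from base R. The function slow_fn is a made-up example of deliberately slow code, and the sampling interval is an arbitrary choice.

```r
# Hypothetical slow function: grows a vector one element at a time
slow_fn <- function(n) {
    x <- NULL
    for (i in seq_len(n)) x <- c(x, i)
    x
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)   # start sampling the call stack
invisible(slow_fn(2e4))
Rprof(NULL)                         # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)                  # functions ranked by time spent in them
```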

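Once you have found the bottleneck, think about more efficient algorithms for doing what you want. Calculations should be run only once if possible, so:

Store the results and access them rather than recalculating repeatedly
Take non-loop-dependent calculations out of loops
Avoid calculations that aren't necessary (e.g. don't use regular expressions when fixed searches will do)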
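
The points above can be sketched as follows. The data and the function scale_by_mean are invented for illustration only.

```r
x <- c("alpha.beta", "gamma", "delta.epsilon")

# Fixed-string search: no regular-expression matching needed
has_dot <- grepl(".", x, fixed = TRUE)

# Hoist loop-invariant work out of the loop and reuse the stored result
scale_by_mean <- function(v) {
    m <- mean(v)                 # computed once, not once per iteration
    out <- numeric(length(v))
    for (i in seq_along(v)) out[i] <- v[i] / m
    out
}
scaled <- scale_by_mean(c(2, 4, 6))
```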

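Using more efficient functions can produce moderate or large speed gains. For instance, paste0 produces a small efficiency gain, but .colSums() and its relatives produce somewhat more pronounced gains. mean is particularly slow.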
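
A small sketch of what these substitutions look like, assuming a plain numeric matrix and vector without missing values. Whether the difference is measurable depends on the data and the R version.

```r
m <- matrix(as.numeric(1:12), nrow = 3, ncol = 4)

cs_checked <- colSums(m)                     # checks and prepares its input
cs_bare    <- .colSums(m, nrow(m), ncol(m))  # caller supplies the dimensions

# mean() is a generic; for a plain numeric vector without NAs,
# sum(x) / length(x) gives the same value
x <- c(1.5, 2.5, 4)
m1 <- mean(x)
m2 <- sum(x) / length(x)
```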

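Then you can avoid some particularly common troubles:

cbind will slow you down really quickly.
Initialize your data structures, then fill them in, rather than expanding them each time.
Even with pre-allocation, you could switch to a pass-by-reference approach rather than a pass-by-value approach, but it may not be worth the hassle.
Take a look at the R Inferno for more pitfalls to avoid.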
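
To illustrate the cbind and pre-allocation points, here is a minimal sketch with invented data. Both loops build the same matrix, but the first copies the whole matrix on every iteration.

```r
n <- 200

# Growing: every cbind() call copies the matrix built so far
grow <- NULL
for (j in seq_len(n)) grow <- cbind(grow, j * c(1, 2, 3))

# Pre-allocated: one allocation, then fill column by column
pre <- matrix(0, nrow = 3, ncol = n)
for (j in seq_len(n)) pre[, j] <- j * c(1, 2, 3)

same <- all.equal(unname(grow), pre)
```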

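Try for better vectorization, which often but not always helps. In this regard, inherently vectorized commands like ifelse, diff and the like will provide more improvement than the apply family of commands (which provide little to no speed boost over a well-written loop).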
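
For example, the two loops below can each be replaced by a single vectorized call. The vector x is made up for illustration.

```r
x <- c(3, 7, 7, 2, 9)

# Loop versions: label each element and take successive differences
lab_loop <- character(length(x))
d_loop   <- numeric(length(x) - 1)
for (i in seq_along(x)) lab_loop[i] <- if (x[i] > 5) "high" else "low"
for (i in seq_len(length(x) - 1)) d_loop[i] <- x[i + 1] - x[i]

# Vectorized equivalents
lab_vec <- ifelse(x > 5, "high", "low")
d_vec   <- diff(x)
```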

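You can also try to provide more information to R functions. For instance, use vapply rather than sapply, and specify colClasses when reading in text-based data. The speed gain will vary depending on how much guessing you eliminate.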
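
A short sketch of both suggestions. The list, the temporary CSV file and its column layout are invented for the example.

```r
lst <- list(a = 1:3, b = 4:6)

# vapply() is told the type and length of each result up front
means <- vapply(lst, function(v) mean(v), FUN.VALUE = numeric(1))

# Declaring column types spares read.csv() from inferring them
csv <- tempfile(fileext = ".csv")
writeLines(c("id,score,label", "1,2.5,x", "2,3.5,y"), csv)
dat <- read.csv(csv, colClasses = c("integer", "numeric", "character"))
str(dat)
```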

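Next, consider optimized packages: the data.table package can produce massive speed gains where its use is possible, both in data manipulation and in reading large amounts of data (fread).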

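Next, try for speed gains through more efficient means of calling R:

Compile your R script. Or use the Ra and jit packages in concert for just-in-time compilation (Dirk has an example in this presentation).
Make sure you are using an optimized BLAS. These provide across-the-board speed gains. Honestly, it's a shame that R doesn't automatically use the most efficient library on install. Hopefully Revolution R will contribute the work they have done here back to the overall community.
Radford Neal has done a lot of optimizations, some of which were adopted into R Core, and many others which were forked off into pqR.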
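
One way to byte-compile a single function is cmpfun() from the compiler package that ships with R. This is a sketch with an invented function cum_loop. Recent R versions byte-compile functions automatically, so the gain from an explicit call may be small.

```r
library(compiler)   # ships with R

# Hypothetical loop-heavy function: running total of a vector
cum_loop <- function(x) {
    out <- numeric(length(x))
    acc <- 0
    for (i in seq_along(x)) {
        acc <- acc + x[i]
        out[i] <- acc
    }
    out
}

cum_loop_c <- cmpfun(cum_loop)   # byte-compiled copy of the same function

x <- runif(1e5)
same <- all.equal(cum_loop_c(x), cumsum(x))
```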

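And lastly, if all of the above still doesn't get you quite as fast as you need, you may need to move to a faster language for the slow code snippet. The combination of Rcpp and inline makes replacing only the slowest part of the algorithm with C++ code particularly easy. Here, for instance, is my first attempt at doing so, and it blows away even highly optimized R solutions.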

If you are still left with troubles after all this, you just need more computing power. Look into parallelization (http://cran.r-project.org/web/views/HighPerformanceComputing.html) or even GPU-based solutions (gpu-tools).
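
As a minimal sketch of parallelization, the parallel package that ships with R can spread independent pieces of work over several cores. The function work and the choice of two cores are assumptions for illustration. This only helps when the pieces do not depend on each other, which is not the case for a running total like the one in the question.

```r
library(parallel)   # ships with R

# Hypothetical stand-in for an expensive, independent piece of work
work <- function(k) sum(sqrt(seq_len(k)))

# mclapply() forks worker processes; forking is unavailable on Windows,
# so fall back to a single core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L
par_res <- mclapply(c(1e4, 2e4, 3e4), work, mc.cores = cores)
seq_res <- lapply(c(1e4, 2e4, 3e4), work)

identical(par_res, seq_res)
```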

Links to other guidance

http://www.noamross.net/blog/2013/4/25/faster-talk.html

Answer 3

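If you are using for loops, you are most likely coding R as if it were C or Java or something else. R code that is properly vectorized is extremely fast.

Take for example these two simple bits of code that generate a list of 100,000 integers in sequence:

The first code example is how one would code a loop using a traditional coding paradigm. It takes 28 seconds to complete: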

system.time({
    a <- NULL
    for(i in 1:1e5)a[i] <- i
})
   user  system elapsed 
  28.36    0.07   28.61

You can get an almost 100-fold improvement by the simple action of pre-allocating memory:

system.time({
    a <- rep(1, 1e5)
    for(i in 1:1e5)a[i] <- i
})

   user  system elapsed 
   0.30    0.00    0.29

But with the base R vector operation using the colon operator :, this operation is virtually instantaneous:

system.time(a <- 1:1e5)

   user  system elapsed 
      0       0       0

Answer 4

This could be made much faster by skipping the loops and using indexes or nested ifelse() statements.

idx <- 1:nrow(temp)
temp[,10] <- idx
idx1 <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
temp[idx1,10] <- temp[idx1,9] + temp[which(idx1)-1,10] 
temp[!idx1,10] <- temp[!idx1,9]    
temp[1,10] <- temp[1,9]
names(temp)[names(temp) == "V10"] <- "Kumm."
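
Note that the indexed assignment above reads column 10 before it has been updated, so it may not reproduce the running total of the original loop when several consecutive rows satisfy the condition. A loop-free alternative, sketched here under the assumption that the goal is a running total of column 9 that restarts whenever column 6 or column 3 changes, is a grouped cumsum. The function name dayloop2_vec and the sample data are hypothetical.

```r
# Hypothetical helper: running total of column 9 that restarts whenever
# column 6 or column 3 differs from the previous row
dayloop2_vec <- function(temp) {
    n <- nrow(temp)
    cond <- c(FALSE, (temp[-n, 6] == temp[-1, 6]) & (temp[-n, 3] == temp[-1, 3]))
    grp <- cumsum(!cond)                       # new group id at every restart
    temp$`Kumm.` <- ave(temp[, 9], grp, FUN = cumsum)
    temp
}

temp <- data.frame(matrix(0, 5, 9))
temp[, 3] <- c(1, 1, 1, 2, 2)
temp[, 6] <- c(7, 7, 7, 7, 7)
temp[, 9] <- c(10, 20, 30, 40, 50)
dayloop2_vec(temp)$`Kumm.`   # 10 30 60 40 90
```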

Answer 5

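As Ari mentioned at the end of his answer, the Rcpp and inline packages make it incredibly easy to make things fast. As an example, try this inline code (warning: not tested):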

body <- 'Rcpp::NumericMatrix nm(temp);
         int nrtemp = Rcpp::as<int>(nrt);
         // column indices are zero-based here: nm(i, 9) is column 10 in R
         for (int i = 0; i < nrtemp; ++i) {
             nm(i, 9) = i;
             if (i > 0) {
                 if ((nm(i, 5) == nm(i - 1, 5)) && (nm(i, 2) == nm(i - 1, 2))) {
                     nm(i, 9) = nm(i, 8) + nm(i - 1, 9);
                 } else {
                     nm(i, 9) = nm(i, 8);
                 }
             } else {
                 nm(i, 9) = nm(i, 8);
             }
         }
         return Rcpp::wrap(nm);
        '

library(inline)   # provides getPlugin() and cxxfunction()

settings <- getPlugin("Rcpp")
# settings$env$PKG_CXXFLAGS <- paste("-I", getwd(), sep="") if you want to inc files in wd
dayloop <- cxxfunction(signature(nrt="numeric", temp="numeric"), body=body,
    plugin="Rcpp", settings=settings, cppargs="-I/usr/include")

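dayloop2 <- function(temp) {
    # extract a numeric matrix from temp, put it in tmp
    # (assumes columns 1-9 are numeric; a 10th column is added for the result)
    tmp <- cbind(as.matrix(temp[, 1:9]), 0)
    storage.mode(tmp) <- "double"
    nm <- dayloop(nrow(tmp), tmp)
    temp$`Kumm.` <- nm[, 10]
    return(temp)
}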

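A similar procedure is used for #include-ing things, where you just pass a parameter

inc <- '#include <header.h>'

to cxxfunction, as includes=inc. What's really cool about this is that it does all of the linking and compiling for you, so prototyping is really fast.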

Disclaimer: I'm not totally sure that the class of tmp should be numeric and not numeric matrix or something else. But I'm mostly sure.

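Edit: if you still need more speed after this, OpenMP is a parallelization facility that works well with C++. I haven't tried using it from inline, but it should work. The idea would be, in the case of n cores, to have loop iteration k carried out by core k % n. A suitable introduction is found in Matloff's The Art of R Programming, available here, in chapter 16, Resorting to C.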

Answer 6

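I dislike rewriting code... Of course ifelse and lapply are better options, but sometimes it is difficult to make them fit.

Frequently I use data.frames as one would use lists, such as df$var[i].

Here's a made-up example: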

nrow=function(x){ ##required as I use nrow at times.
  if(class(x)=='list') {
    length(x[[names(x)[1]]])
  }else{
    base::nrow(x)
  }
}

system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
})

system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  d=as.list(d) #become a list
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
  d=as.data.frame(d) #revert back to data.frame
})

data.frame version:

   user  system elapsed 
   0.53    0.00    0.53

list version:

   user  system elapsed 
   0.04    0.00    0.03

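Using a list of vectors is 17 times faster than using a data.frame.

Any comments on why data.frames are internally so slow in this regard? One would think they operate like lists...

For even faster code, use class(d)='list' instead of d=as.list(d), and class(d)='data.frame' to convert back: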

system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  class(d)='list'
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
  class(d)='data.frame'
})
head(d)
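
A related variation, sketched here as an assumption rather than as part of the original answer, is to pull the columns out as plain vectors, loop over those, and assign the result back to the data.frame once at the end.

```r
d <- data.frame(seq = rep(1:5, length.out = 10000), r = rnorm(10000))

# Work on plain vectors inside the loop ...
seq_v <- d$seq
r_v   <- d$r
foo   <- numeric(nrow(d))
mark  <- NA_real_
for (i in seq_along(seq_v)) {
    if (seq_v[i] == 1) mark <- r_v[i]
    foo[i] <- mark
}
# ... and touch the data.frame only once, at the end
d$foo <- foo
head(d)
```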

Answer 7

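The answers here are great. One minor aspect not covered is that the question states "My PC is still working (about 10 hours now) and I have no idea about the runtime". I always put the following code into loops during development, to get a feel for how changes affect the speed and to monitor how long it will take to complete.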

dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    cat(round(i/nrow(temp)*100,2),"%    \r") # prints the percentage complete in realtime.
    # do stuff
  }
  return(blah)
}

This works with lapply as well.

dayloop2 <- function(temp){
  temp <- lapply(1:nrow(temp), function(i) {
    cat(round(i/nrow(temp)*100,2),"%    \r")
    #do stuff
  })
  return(temp)
}

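If the function within the loop is quite fast but the number of iterations is large, consider printing only every so often, as printing to the console itself has an overhead, e.g.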

dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    if(i %% 100 == 0) cat(round(i/nrow(temp)*100,2),"%    \r") # prints every 100 times through the loop
    # do stuff
  }
  return(temp)
}
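
Base R also ships a ready-made progress bar in the utils package. The following is a minimal sketch in which the loop body is a stand-in for the real work and the update frequency of 100 is an arbitrary choice.

```r
n <- 1000
pb <- txtProgressBar(min = 0, max = n, style = 3)   # from the utils package
total <- 0
for (i in seq_len(n)) {
    total <- total + i                        # stand-in for the real work
    if (i %% 100 == 0) setTxtProgressBar(pb, i)   # update every 100 iterations
}
close(pb)
```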

Answer 8

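In R, you can often speed up loop processing by using the apply family of functions (in your case, it would probably be replicate). Have a look at the plyr package, which provides progress bars.

Another option is to avoid loops altogether and replace them with vectorized arithmetic. I'm not sure exactly what you are doing, but you can probably apply your function to all rows at once: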

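n <- nrow(temp)
temp[2:n, 10] <- temp[2:n, 9] + temp[1:(n-1), 10]

This will be much faster, and then you can filter the rows with your condition:

cond.i <- c(FALSE, (temp[2:n, 6] == temp[1:(n-1), 6]) & (temp[2:n, 3] == temp[1:(n-1), 3]))
temp[!cond.i, 10] <- temp[!cond.i, 9]   # rows that fail the condition keep only their own value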

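Vectorized arithmetic requires more time and thought about the problem, but it can sometimes save several orders of magnitude in execution time.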

Answer 9

Take a look at the accumulate() function from {purrr}:

library(dplyr)   # %>%, as_tibble(), mutate(), lag(), select()

dayloop_accumulate <- function(temp) {
  temp %>%
    as_tibble() %>%
     mutate(cond = c(FALSE, (V6 == lag(V6) & V3 == lag(V3))[-1])) %>%
    mutate(V10 = V9 %>% 
             purrr::accumulate2(.y = cond[-1], .f = function(.i_1, .i, .y) {
               if(.y) {
                 .i_1 + .i
               } else {
                 .i
               }
             }) %>% unlist()) %>%
    select(-cond)
}

Answer 10

Processing with data.table is a viable option:

n <- 1000000
df <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
colnames(df) <- paste("col", 1:9, sep = "")

library(data.table)

dayloop2.dt <- function(df) {
  dt <- data.table(df)
  dt[, Kumm. := {
    res <- .I;
    ifelse (res > 1,             
      ifelse ((col6 == shift(col6, fill = 0)) & (col3 == shift(col3, fill = 0)) , 
        res <- col9 + shift(res)                   
      , # else
        res <- col9                                 
      )
     , # else
      res <- col9
    )
  }
  ,]
  res <- data.frame(dt)
  return (res)
}

res <- dayloop2.dt(df)

library(microbenchmark)
m <- microbenchmark(dayloop2.dt(df), times = 10)
#Unit: milliseconds
#       expr      min        lq     mean   median       uq      max neval
#dayloop2.dt(df) 436.4467 441.02076 578.7126 503.9874 575.9534 966.1042    10

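If you ignore the possible gains from filtering on the condition, it is very fast. Obviously, if you can do the calculation on a subset of the data, that helps.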

Speed up the loop operation in R
