Question

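I have a function for handling NAs in R, but it can take a while to run on larger data sets. I'm curious whether anyone has suggestions for improving its performance. The gist: if a column is numeric and contains NAs, the NAs are replaced with -1000 and a flag column is added; if the column is a factor with NAs, addNA() is called on it.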
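To make the rule above concrete, here is a minimal sketch on two toy vectors. The names `num`, `num_isNA`, and `fac` are made up for illustration and are not part of the data set or function that follow.

```r
# Toy vectors for illustration only (not the df1 used in this post)
num <- c(0.2, NA, 0.7)
fac <- factor(c("a", NA, "b"))

# Numeric column: record where the NAs were, then overwrite them with the sentinel
num_isNA <- as.numeric(is.na(num))
num[is.na(num)] <- -1000

# Factor column: addNA() turns NA into an explicit factor level
fac <- addNA(fac)

print(num)
print(num_isNA)
print(levels(fac))
```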

Here is a data set:

set.seed(456)  # set the seed before any random draw so the whole data set is reproducible
df1 <- data.frame(id = 1:20, target = round(runif(20),0), col1 = runif(20), col2 = runif(20), col3 = factor(letters[1:20]))
df1[sample(1:nrow(df1),5),'col1'] <- NA
df1[sample(1:nrow(df1),5),'col2'] <- NA
df1[sample(1:nrow(df1),5),'col3'] <- NA

Here is the function and how it is used:

varsToUse <- c('col1','col2','col3')


fixNa <- function(x, varList){

        for(i in 1:length(varList)){  # i = 1
          colNam1 <- varList[i]
          if(class(x[,colNam1]) %in% c('numeric','integer')){
            newColName <- paste(colNam1,'_isNA',sep='')
            x[,newColName] <- ifelse(is.na(x[,colNam1]), 1, 0)
            x[,colNam1] <- ifelse(is.na(x[,colNam1]), -1000, x[,colNam1])
            varList <- c(varList, newColName)
            print(i);flush.console()
          }

          if(class(x[,colNam1]) %in% c('factor')){
            x[,colNam1] <- addNA(x[,colNam1])
          }
        }
    return(x)
}

df1 <- fixNa(df1, varsToUse)

Any suggestions?

Answer 1

This is faster:

fixNa2 <- function(x, varlist){
   for(i in seq_along(varlist)){  # i = 1
     if(class(x[,varlist[i]]) %in% c('numeric','integer')){
       newColName <- paste(varlist[i],'_isNA',sep='')
       x[,newColName] <- as.numeric(is.na(x[,varlist[i]]))
       x[is.na(x[,varlist[i]]),varlist[i]] <- -1000
     }
     else if(class(x[,varlist[i]]) %in% c('factor')){
       x[,varlist[i]] <- addNA(x[,varlist[i]])
     }
   }
   return(x)
}

It drops the printing, the ifelse() calls, and a few other small things.
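As a rough check that the two replacements for ifelse() are behavior-preserving, the sketch below compares both forms on a single vector. The vector `v`, its size, and the seed are arbitrary assumptions chosen for illustration.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
v <- runif(1e5)
v[sample(length(v), 1e4)] <- NA   # knock out 10% of the values

# Form used in the question: ifelse() evaluates and assembles a full new vector
filled_ifelse <- ifelse(is.na(v), -1000, v)
flag_ifelse   <- ifelse(is.na(v), 1, 0)

# Form used in fixNa2: assign only at the NA positions, coerce the logical flag
filled_index <- v
filled_index[is.na(filled_index)] <- -1000
flag_index <- as.numeric(is.na(v))

# Both comparisons should hold if the two forms are interchangeable here
print(identical(filled_ifelse, filled_index))
print(identical(flag_ifelse, flag_index))
```

Wrapping the two forms in microbenchmark() on a vector of your own size is an easy way to see how much of the overall difference comes from ifelse() alone.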

Benchmark:

> library(microbenchmark)
> microbenchmark(fixNa(df1, varsToUse),fixNa2(df1, varsToUse))
Unit: microseconds
                   expr      min        lq   median        uq       max neval
  fixNa(df1, varsToUse) 8505.560 9893.6455 9990.829 10135.546 12557.622   100
 fixNa2(df1, varsToUse)  909.868  970.8715 1013.594  1062.474  4490.446   100
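One further point, offered as a sketch rather than a benchmarked improvement: class() can return more than one string (an ordered factor has class c("ordered", "factor")), and a length-2 condition inside if() is an error in R 4.2 and later. The hypothetical variant below, named `fixNa3` here purely for illustration, uses is.numeric() and is.factor() instead and reads each column once with `[[`. Its speed relative to fixNa2 has not been measured.

```r
# Hypothetical variant (not from the original answer); same rule as fixNa2
fixNa3 <- function(x, varlist) {
  for (v in varlist) {
    col <- x[[v]]                               # extract the column once
    if (is.numeric(col)) {                      # TRUE for both double and integer
      isNa <- is.na(col)
      x[[paste0(v, "_isNA")]] <- as.numeric(isNa)  # flag column
      col[isNa] <- -1000                           # sentinel value
      x[[v]] <- col
    } else if (is.factor(col)) {                # also TRUE for ordered factors
      x[[v]] <- addNA(col)
    }
  }
  x
}

# Toy data for illustration only
toy <- data.frame(a = c(1, NA, 3), b = factor(c("x", NA, "y")))
res <- fixNa3(toy, c("a", "b"))
str(res)
```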

R: Improving the performance of an NA-handling function