Question

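I have a data.table with 37,000 rows and 27,000 columns. I want to preprocess the data and scale every column before using it for a prediction task.

I am using the approach mentioned in this post, but it runs very slowly and even crashes RStudio. The code is included below for reference. Is there a faster way to scale all the columns of a large data.table?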

scale.cols <- colnames(DT)
DT[, (scale.cols) := lapply(.SD, scale), .SDcols = scale.cols]

Answer 1

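Assuming you can have the data in matrix format in the first place (data.table is not going to be fast with this many columns), one possibility is to use RcppArmadillo.

scale.cpp: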

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;

// Center and scale each column of Z: subtract the column mean and
// divide by the column standard deviation.
// [[Rcpp::export]]
arma::mat armaScale(arma::mat Z) {
    unsigned int j, n = Z.n_rows, k = Z.n_cols;
    double avg, sd;
    arma::colvec z;
    arma::mat res = arma::zeros(n, k);

    for (j=0; j<k; j++) {
        z = Z.col(j);
        avg = arma::mean(z);
        // arma::stddev normalizes by n - 1 by default, like R's sd()
        sd = arma::stddev(z);
        res.col(j) = (z - avg) / sd;
    }

    return res;
}
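
For readers who want to see the per-column arithmetic without the Armadillo types, here is a minimal standalone sketch of the same computation. It is not part of the original answer: the name scaleColumns and the flat column-major std::vector layout are assumptions made for illustration, and it assumes at least two rows per column.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original answer): standardizes each
// column of an n x k matrix stored in column-major order, the layout that
// R and Armadillo both use. Assumes n >= 2. A constant column produces
// NaN (0 / 0), which is also what R's scale() returns in that case.
std::vector<double> scaleColumns(const std::vector<double>& x,
                                 std::size_t n, std::size_t k) {
    std::vector<double> res(x.size());
    for (std::size_t j = 0; j < k; ++j) {
        const double* col = x.data() + j * n;

        // Column mean.
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) sum += col[i];
        const double avg = sum / static_cast<double>(n);

        // Sample standard deviation (divisor n - 1).
        double ss = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            const double d = col[i] - avg;
            ss += d * d;
        }
        const double sd = std::sqrt(ss / static_cast<double>(n - 1));

        // Write the standardized column into the result.
        for (std::size_t i = 0; i < n; ++i)
            res[j * n + i] = (col[i] - avg) / sd;
    }
    return res;
}
```

For example, a 3 x 2 matrix with columns (1, 2, 3) and (10, 20, 30) is mapped to two identical columns (-1, 0, 1), since each column has its mean removed and is divided by its own sample standard deviation.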

R code:

set.seed(0L)
#using a smaller dataset
s <- 2e3
nr <- 3*s
nc <- 2*s
mat <- matrix(rnorm(nr*nc), ncol=nc)

library(RcppArmadillo)
library(Rcpp)
sourceCpp("scale.cpp")

library(microbenchmark)
microbenchmark(armaScale(mat), scale(mat), times=3L)

Timings:

Unit: milliseconds
           expr       min        lq      mean    median        uq       max neval cld
 armaScale(mat)  272.4988  290.1339  303.5027  307.7689  319.0047  330.2404     3  a 
     scale(mat) 1290.9581 1400.7916 1445.8927 1510.6251 1523.3600 1536.0950     3   b
