Question

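I have a very long data frame (millions of rows, a few columns). To run a fixed-effects regression, I want to declare the categorical variables as factors using the factor function, but this is very slow. I am looking for a way to speed it up.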

My code is as follows:

library(lfe)
my_data=read.csv("path_to//data.csv")
attach(data.frame(my_data))

The following line is the very slow one:

my_data$col <- factor(my_data$col)

Answer 1

If you know the levels of the factor you want to create, passing them explicitly speeds things up considerably. Observe:

library(microbenchmark)
set.seed(237)
test <- sample(letters, 10^7, replace = TRUE)
microbenchmark(noLevels = factor(test),
               withLevels = factor(test, levels = letters), times = 20)
Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval cld
   noLevels 523.6078 545.3156 653.4833 696.4768 715.9026 862.2155    20   b
 withLevels 248.6904 270.3233 325.0762 291.6915 345.7774 534.2473    20  a

To get the levels in the OP's case, we simply call unique:

myLevels <- unique(my_data$col)
my_data$col <- factor(my_data$col, levels = myLevels)

There is also an Rcpp offering by Kevin Ushey (Fast factor generation with Rcpp). I modified that code slightly, on the assumption that one knows the levels a priori. In the benchmark below, the function from the referenced site is RcppNoLevs and the modified Rcpp function is RcppWithLevs.

The modified Rcpp function assumes the levels are passed as an argument (fast_factor_Levs(test, letters) in the call below):

microbenchmark(noLevels = factor(test),
               withLevels = factor(test, levels = letters),
               RcppNoLevs = fast_factor(test),
               RcppWithLevs = fast_factor_Levs(test, letters), times = 20)
Unit: milliseconds
        expr      min       lq     mean   median       uq       max neval  cld
    noLevels 571.5482 609.6640 672.1249 645.4434 704.4402 1032.7595    20    d
  withLevels 275.0570 294.5768 318.7556 309.2982 342.8374  383.8741    20   c 
  RcppNoLevs 189.5656 203.3362 213.2624 206.9281 215.6863  292.8997    20  b  
RcppWithLevs 105.7902 111.8863 120.0000 117.9411 122.8043  173.8130    20 a
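
As a small sanity check on the approach above, here is an R sketch on toy data (it is not part of the original benchmark). It illustrates two points: factor(x, levels = unique(x)) encodes the same values as factor(x) but orders the levels by first appearance rather than sorted, and a factor with known levels can be built from match() plus attributes, which is roughly the idea behind the levels-given variant. The helper name factor_from_levels is hypothetical, and the sketch assumes a character vector with no NA values.

```r
# Toy data; stands in for my_data$col (assumed: character, no NAs).
x <- c("b", "a", "c", "a", "b")

f_default <- factor(x)                      # levels are sorted
f_unique  <- factor(x, levels = unique(x))  # levels in order of first appearance

# Same values, different level order (and thus different integer codes).
same_values    <- identical(as.character(f_default), as.character(f_unique))
levels_default <- levels(f_default)
levels_unique  <- levels(f_unique)

# Hypothetical helper: build a factor directly from known levels.
factor_from_levels <- function(x, levs) {
  out <- match(x, levs)                      # integer codes; NA where x is not in levs
  attr(out, "levels") <- as.character(levs)  # attach the level labels
  class(out) <- "factor"                     # mark the integer vector as a factor
  out
}

f_manual <- factor_from_levels(x, unique(x))
```

If the level order matters (for example, for the reference category in a regression), passing sort(unique(x)) instead should reproduce the default sorted ordering.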

R factor function is slow on long data frames
