Speeding up slow dplyr code

Time: 2015-05-03 07:45:37

Tags: r benchmarking dplyr

Using cut() to convert a continuous variable into a factor inside dplyr is very slow. On my real data (400,000 rows and 96 variables) it takes 58 seconds.

My data.frame looks like this:

library(ggplot2)
diamonds <- rbind(diamonds, diamonds, diamonds, diamonds, diamonds, diamonds, diamonds, diamonds)

My slow code is very similar to this:

library(dplyr)
mutate(diamonds, price.bands = cut(price, c(326, 1000, 10000, 19000), labels = c("low", "mid", "high"), include.lowest = TRUE))

Is there faster code I could use?

1 answer:

Answer 0 (score: 1)

It doesn't seem slow on my computer:

> system.time({
+ x <- mutate(dia, price.bands = cut(price, c(326, 1000, 10000, 19000), labels = c("low", "mid", "high"), include.lowest=T))
+ })
   user  system elapsed 
   0.20    0.02    0.38 
> 
> str(x)
'data.frame':   431520 obs. of  11 variables:
 $ carat      : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut        : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color      : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity    : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth      : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table      : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price      : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x          : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y          : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z          : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
 $ price.bands: Factor w/ 3 levels "low","mid","high": 1 1 1 1 1 1 1 1 1 1 ...
>
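
If the slowdown only shows up on the real data, it may help to time cut() on its own, outside of mutate(), and compare it with an alternative binning call. The following is a minimal sketch, not a measured result: the price vector is synthetic (an assumption standing in for the real column), and the findInterval() variant is a substitute technique that is not part of the original answer. Its left.open argument requires R 3.3.0 or later.

```r
# Synthetic stand-in for the price column (assumption: integer prices in
# 326..18823, same number of rows as in the str() output above).
set.seed(1)
price  <- sample(326:18823, 431520, replace = TRUE)
breaks <- c(326, 1000, 10000, 19000)
labs   <- c("low", "mid", "high")

# Time cut() by itself, without dplyr involved.
t_cut <- system.time(
  bands_cut <- cut(price, breaks, labels = labs, include.lowest = TRUE)
)

# Alternative sketch: findInterval() with left-open intervals, so the bins are
# [326, 1000], (1000, 10000], (10000, 19000], the same as the cut() call above.
t_fi <- system.time({
  idx <- findInterval(price, breaks, left.open = TRUE, rightmost.closed = TRUE)
  # Out-of-range indices (0 or 4) are not in `levels`, so they become NA.
  bands_fi <- factor(idx, levels = seq_along(labs), labels = labs)
})

print(t_cut)
print(t_fi)

# Both approaches should assign every row to the same band.
stopifnot(identical(as.integer(bands_cut), as.integer(bands_fi)))
```

If cut() alone is fast on the real column, the 58 seconds are more likely spent elsewhere (for example in how the 96-column data frame is handled) than in the binning itself.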