Question

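I have a numeric matrix with about 10 million values, and I only need to show the distribution of the values in a histogram. In base R, hist() does this very quickly. With ggplot, however, it is much slower (I also have to melt the matrix first, but that is not the time-limiting step). Is there a way to do this quickly with ggplot?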

library(microbenchmark)
library(ggplot2)


mtx1 <- matrix(rnorm(6e4*150), nrow = 6e4)
df1 <- reshape2::melt(mtx1)

g_hist <- function(df){
  print(ggplot(df, aes(x=value)) + geom_histogram(bins=30))
}

print(microbenchmark(
  hist(mtx1), 
  g_hist(df1), 
times=3L 
), signif=3)


# Unit: milliseconds
#        expr  min   lq mean median   uq  max neval
#  hist(mtx1)  384  471  530    559  603  647     3
# g_hist(df1) 7710 8000 8190   8300 8440 8570     3

Answer 1

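Here is a solution that uses the base R hist() function to compute the histogram bins and the bin counts. (Computing the bins does indeed appear to be the source of the bottleneck in geom_histogram().)

I then pass the computed bin counts and bin boundaries to geom_rect() to draw a histogram that looks very similar to the one produced by geom_histogram().

It still takes longer than base hist(), but about 1.5 times as long rather than 20 times.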

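quick_hist = function(values_vec, breaks=50) {
    # Let base R compute the bin edges and counts without drawing anything.
    # A single number for breaks is only a suggestion to hist().
    res = hist(values_vec, plot=FALSE, breaks=breaks)

    # One row per bin: consecutive breaks give the left and right edges.
    dat = data.frame(xmin=head(res$breaks, -1L),
                     xmax=tail(res$breaks, -1L),
                     ymin=0.0,
                     ymax=res$counts)

    # Draw the bins as rectangles (use size= instead of linewidth= on ggplot2 < 3.4.0).
    ggplot(dat, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax)) +
        geom_rect(linewidth=0.5, colour="grey30", fill="grey80")
}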
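
To make the head()/tail() step easier to follow, here is a small sketch of what hist(..., plot=FALSE) returns. It is only an illustration on a seeded sample of 1000 values (the names x, left, right and n_bins are made up for this example), not part of the benchmark:

```r
# Small, seeded sample so the structure is easy to inspect.
set.seed(1)
x <- rnorm(1000)

res <- hist(x, plot = FALSE, breaks = 30)

# n bins are described by n + 1 break points.
n_bins <- length(res$counts)
stopifnot(length(res$breaks) == n_bins + 1L)

# Dropping the last break gives the left edges of the bins,
# dropping the first break gives the right edges.
left  <- head(res$breaks, -1L)
right <- tail(res$breaks, -1L)
stopifnot(length(left) == n_bins, all(right > left))

# Every finite value is counted in exactly one bin.
stopifnot(sum(res$counts) == length(x))
```

Because the break points always outnumber the counts by one, the two shifted copies of res$breaks line up row by row with res$counts in the data frame.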


ggsave("quick_hist.png", 
       plot=quick_hist(mtx1) + theme_bw(), 
       width=8, height=4, dpi=150)


print(microbenchmark(hist(mtx1), 
                     g_hist(df1), 
                     print(quick_hist(mtx1, breaks=30)),
                     times=5L), signif=3)

# Unit: milliseconds
#                                  expr  min   lq mean median   uq  max neval
#                            hist(mtx1)  264  270  305    298  332  359     5
#                           g_hist(df1) 5740 5760 6180   5770 5920 7700     5
#  print(quick_hist(mtx1, breaks = 30))  407  418  440    433  440  503     5
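
One thing to keep in mind when comparing the plots: hist() treats a single number passed as breaks as a suggestion and picks "pretty" break points, so breaks=30 will not necessarily give 30 bins. If an exact number of equal-width bins is wanted, one option is to pass an explicit vector of break points. The following is a rough sketch under that assumption; exact_bin_table() and quick_hist_exact() are hypothetical names introduced here, the bins simply span the range of the data (so they will not match geom_histogram()'s bin placement exactly), and the input is assumed to contain at least two distinct finite values:

```r
library(ggplot2)

# Hypothetical helper: one row per bin, with exactly `bins` equal-width bins
# spanning the range of the finite values.
exact_bin_table <- function(values_vec, bins = 30L) {
  x <- values_vec[is.finite(values_vec)]      # hist() also drops non-finite values
  rng <- range(x)
  brks <- seq(rng[1], rng[2], length.out = bins + 1L)
  res <- hist(x, breaks = brks, plot = FALSE)  # explicit breaks are used as given
  data.frame(xmin = head(res$breaks, -1L),
             xmax = tail(res$breaks, -1L),
             ymin = 0,
             ymax = res$counts)
}

# Hypothetical plotting wrapper, drawn the same way as quick_hist().
quick_hist_exact <- function(values_vec, bins = 30L) {
  dat <- exact_bin_table(values_vec, bins)
  ggplot(dat, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax)) +
    # use size= instead of linewidth= on ggplot2 < 3.4.0
    geom_rect(linewidth = 0.5, colour = "grey30", fill = "grey80")
}

set.seed(42)
demo_bins <- exact_bin_table(rnorm(1e5), bins = 30L)
```

It would be called like quick_hist(), for example print(quick_hist_exact(mtx1, bins=30)). I have not benchmarked this variant, so the timings above do not apply to it.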
