Getting the same output as cut() using the speedier hist() or findInterval()?

Time: 2014-02-14 09:32:48

Tags: r histogram cut

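I read this article, http://www.r-bloggers.com/comparing-hist-and-cut-r-functions/, and found on my PC that hist() is about 4 times faster than cut(). My script loops over cut() many times, so the time savings matter. I therefore tried switching to the faster function, but I am having trouble getting exactly the output that cut() produces.

Starting from the example code below: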

data <- rnorm(10, mean=0, sd=1)  #generate data
my_breaks <- seq(-6, 6, by=1)  #create a vector that specifies my break points
cut(data, breaks=my_breaks)

I would like to get a vector of levels to which each data element is assigned using my break points, i.e. the exact output of cut:

 [1] (1,2]   (-1,0]  (0,1]   (1,2]   (0,1]   (-1,0]  (-1,0]  (0,1]   (-2,-1] (0,1]  
Levels: (-6,-5] (-5,-4] (-4,-3] (-3,-2] (-2,-1] (-1,0] (0,1] (1,2] (2,3] (3,4] (4,5] (5,6]

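My question: how do I use the elements of the hist() output (i.e. breaks, counts, density, mids, etc.) or findInterval to reach my goal?

Separately, I found an example using findInterval at https://stackoverflow.com/questions/12379128/r-switch-statement-on-comparisons, but it requires me to create the interval labels beforehand, which is not what I want.

Any help would be appreciated. Thanks in advance!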

2 answers:

Answer 0: (score: 6)

Here is an implementation based on your findInterval suggestion; it is 5-6 times faster than the classic cut:

cut2 <- function(x, breaks) {
  labels <- paste0("(",  breaks[-length(breaks)], ",", breaks[-1L], "]")
  return(factor(labels[findInterval(x, breaks)], levels=labels))
}

library(microbenchmark)

set.seed(1)
my_breaks <- seq(-6, 6, by=1)  # break points from the question
data <- rnorm(1e4, mean=0, sd=1)

microbenchmark(cut.default(data, my_breaks), cut2(data, my_breaks))

# Unit: microseconds
#                         expr      min        lq    median        uq      max neval
# cut.default(data, my_breaks) 3011.932 3031.1705 3046.5245 3075.3085 4119.147   100
#        cut2(data, my_breaks)  453.761  459.8045  464.0755  469.4605 1462.020   100

identical(cut(data, my_breaks), cut2(data, my_breaks))
# TRUE
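
One caveat is worth checking. By default findInterval treats intervals as closed on the left, whereas cut closes them on the right, and a value below the first break gets index 0, which is silently dropped when used as a subscript. With continuous random data inside the break range this does not show up, but it can matter for values lying exactly on a break or outside the range. The following is only a sketch: cut3 is a hypothetical name, it assumes R >= 3.3.0 (for the left.open argument of findInterval), and, like cut2, it builds labels with paste0, so they match those of cut only when the breaks print without rounding.

```r
# cut3 is a hypothetical variant of cut2, not part of base R.
# Assumes R >= 3.3.0, where findInterval() has the left.open argument.
cut3 <- function(x, breaks) {
  labels <- paste0("(", breaks[-length(breaks)], ",", breaks[-1L], "]")
  # left.open = TRUE gives (a, b] intervals, matching cut()'s default right = TRUE
  idx <- findInterval(x, breaks, left.open = TRUE)
  # index 0 means x <= min(breaks); cut() returns NA for such values
  idx[which(idx == 0L)] <- NA_integer_
  # an index equal to length(breaks) lies past the last label, so labels[idx] is NA
  factor(labels[idx], levels = labels)
}

my_breaks <- seq(-6, 6, by = 1)
edge <- c(-7, -6, -5.5, 0, 1, 6, 6.5)  # break values and out-of-range values

# compare the two results element by element
as.character(cut3(edge, my_breaks))
as.character(cut(edge, my_breaks))
```

If the data are known to lie strictly inside the break range and never on a break, the simpler cut2 above should behave the same way.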

Answer 1: (score: 4)

The hist function creates binned counts in much the same way as a combination of table and cut. For example,

set.seed(1)
x <- rnorm(100)

hist(x, plot = FALSE)
## $breaks
##  [1] -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5
## 
## $counts
##  [1]  1  3  7 14 21 20 19  9  4  2

table(cut(x, seq.int(-2.5, 2.5, 0.5)))
## (-2.5,-2] (-2,-1.5] (-1.5,-1] (-1,-0.5]  (-0.5,0]   (0,0.5]   (0.5,1]
##         1         3         7        14        21        20        19
##   (1,1.5]   (1.5,2]   (2,2.5] 
##         9         4         2
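
The agreement between the two sets of counts can also be checked programmatically. This is a small sketch under one assumption: hist defaults to include.lowest = TRUE, so the same option is passed to cut here to keep the comparison like for like. The names h and binned are illustrative only.

```r
set.seed(1)
x <- rnorm(100)

h <- hist(x, plot = FALSE)
# hist() defaults to include.lowest = TRUE, so pass the same option to cut()
binned <- table(cut(x, h$breaks, include.lowest = TRUE))

# compare the counts only, ignoring the names carried by the table
all(h$counts == as.vector(binned))
```

Note that this compares bin totals only; hist does not record which bin each individual element fell into.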

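If you want the raw output of cut, you cannot use hist.

However, if the speed of cut is a problem (and you may want to double check that it really is the slow part of your analysis; see "premature optimization is the root of all evil"), then you can use the lower-level .bincode. This skips the input checking and label creation that cut performs.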
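
As a rough sketch of the .bincode route, the call below returns integer bin codes rather than a factor. The variable names, and the optional step that rebuilds a factor with paste0 labels, are assumptions added for illustration; such labels match those of cut only when the breaks print without rounding.

```r
set.seed(1)
x <- rnorm(100)
breaks <- seq.int(-2.5, 2.5, 0.5)

# integer bin codes; same defaults as cut(): right = TRUE, include.lowest = FALSE
codes <- .bincode(x, breaks)

# optional: rebuild a factor from the codes (labels built by hand, an assumption)
labels <- paste0("(", breaks[-length(breaks)], ",", breaks[-1L], "]")
f <- factor(codes, levels = seq_along(labels), labels = labels)

# the codes should agree with the integer codes behind cut()'s factor
identical(codes, as.integer(cut(x, breaks)))
```

If only the bin index is needed inside the loop, the factor step can be skipped entirely and the labels attached once at the end.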