Partial sort in decreasing order

Time: 2019-01-22 07:33:41

Tags: r sorting

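As ?sort says, if the argument partial is not NULL, it is taken to contain indices of elements of the result which are to be placed in their correct positions in the sorted array by partial sorting. You can read Argument “partial” of the sort function in R for more details. So if I need the 5 smallest numbers in x <- sample(1:100, 50), I can use the first call below, a partial sort, rather than the second, a full sort: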

sort(x, partial = 1:5)[1:5]

sort(x)[1:5]
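
To make the guarantee of partial concrete, here is a minimal sketch comparing the two calls. The seed and the names ps and fs are assumptions chosen for illustration; they are not part of the question.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- sample(1:100, 50)

ps <- sort(x, partial = 1:5)     # partial sort on positions 1 to 5
fs <- sort(x)                    # full sort, for comparison

# Positions 1:5 hold the same values in both results.
identical(ps[1:5], fs[1:5])

# Every element after position 5 is at least as large as the one at position 5,
# but the tail itself is not guaranteed to be sorted.
all(ps[6:50] >= ps[5])
is.unsorted(ps)
```

The first two checks follow from the documented behaviour of partial; the last line only reports whether the remaining elements happen to be out of order for this particular sample.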

But how do I find the 5 largest numbers with a partial sort? Intuitively, I tried:

sort(x, partial = 1:5, decreasing = T)

but got

Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) : 
  unsupported options for partial sorting

So my question is how to get the same efficiency in this case.

3 answers:

Answer 0: (score: 6)

You can take the tail of the partially sorted vector:

set.seed(42)
x <- sample(1:100, 50)
# sort(x, partial = 1:5)[1:5] ## head

p <- length(x)+1 - (1:5) ## tail
sort(x, partial = p)[p]

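Because p runs from length(x) downwards, the result already lists the largest value first. If you want the values in increasing order instead, you can reverse the result with rev().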
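
The tail approach can be wrapped in a small helper. This is only a sketch: top_k is a hypothetical name, and it assumes x contains no NA values (sort() drops them by default, which would shift the positions).

```r
# Hypothetical helper: the k largest values of x, largest first.
# Assumes x has no NA values.
top_k <- function(x, k = 5) {
  n <- length(x)
  stopifnot(k >= 1, k <= n)
  p <- n + 1 - seq_len(k)        # positions n, n - 1, ..., n - k + 1
  sort(x, partial = p)[p]        # only these positions are guaranteed correct
}

set.seed(42)
x <- sample(1:100, 50)
top <- top_k(x, 5)

identical(top, sort(x, decreasing = TRUE)[1:5])  # same values as a full sort
rev(top)                                         # same values, smallest first
```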

Answer 1: (score: 4)

You may still benefit from the speed gain with, for example (assuming numeric data):

-sort(-x, partial = 1:5)[1:5]
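
As a quick sanity check of the negation trick, the sketch below compares it with a full decreasing sort on a small vector. The seed and the names neg and full are assumptions made for illustration.

```r
set.seed(42)
x <- sample(1:100, 50)

neg  <- -sort(-x, partial = 1:5)[1:5]     # negate, partial sort, negate back
full <- sort(x, decreasing = TRUE)[1:5]   # reference: full decreasing sort

identical(neg, full)

# The trick relies on unary minus, so it does not apply to character vectors:
# -c("a", "b") raises "invalid argument to unary operator".
```

Indexing binds tighter than unary minus in R, so the leading minus negates the five extracted values rather than the whole sorted vector, which gives the same result with less work.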

Benchmark:

set.seed(3)
x <- sample(1:100000, 500000, replace = TRUE)

bench::mark(
  snoram = -sort(-x, partial = 1:5)[1:5],
  OP = sort(x, decreasing = TRUE)[1:5],
  sotos_check = x[order(x, decreasing = TRUE)][1:5],
  jogo = {p <- length(x) - 0:4; sort(x, partial = p)[p]}
)
# A tibble: 4 x 14
  expression       min     mean   median      max `itr/sec` mem_alloc  n_gc n_itr total_time result    memory             time     gc               
  <chr>       <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>    <list>             <list>   <list>           
1 snoram        6.87ms   7.77ms   7.43ms  15.04ms     129.     5.72MB     9    34      264ms <int [5]> <Rprofmem [3 x 3]> <bch:tm> <tibble [43 x 3]>
2 OP            17.4ms  18.96ms  18.56ms  24.37ms      52.7    3.81MB     3    21      398ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [24 x 3]>
3 sotos_check  14.65ms  17.07ms  16.48ms  25.58ms      58.6    3.81MB     4    23      393ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [27 x 3]>
4 jogo          4.98ms   5.45ms   5.35ms   8.91ms     184.     3.81MB     6    37      201ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [43 x 3]>

Answer 2: (score: 0)

You can also use C++'s partial_sort through Rcpp, with a file containing the following:

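#include <Rcpp.h>
#include <algorithm>
#include <stdexcept>
using namespace Rcpp;

// Comparator for decreasing order.
inline bool rev_comp(double const i, double const j){ 
  return i > j; 
}

// Returns a copy of x whose first k elements are the k largest values,
// largest first; the order of the remaining elements is unspecified.
// [[Rcpp::export(rng = false)]]
NumericVector cpp_partial_sort(NumericVector x, unsigned const k) {
  if(k < 1 || static_cast<R_xlen_t>(k) > x.size())
    throw std::invalid_argument("Invalid k");
  
  NumericVector out = clone(x);  // leave the caller's vector untouched
  // Sort the range [begin, begin + k) out of the full range [begin, end).
  std::partial_sort(out.begin(), out.begin() + k, out.end(), rev_comp);
  return out;
}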

We can now confirm that we get the same results and run a benchmark:

# simulate data
set.seed(2)
x <- rnorm(10000)

# they all give the same
rk <- 5
setdiff(cpp_partial_sort(x, rk)[1:rk], 
        -sort(-x, partial = 1:rk)[1:rk])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk], 
        sort(x, decreasing = TRUE)[1:rk])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk], 
        x[order(x, decreasing = TRUE)][1:rk])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk], 
        { p <- length(x) - 0:(rk - 1); sort(x, partial = p)[p] })
#R> numeric(0)

# benchmark 
microbenchmark::microbenchmark(
  cpp = cpp_partial_sort(x, rk)[1:rk], 
  snoram = -sort(-x, partial = 1:5)[1:5],
  OP = sort(x, decreasing = TRUE)[1:5],
  sotos_check = x[order(x, decreasing = TRUE)][1:5],
  jogo = {p <- length(x) - 0:4; sort(x, partial = p)[p]}, times = 1000)
#R> Unit: microseconds
#R>         expr   min    lq  mean median  uq  max neval
#R>          cpp  23.7  26.1  32.2     27  28 4384  1000
#R>       snoram 174.3 185.2 208.3    188 194 3968  1000
#R>           OP 528.6 558.4 595.9    562 574 4630  1000
#R>  sotos_check 474.9 504.4 550.7    507 519 4446  1000
#R>         jogo 172.1 182.1 194.7    186 190 3744  1000

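Compilation takes some time, but that cost is offset if cpp_partial_sort is called many times. The solution can be made more general with a template version, like I show here.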