Question

正如?sort所说，如果参数 partial 不为NULL，则认为它包含结果元素的索引，这些元素的索引将放置在已排序数组中的正确位置通过部分排序。您可以阅读Argument “partial” of the sort function in R 了解更多信息。因此，如果我需要在x <- sample(1:100, 50)中找到最小的5个数字，然后

sort(x, partial = 1:5)[1:5]

比

快

sort(x)[1:5]

但是，如何找到部分排序的最大5个数字？凭直觉，我尝试使用：

sort(x, partial = 1:5, decreasing = T)

但是得到

sort.int（x，na.last = na.last，减少=减少，...）中的错误：不支持的部分排序选项

因此，我的问题是在这种情况下如何实现效率的效果。

Answer 1

您可以从排序后的向量中提取尾巴：

set.seed(42)
x <- sample(1:100, 50)
# sort(x, partial = 1:5)[1:5] ## head

p <- length(x)+1 - (1:5) ## tail
sort(x, partial = p)[p]

如果需要，您可以使用rev()

反转结果

Answer 2

您可能仍会从速度提升中受益，例如（假设数字数据）：

-sort(-x, partial = 1:5)[1:5]

基准化：

set.seed(3)
x <- sample(1:100000, 500000, replace = TRUE)

bench::mark(
  snoram = -sort(-x, partial = 1:5)[1:5],
  OP = sort(x, decreasing = TRUE)[1:5],
  sotos_check = x[order(x, decreasing = TRUE)][1:5],
  jogo = {p <- length(x) - 0:4; sort(x, partial = p)[p]}
)
# A tibble: 4 x 14
  expression       min     mean   median      max `itr/sec` mem_alloc  n_gc n_itr total_time result    memory             time     gc               
  <chr>       <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>    <list>             <list>   <list>           
1 snoram        6.87ms   7.77ms   7.43ms  15.04ms     129.     5.72MB     9    34      264ms <int [5]> <Rprofmem [3 x 3]> <bch:tm> <tibble [43 x 3]>
2 OP            17.4ms  18.96ms  18.56ms  24.37ms      52.7    3.81MB     3    21      398ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [24 x 3]>
3 sotos_check  14.65ms  17.07ms  16.48ms  25.58ms      58.6    3.81MB     4    23      393ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [27 x 3]>
4 jogo          4.98ms   5.45ms   5.35ms   8.91ms     184.     3.81MB     6    37      201ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [43 x 3]>

Answer 3

您还可以通过Rcpp将C ++的partial_sort与以下内容的文件一起使用：

include "Rcpp.h"
#include <algorithm>
using namespace Rcpp;

inline bool rev_comp(double const i, double const j){ 
  return i > j; 
}

// [[Rcpp::export(rng = false)]]
NumericVector cpp_partial_sort(NumericVector x, unsigned const k) {
  if(k >= x.size() or k < 1)
    throw std::invalid_argument("Invalid k");
  if(k + 1 == x.size())
    return x;
  
  NumericVector out = clone(x);
  std::partial_sort(&out[0], &out[k + 1], &out[x.size() - 1], rev_comp);
  return out;
}

我们现在可以确认我们得到相同的结果并进行基准测试：

# simulate data
set.seed(2)
x <- rnorm(10000)

# they all give the same
rk <- 5
setdiff(cpp_partial_sort(x, rk)[1:rk], 
        -sort(-x, partial = 1:rk)[1:rk])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk], 
        sort(x, decreasing = TRUE)[1:5])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk], 
        x[order(x, decreasing = TRUE)][1:rk])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk], 
        { p <- length(x) - 0:(rk - 1); sort(x, partial = p)[p] })
#R> numeric(0)

# benchmark 
microbenchmark::microbenchmark(
  cpp = cpp_partial_sort(x, rk)[1:rk], 
  snoram = -sort(-x, partial = 1:5)[1:5],
  OP = sort(x, decreasing = TRUE)[1:5],
  sotos_check = x[order(x, decreasing = TRUE)][1:5],
  jogo = {p <- length(x) - 0:4; sort(x, partial = p)[p]}, times = 1000)
#R> Unit: microseconds
#R>         expr   min    lq  mean median  uq  max neval
#R>          cpp  23.7  26.1  32.2     27  28 4384  1000
#R>       snoram 174.3 185.2 208.3    188 194 3968  1000
#R>           OP 528.6 558.4 595.9    562 574 4630  1000
#R>  sotos_check 474.9 504.4 550.7    507 519 4446  1000
#R>         jogo 172.1 182.1 194.7    186 190 3744  1000

有编译时间，但是如果多次调用cpp_partial_sort，则可以抵消。可以使用模板版本like I show here使该解决方案更通用。

减少部分排序

3 个答案: