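As ?sort says, if the argument partial is not NULL, it is taken to contain indices of elements of the result which are to be placed in their correct positions in the sorted array by partial sorting. See Argument “partial” of the sort function in R for more information. So if I need the 5 smallest numbers in
x <- sample(1:100, 50)
then
sort(x, partial = 1:5)[1:5]
is faster than
sort(x)[1:5]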
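To make the guarantee quoted from ?sort concrete, here is a minimal sketch (the variable y and the choice partial = 5 are illustrative, not from the question). After a partial sort, the element at each requested index is the one a full sort would put there; no element before it is larger and no element after it is smaller, but those neighbours are not necessarily ordered among themselves.

```r
set.seed(42)
x <- sample(1:100, 50)

# Ask only for position 5 to be correct
y <- sort(x, partial = 5)

y[5] == sort(x)[5]      # the 5th smallest value sits at index 5
all(y[1:4] <= y[5])     # earlier elements are not larger (order among them is unspecified)
all(y[6:50] >= y[5])    # later elements are not smaller (order among them is unspecified)
```

All three comparisons should evaluate to TRUE. Passing partial = 1:5 fixes each of the first five positions, which is why the first five elements then come out fully sorted.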
But how can I find the 5 largest numbers with partial sorting? Intuitively, I tried:
sort(x, partial = 1:5, decreasing = T)
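but got
Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) : unsupported options for partial sorting
So my question is how to get the same efficiency gain in this case.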
Answer 0 (score: 6)
You can take the tail of the partially sorted vector:
set.seed(42)
x <- sample(1:100, 50)
# sort(x, partial = 1:5)[1:5] ## head
p <- length(x)+1 - (1:5) ## tail
sort(x, partial = p)[p]
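The values come back largest first because p runs from length(x) downwards; use rev() if you need them in increasing order.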
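The same idea can be wrapped in a small helper. This is only a sketch: the function name top_k and its argument k are hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the original answer):
# returns the k largest values of x, largest first.
top_k <- function(x, k = 5L) {
  n <- length(x)
  stopifnot(k >= 1L, k <= n)
  p <- n + 1L - seq_len(k)   # positions n, n - 1, ..., n - k + 1
  sort(x, partial = p)[p]    # the elements at positions p are in their sorted places
}

set.seed(42)
x <- sample(1:100, 50)
identical(top_k(x, 5L), sort(x, decreasing = TRUE)[1:5])
```

Wrapping the result in rev() gives the same values in increasing order.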
Answer 1 (score: 4)
You can still benefit from the speed-up with something like this (assuming numeric data):
-sort(-x, partial = 1:5)[1:5]
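Why this works: negation reverses the ordering, so the 5 smallest values of -x are the 5 largest values of x, and negating again restores them, largest first. A minimal sketch checking this against a full sort (the names via_negation and full are illustrative). The trick assumes x supports unary minus, so it would not apply to character vectors, for example.

```r
set.seed(42)
x <- sample(1:100, 50)

# `[` binds tighter than unary minus, so this is -(sort(-x, partial = 1:5)[1:5])
via_negation <- -sort(-x, partial = 1:5)[1:5]
full <- sort(x, decreasing = TRUE)[1:5]

identical(via_negation, full)
```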
Benchmarking:
set.seed(3)
x <- sample(1:100000, 500000, replace = TRUE)
bench::mark(
snoram = -sort(-x, partial = 1:5)[1:5],
OP = sort(x, decreasing = TRUE)[1:5],
sotos_check = x[order(x, decreasing = TRUE)][1:5],
jogo = {p <- length(x) - 0:4; sort(x, partial = p)[p]}
)
# A tibble: 4 x 14
  expression       min     mean   median      max `itr/sec` mem_alloc  n_gc n_itr total_time result    memory             time     gc
  <chr>       <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>    <list>             <list>   <list>
1 snoram        6.87ms   7.77ms   7.43ms  15.04ms      129.    5.72MB     9    34      264ms <int [5]> <Rprofmem [3 x 3]> <bch:tm> <tibble [43 x 3]>
2 OP            17.4ms  18.96ms  18.56ms  24.37ms      52.7    3.81MB     3    21      398ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [24 x 3]>
3 sotos_check  14.65ms  17.07ms  16.48ms  25.58ms      58.6    3.81MB     4    23      393ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [27 x 3]>
4 jogo          4.98ms   5.45ms   5.35ms   8.91ms      184.    3.81MB     6    37      201ms <int [5]> <Rprofmem [2 x 3]> <bch:tm> <tibble [43 x 3]>
Answer 2 (score: 0)
You can also use C++'s partial_sort through Rcpp, with a file containing the following:
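#include "Rcpp.h"
#include <algorithm>
#include <stdexcept>
using namespace Rcpp;

// comparator for decreasing order
inline bool rev_comp(double const i, double const j){
  return i > j;
}

// [[Rcpp::export(rng = false)]]
NumericVector cpp_partial_sort(NumericVector x, unsigned const k) {
  // k must be between 1 and length(x)
  if(k < 1 || static_cast<R_xlen_t>(k) > x.size())
    throw std::invalid_argument("Invalid k");
  // work on a copy so the caller's vector is not modified in place
  NumericVector out = clone(x);
  // put the k largest values, in decreasing order, into out[0], ..., out[k - 1];
  // the range must end at out.end() so that the last element is considered too
  std::partial_sort(out.begin(), out.begin() + k, out.end(), rev_comp);
  return out;
}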
We can now confirm that we get the same results, and then benchmark:
# simulate data
set.seed(2)
x <- rnorm(10000)
# they all give the same
rk <- 5
setdiff(cpp_partial_sort(x, rk)[1:rk],
-sort(-x, partial = 1:rk)[1:rk])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk],
sort(x, decreasing = TRUE)[1:5])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk],
x[order(x, decreasing = TRUE)][1:rk])
#R> numeric(0)
setdiff(cpp_partial_sort(x, rk)[1:rk],
{ p <- length(x) - 0:(rk - 1); sort(x, partial = p)[p] })
#R> numeric(0)
# benchmark
microbenchmark::microbenchmark(
cpp = cpp_partial_sort(x, rk)[1:rk],
snoram = -sort(-x, partial = 1:5)[1:5],
OP = sort(x, decreasing = TRUE)[1:5],
sotos_check = x[order(x, decreasing = TRUE)][1:5],
jogo = {p <- length(x) - 0:4; sort(x, partial = p)[p]}, times = 1000)
#R> Unit: microseconds
#R> expr min lq mean median uq max neval
#R> cpp 23.7 26.1 32.2 27 28 4384 1000
#R> snoram 174.3 185.2 208.3 188 194 3968 1000
#R> OP 528.6 558.4 595.9 562 574 4630 1000
#R> sotos_check 474.9 504.4 550.7 507 519 4446 1000
#R> jogo 172.1 182.1 194.7 186 190 3744 1000
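There is a one-time compilation cost, but it is amortized if cpp_partial_sort is called many times. The solution can be made more generic with a template version, like I show here.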