Question

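I have a large vector containing a bunch of double elements. Given a vector of percentiles, e.g. percentile_vec = c(0.90, 0.91, 0.92, 0.93, 0.94, 0.95), I currently sort the large vector with an Rcpp sort function and then look up the corresponding percentile values. Here is the main code: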

// [[Rcpp::export]]
NumericVector sort_rcpp(Rcpp::NumericVector& x)
{
  std::vector<double> tmp = Rcpp::as<std::vector<double>> (x);    // or NumericVector tmp = clone(x);
  std::sort(tmp.begin(), tmp.end());
  return wrap(tmp);
}

// [[Rcpp::export]]
NumericVector percentile_rcpp(Rcpp::NumericVector& x, Rcpp::NumericVector& percentile)
{
  NumericVector tmp_sort = sort_rcpp(x);
  int n_per = percentile.size();
  NumericVector percentile_vec = no_init(n_per);
  for (int ii = 0; ii < n_per; ii++)
  {
    // 1-based rank of the requested percentile in the sorted vector
    double rank = tmp_sort.size() * percentile[ii];
    double rank_round;
    if (rank < 1.0)
    {
      rank_round = 1.0;  // clamp so the 0-based index below never goes negative
    }
    else
    {
      rank_round = std::round(rank);
    }
    // Subtract 1 to get a 0-based index; when rank_round == tmp_sort.size() this is the last element, so there is no overflow
    percentile_vec[ii] = tmp_sort[static_cast<int>(rank_round) - 1];
  }
  return percentile_vec;
}

I also tried calling the R function quantile(x, c(.90, .91, .92, .93, .94, .95)) from Rcpp:

sub_percentile <- function (x)
{
  return (quantile(x, c(.90, .91, .92, .93, .94, .95)));
}  

source('C:/Users/~Call_R_function.R')

The benchmark results for x = runif(1E6) are listed below:

microbenchmark(sub_percentile(x)->aa, percentile_rcpp(x, c(.90, .91, .92, .93, .94, .95))->bb)
#Unit: milliseconds
              expr      min       lq     mean   median       uq       max   neval
  sub_percentile(x) 99.00029 99.24160 99.35339 99.32162 99.41869 100.57160   100
 percentile_rcpp(~) 87.13393 87.30904 87.44847 87.40826 87.51547  88.41893   100

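I expected a fast percentile calculation, but I suspect that std::sort(tmp.begin(), tmp.end()) is what slows it down. Is there a better way to get fast results using C++, Rcpp / RcppArmadillo? Thanks.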

Answer 1

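The branch inside the loop can certainly be optimized away. Use std::min / std::max calls on integers instead.

I would compute the array index for a percentile this way: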

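std::size_t PerCentIndex( double pc, std::size_t size )
{
    // Adding 0.5 before truncating rounds to the nearest 0-based index
    return static_cast<std::size_t>( 0.5 + static_cast<double>( size - 1 ) * pc );
}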

Then only this line is needed in the middle of the loop above:

percentile_vec[ii] 
 = tmp_sort[ PerCentIndex( percentile[ii], tmp_sort.size() ) ];
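
As a rough illustration of the std::min / std::max suggestion, the index computation could be clamped as in the sketch below. The name percentile_index is hypothetical, and the sketch assumes the percentile is given as a fraction in [0, 1] and that rounding to the nearest position is acceptable. Note that this convention (based on size - 1) is not identical to the round(size * p) - 1 indexing used in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of Rcpp): map a fraction p to a 0-based
// index into a sorted array of n elements, without any if/else branch.
std::size_t percentile_index(double p, std::size_t n) {
    assert(n > 0);
    // Clamp p into [0, 1] so bad input cannot produce an out-of-range index.
    const double clamped = std::min(1.0, std::max(0.0, p));
    // Adding 0.5 before truncation rounds to the nearest position.
    const std::size_t idx =
        static_cast<std::size_t>(0.5 + static_cast<double>(n - 1) * clamped);
    // Integer clamp as a final guard against floating-point surprises.
    return std::min(idx, n - 1);
}
```

With this helper, the loop body would reduce to a single lookup such as tmp_sort[percentile_index(percentile[ii], tmp_sort.size())].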

Answer 2

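Depending on how many percentiles you need and how large the vector is, you can do much better (only O(N)) than sorting the whole vector (O(N log N) at best).

I had to compute 1 percentile of vectors with >= 160K elements, so this is what I did: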

void prctile_stl(double* in, const dim_t &len, const double &percent, std::vector<double> &range) {
// Calculates "percent" percentile.
// Linear interpolation inspired by prctile.m from MATLAB.

double r = (percent / 100.) * len;

double lower = 0;
double upper = 0;
double* min_ptr = NULL;
dim_t k = 0;

if(r >= len / 2.) {     // Second half is smaller
    dim_t idx_lo = max(r - 1, (double) 0.);
    nth_element(in, in + idx_lo, in + len);             // Complexity O(N)
    lower = in[idx_lo];
    if(idx_lo < len - 1) {
        min_ptr = min_element(&(in[idx_lo + 1]), in + len);
        upper = *min_ptr;
        }
    else
        upper = lower;
    }
else {                  // First half is smaller
    double* max_ptr;
    dim_t idx_up = ceil(max(r - 1, (double) 0.));
    nth_element(in, in + idx_up, in + len);             // Complexity O(N)
    upper = in[idx_up];
    if(idx_up > 0) {
        max_ptr = max_element(in, in + idx_up);
        lower = *max_ptr;
        }
    else
        lower = upper;
    }

// Linear interpolation
k = r + 0.5;        // Implicit floor
r = r - k;
range[1] = (0.5 - r) * lower + (0.5 + r) * upper;

min_ptr = min_element(in, in + len);
range[0] = *min_ptr;
}
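
Since the question asks for several percentiles at once, the same std::nth_element idea can be applied repeatedly. The following is only a minimal sketch under stated assumptions: the function name percentiles_nth is hypothetical, the percentiles are fractions in [0, 1] given in ascending order, no interpolation is done (the nearest element is returned, using the index convention from Answer 1), and plain std::vector is used instead of Rcpp types so the example stands alone.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: nearest-element percentiles via repeated nth_element.
// `data` is taken by value because nth_element reorders its input.
// `probs` must be sorted ascending, with every value in [0, 1].
std::vector<double> percentiles_nth(std::vector<double> data,
                                    const std::vector<double>& probs) {
    assert(!data.empty());
    assert(std::is_sorted(probs.begin(), probs.end()));
    const std::size_t n = data.size();
    std::vector<double> out;
    out.reserve(probs.size());
    std::size_t first = 0;  // elements before `first` are <= everything after
    for (double p : probs) {
        assert(p >= 0.0 && p <= 1.0);
        // Round (n - 1) * p to the nearest 0-based index, then clamp.
        std::size_t idx =
            static_cast<std::size_t>(0.5 + static_cast<double>(n - 1) * p);
        idx = std::min(idx, n - 1);
        // Only the tail [first, n) still needs to be partitioned.
        std::nth_element(data.begin() + static_cast<std::ptrdiff_t>(first),
                         data.begin() + static_cast<std::ptrdiff_t>(idx),
                         data.end());
        out.push_back(data[idx]);
        first = idx;
    }
    return out;
}
```

Each call to std::nth_element is linear on average in the length of the range it is given, and for high percentiles such as 0.90 to 0.95 the later calls only touch the remaining upper tail. In an Rcpp function the input could be copied with Rcpp::as<std::vector<double>>(x), as the question already does, and the result returned with wrap.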

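Another option is the IQAgent algorithm from Numerical Recipes, 3rd ed. It was originally designed for data streams, but you can trick it by splitting the large data vector into smaller chunks (e.g. 10K elements) and computing the percentiles of each chunk (using a sort on the 10K chunk). If you process one chunk at a time, each successive chunk slightly adjusts the percentile estimate until you end up with a pretty good approximation. The algorithm gives good results (accurate to the 3rd or 4th decimal place), but it is still slower than the nth_element implementation.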

How to do fast percentile calculation in C++ / Rcpp
