Rcpp precision problem

Time: 2016-05-04 19:02:57

Tags: c++ r rounding precision rcpp

Consider the following C++ function, defined in R using Rcpp:

cppFunction('long double statZn_cpp(NumericVector dat, double kn) {
  double n = dat.size();
  // Get total sum and sum of squares; this will be the "upper sum"
  // (i.e. the sum above k)
  long double s_upper, s_square_upper;
  // The "lower sums" (i.e. those below k)
  long double s_lower, s_square_lower;
  // Get lower sums
  // Go to kn - 1 to prevent double-counting in main
  // loop
  for (int i = 0; i < kn - 1; ++i) {
    s_lower += dat[i];
    s_square_lower += dat[i] * dat[i];
  }
  // Get upper sum
  for (int i = kn - 1; i < n; ++i) {
    s_upper += dat[i];
    s_square_upper += dat[i] * dat[i];
  }
  // The maximum, which will be returned
  long double M = 0;
  // A candidate for the new maximum, used in a loop
  long double M_candidate;

  // Compute the test statistic
  for (int k = kn; k <= (n - kn); ++k) {
    // Update s and s_square for both lower and upper
    s_lower += dat[k-1];
    s_square_lower += dat[k-1] * dat[k-1];
    s_upper -= dat[k-1];
    s_square_upper -= dat[k-1] * dat[k-1];

    // Get estimate of sd for this k
    long double sdk = sqrt((s_square_lower - pow(s_lower, 2.0) / k +
                      s_square_upper -
                      pow(s_upper, 2.0) / (n - k))/n);
    M_candidate = abs(s_lower / k - s_upper / (n - k)) / sdk;
    // Choose new maximum
    if (M_candidate > M) {
      M = M_candidate;
    }
  }

  return M * sqrt(kn);
}')

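Try the command statZn_cpp(1:20,4) and you will get 6.963106, which is the correct answer. Scaling should not matter; statZn_cpp(1:20*10,4) also yields the correct answer of 6.963106. But statZn_cpp(1:20/10,4) yields the wrong answer of 6.575959, and statZn_cpp(1:20/100,4) gives the obviously wrong answer of 0. More to the point (and relevant to my research, which involves simulation studies), when I try statZn_cpp(rnorm(20),4), the answer is almost always 0, which is wrong.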

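Clearly the problem has to do with rounding errors, but I don't know where they occur or how to fix them (I am new to C++). I have tried to extend precision as much as possible. Is there a way to fix the rounding problem? (An R wrapper function is acceptable if I should attempt what amounts to a preprocessing step, but it must be robust and work for general levels of precision.)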

Edit: Here is some "equivalent" R code:

statZn <- function(dat, kn = function(n) {floor(sqrt(n))}) {
  n = length(dat)
  return(sqrt(kn(n))*max(sapply(
      floor(kn(n)):(n - floor(kn(n))), function(k)
        abs(1/k*sum(dat[1:k]) -
              1/(n-k)*sum(dat[(k+1):n]))/sqrt((sum((dat[1:k] -
               mean(dat[1:k]))^2)+sum((dat[(k+1):n] -
               mean(dat[(k+1):n]))^2))/n))))
}

In addition, the R code below basically replicates the approach the C++ code is supposed to use. It does get the correct answer.

  n = length(dat)
  s_lower = 0
  s_square_lower = 0
  s_upper = 0
  s_square_upper = 0
  for (i in 1:(kn-1)) {
    s_lower = s_lower + dat[i]
    s_square_lower = s_square_lower + dat[i] * dat[i]
  }
  for (i in kn:n) {
    s_upper = s_upper + dat[i]
    s_square_upper = s_square_upper + dat[i] * dat[i]
  }
  M = 0

  for (k in kn:(n-kn)) {
    s_lower = s_lower + dat[k]
    s_square_lower = s_square_lower + dat[k] * dat[k]
    s_upper = s_upper - dat[k]
    s_square_upper = s_square_upper - dat[k] * dat[k]

    sdk = sqrt((s_square_lower - (s_lower)^2/k +
                         s_square_upper -
                         (s_upper)^2/(n-k))/n)
    M_candidate = sqrt(kn) * abs(s_lower / k - s_upper / (n - k)) / sdk

    cat('k', k, '\n',
        "s_lower", s_lower, '\n',
        's_square_lower', s_square_lower, '\n',
        's_upper', s_upper, '\n',
        's_square_upper', s_square_upper, '\n',
        'sdk', sdk, '\n',
        'M_candidate', M_candidate, '\n\n')

    if (M_candidate > M) {
      M = M_candidate
    }
  }

1 answer:

Answer 0 (score: 7)

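1: You should not be using long double, because R represents all numeric values in the double type. Using a more precise type for intermediate calculations is extremely unlikely to provide any benefit, and is more likely to cause strange inconsistencies between platforms.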

2: You have not initialized s_upper, s_square_upper, s_lower, and s_square_lower. (You do initialize them in the R implementation, but you forgot to in the C++ implementation.)

3: A minor point, but I would replace the pow(x,2.0) calls with x*x. It doesn't really matter, though.

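4: This is what fixed it for me: you need to qualify your calls to C++ standard library functions with their containing namespace. In other words, std::sqrt() instead of just sqrt(), std::abs() instead of just abs(), and std::pow() instead of just pow(), if you continue to use it.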
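Point 4 can be illustrated with a small sketch. One plausible reading of the symptoms is that the unqualified abs() resolved to the C integer function int abs(int), which converts its argument to int and so drops the fractional part; whether that happens depends on the compiler and the headers in use. The two helper names below are hypothetical and exist only to make each behavior explicit and reproducible.

```cpp
#include <cmath>
#include <cstdlib>

// Hypothetical helper: imitates an abs() call that resolves to the
// integer version. The double is converted to int first, which
// truncates toward zero and loses the fractional part.
double abs_via_int(double x) {
    return std::abs(static_cast<int>(x));
}

// Hypothetical helper: the namespace-qualified call selects the double
// overload from <cmath>, so the fractional part is preserved.
double abs_via_double(double x) {
    return std::abs(x);
}
```

With these helpers, abs_via_int(-0.105) is 0 while abs_via_double(-0.105) is 0.105. That pattern would be consistent with the question's results: a difference of means of exactly 10 survives truncation, while differences below 1 collapse to 0.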

cppFunction('double statZn_cpp(NumericVector dat, double kn) {
  int n = dat.size();
  double s_upper = 0, s_square_upper = 0; // Get total sum and sum of squares; this will be the "upper sum" (i.e. the sum above k)
  double s_lower = 0, s_square_lower = 0; // The "lower sums" (i.e. those below k)
  for (int i = 0; i < kn - 1; ++i) { s_lower += dat[i]; s_square_lower += dat[i] * dat[i]; } // Get lower sums; Go to kn - 1 to prevent double-counting in main
  for (int i = kn - 1; i < n; ++i) { s_upper += dat[i]; s_square_upper += dat[i] * dat[i]; } // Get upper sum
  double M = 0; // The maximum, which will be returned
  double M_candidate; // A candidate for the new maximum, used in a loop
  // Compute the test statistic
  for (int k = kn; k <= (n - kn); ++k) {
    // Update s and s_square for both lower and upper
    s_lower += dat[k-1];
    s_square_lower += dat[k-1] * dat[k-1];
    s_upper -= dat[k-1];
    s_square_upper -= dat[k-1] * dat[k-1];
    // Get estimate of sd for this k
    double sdk = std::sqrt((s_square_lower - s_lower*s_lower / k + s_square_upper - s_upper*s_upper / (n - k))/n);
    M_candidate = std::abs(s_lower / k - s_upper / (n - k)) / sdk;
    if (M_candidate > M) M = M_candidate; // Choose new maximum
  }
  return std::sqrt(kn) * M;
}');

statZn_cpp(1:20,4); ## you will get 6.963106, which is the correct answer
## [1] 6.963106
statZn_cpp(1:20*10,4); ## Scaling should not matter; will also yield the correct answer of 6.963106
## [1] 6.963106
statZn_cpp(1:20/10,4); ## yields the wrong answer of 6.575959
## [1] 6.963106
statZn_cpp(1:20/100,4); ## again gives you the obviously wrong answer of 0.
## [1] 6.963106
set.seed(1L); statZn_cpp(rnorm(20),4); ## More to the point (and relevant to my research, which involves simulation studies), the answer is almost always 0, which is wrong.
## [1] 1.270117
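To check the corrected logic outside of R, here is a minimal sketch of the same computation in plain C++. It assumes std::vector<double> in place of Rcpp's NumericVector, an int kn with 1 <= kn and 2 * kn <= n, and the hypothetical names stat_zn and scaled_sequence; none of these come from the answer itself.

```cpp
#include <cmath>
#include <vector>

// Plain C++ port of the corrected function. std::vector<double> stands in
// for NumericVector (an assumption, so that this compiles without R).
// Assumes 1 <= kn and 2 * kn <= dat.size(); no bounds checking is done.
double stat_zn(const std::vector<double>& dat, int kn) {
    const int n = static_cast<int>(dat.size());
    // All four accumulators are explicitly initialized to zero.
    double s_lower = 0.0, s_square_lower = 0.0;
    double s_upper = 0.0, s_square_upper = 0.0;

    // Lower sums over the first kn - 1 values; the main loop adds the rest.
    for (int i = 0; i < kn - 1; ++i) {
        s_lower += dat[i];
        s_square_lower += dat[i] * dat[i];
    }
    // Upper sums over the remaining values.
    for (int i = kn - 1; i < n; ++i) {
        s_upper += dat[i];
        s_square_upper += dat[i] * dat[i];
    }

    double M = 0.0;  // running maximum
    for (int k = kn; k <= n - kn; ++k) {
        // Move element k (1-based) from the upper sums to the lower sums.
        const double x = dat[k - 1];
        s_lower += x;
        s_square_lower += x * x;
        s_upper -= x;
        s_square_upper -= x * x;

        // Pooled standard deviation estimate for this split point.
        const double sdk = std::sqrt((s_square_lower - s_lower * s_lower / k +
                                      s_square_upper - s_upper * s_upper / (n - k)) / n);
        // Namespace-qualified call, so the double overload is used.
        const double candidate = std::abs(s_lower / k - s_upper / (n - k)) / sdk;
        if (candidate > M) M = candidate;
    }
    return std::sqrt(static_cast<double>(kn)) * M;
}

// Hypothetical helper: builds (1, 2, ..., n) / divisor, like 1:n/divisor in R.
std::vector<double> scaled_sequence(int n, double divisor) {
    std::vector<double> v(n);
    for (int i = 0; i < n; ++i) v[i] = (i + 1) / divisor;
    return v;
}
```

Calling stat_zn(scaled_sequence(20, 1.0), 4) should give approximately 6.963106, and dividing the input by 10 or 100 should leave the result unchanged up to floating-point error, matching the outputs shown above.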