Question

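The GCC implementation of the C mathematical library on Debian systems apparently has an (IEEE 754-2008)-compliant implementation of the function exp, which implies that rounding shall always be correct:

(from Wikipedia) The IEEE floating point standard guarantees that add, subtract, multiply, divide, fused multiply-add, square root, and floating point remainder will give the correctly rounded result of the infinite precision operation. No such guarantee was given in the 1985 standard for more complex functions, and they are typically only accurate to within the last bit at best. However, the 2008 standard guarantees that conforming implementations will give correctly rounded results which respect the active rounding mode; implementation of the functions, however, is optional.

It turns out that I have run into a case where this feature is actually a hindrance, because the exact result of the exp function is often almost exactly halfway between two consecutive double values (1), and the program then carries out a large amount of further computation, losing up to a factor of 400 (!) in speed: this was actually the explanation of my (ill-asked :-S) Question #43530011.

(1) More precisely, this happens when the argument of exp turns out to be of the form (2 k + 1) × 2^-53 with k a rather small integer (242, for instance). In particular, the computations involved in pow (1. + x, 0.5) tend to call exp with such an argument when x is of the order of magnitude of 2^-44.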
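To see why arguments of this form are the hard cases, here is a short worked expansion. It is my own sketch, not something taken from the library sources. Doubles in [1, 2) are spaced 2^-52 apart, so with a = (2 k + 1) × 2^-53:

```latex
\begin{aligned}
e^{a} &= 1 + a + \tfrac{a^{2}}{2} + O(a^{3}), \qquad a = (2k+1)\,2^{-53},\\
1 + a &= \bigl(1 + k\,2^{-52}\bigr) + 2^{-53}
  \quad\text{(exactly halfway between two consecutive doubles)},\\
\tfrac{a^{2}}{2} &= (2k+1)^{2}\,2^{-107} \approx 2^{-89}
  \quad\text{for } k = 242 .
\end{aligned}
```

So the exact value sits only about 2^-37 ULP above the midpoint, and an implementation that insists on correct rounding needs roughly 90 bits of working precision to decide which way to round.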

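Since a correctly rounded implementation can be very time-consuming in certain circumstances, I guess that the developers will also have devised a way to get a slightly less precise result (say, only accurate to 0.6 ULP or something like that) in a time that is (roughly) bounded for every value of the argument in a given range... (2)

...But how to do this?

(2) What I mean is that I just do not want some exceptional values of the argument, like (2 k + 1) × 2^-53, to be much more time-consuming than most values of the same order of magnitude; of course I do not mind if some exceptional values of the argument are much faster, or if large arguments (in absolute value) need a longer computation time.

Here is a minimal program showing the phenomenon: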

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

int main (void)
{
  int i;
  double a, c;
  c = 0;
  clock_t start = clock ();
  for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
    {
      a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
      c += exp (a); // Just to be sure that the compiler will actually perform the computation of exp (a).
    }
  clock_t stop = clock ();
  printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
  printf ("Clock time spent: %ld\n", (long) (stop - start)); // clock_t is not int, so cast before printing.
  return 0;
}

Now, after gcc -std=c99 program53.c -lm -o program53:

$ ./program53
1.000000e+06
Clock time spent: 13470008
$ ./program53
1.000000e+06
Clock time spent: 13292721
$ ./program53
1.000000e+06
Clock time spent: 13201616

On the other hand, with program52 and program54 (obtained by replacing 0x20000000000000 with 0x10000000000000 and 0x40000000000000 respectively):

$ ./program52
1.000000e+06
Clock time spent: 83594
$ ./program52
1.000000e+06
Clock time spent: 69095
$ ./program52
1.000000e+06
Clock time spent: 54694
$ ./program54
1.000000e+06
Clock time spent: 86151
$ ./program54
1.000000e+06
Clock time spent: 74209
$ ./program54
1.000000e+06
Clock time spent: 78612

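Beware, the phenomenon is implementation-dependent! Apparently, among the common implementations, only those of Debian systems (including Ubuntu) show this phenomenon.

P.-S.: I hope that my question is not a duplicate: I searched thoroughly for a similar question without success, but maybe I did not use the relevant keywords... :-/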

Answer 1

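To answer the general question of why the library functions are required to give correctly rounded results:

Floating-point is hard, and often counterintuitive. Not every programmer has read what they should have. When libraries used to allow some slightly inaccurate rounding, people complained about the precision of the library functions whenever their own inaccurate computations inevitably went wrong and produced nonsense. In response, the library writers made their libraries exactly rounded, so now people cannot shift the blame to them.

In many cases, specific knowledge about floating-point algorithms can produce considerable improvements in accuracy and/or performance, as in the test case:

Taking the exp() of floating-point numbers very close to 0 is problematic, since the result is a number close to 1 while all the precision is in the difference from 1, so the most significant digits are lost. It is more precise (and significantly faster in this test case) to compute exp(x) - 1 through the C math library function expm1(x). If exp() itself is really needed, it is still much faster to compute expm1(x) + 1.

A similar concern exists for computing log(1 + x), for which there is the function log1p(x).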

A quick fix that speeds up the provided test case:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

int main (void)
{
  int i;
  double a, c;
  c = 0;
  clock_t start = clock ();
  for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
    {
      a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
      c += expm1 (a) + 1; // replace exp() with expm1() + 1
    }
  clock_t stop = clock ();
  printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
  printf ("Clock time spent: %ld\n", (long) (stop - start)); // clock_t is not int, so cast before printing
  return 0;
}

For this case, the timings on my machine are:

Code as originally written:
1.000000e+06
Clock time spent: 21543338

Modified code:
1.000000e+06
Clock time spent: 55076

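Programmers with advanced knowledge of the accompanying trade-offs may sometimes consider using approximate results where the precision is not critical.

An experienced programmer may be able to write an approximate implementation of a slow function using methods such as Newton-Raphson or Taylor/Maclaurin polynomials, to use inexactly rounded specialty functions from libraries such as Intel's MKL or AMD's ACML, to relax the compiler's floating-point standard compliance, to reduce precision to ieee754 binary32 (float), or to combine several of these.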

Note that a better description of the problem would make better answers possible.

Answer 2

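This is an "answer"/follow-up to EOF's preceding comments about his trecu() algorithm and code for his "binary tree summation" suggestion. The "prerequisite" for reading this is reading that discussion. It would be nice to collect all of this in one organized place, but I haven't done that yet...

...What I did do was build EOF's trecu() into the test program from the preceding answer, which I had written by modifying the OP's original test program. But then I found that trecu() generated exactly (and I mean exactly) the same answer as the "plain sum" c using exp(), not the sum cm1 using expm1() that we had expected from a more accurate binary tree summation.

But that test program is a bit (maybe two bits :) "convoluted" (or, as EOF said, "unreadable"), so I wrote a separate, smaller test program, given below (with example runs and discussion after it), to test/exercise trecu() on its own. I also wrote the function bintreesum() into the code below, which abstracts/encapsulates the iterative binary tree summation code that I had embedded in the preceding test program. In that preceding case, my iterative code did come close to the cm1 answer, which is why I expected EOF's recursive trecu() to do the same. The long and short of it is that the same thing happens below: bintreesum() remains close to the correct answer, while trecu() gets further away, exactly reproducing the "plain sum".

What we're summing below is just sum(i), i = 1...n, which is the well-known n(n+1)/2. But that's not quite right: to reproduce the OP's problem, the summand is not i but 1 + i*10^e, where the (negative) exponent e can be given on the command line. So for, say, n = 5, you don't get 15 but rather 5.000...00015, or for n = 6 you get 6.000...00021, and so on. To avoid a long, long format, I printf() sum-n to remove the integer part. Okay??? So here's the code...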

/* Quoting from EOF's comment...
   What I (EOF) proposed is effectively a binary tree of additions:
   a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)).
   Like this: Add adjacent pairs of elements, this produces
   a new sequence of n/2 elements.
   Recurse until only one element is left. */
#include <stdio.h>
#include <stdlib.h>

double trecu(double *vals, double sum, int n) {
  int midn = n/2;
  switch (n) {
    case  0: break;
    case  1: sum += *vals; break;
    default: sum = trecu(vals+midn, trecu(vals,sum,midn), n-midn); break; }
  return(sum);
  } /* --- end-of-function trecu() --- */

double bintreesum(double *vals, int n, int binsize) {
  double binsum = 0.0;
  int nbin0 = (n+(binsize-1))/binsize,
      nbin1 = (nbin0+(binsize-1))/binsize,
      nbins[2] = { nbin0, nbin1 };
  double *vbins[2] = {
            (double *)malloc(nbin0*sizeof(double)),
            (double *)malloc(nbin1*sizeof(double)) },
         *vbin0=vbins[0], *vbin1=vbins[1];
  int ibin=0, i;
  for ( i=0; i<nbin0; i++ ) vbin0[i] = 0.0;
  for ( i=0; i<n; i++ ) vbin0[i%nbin0] += vals[i];
  while ( nbins[ibin] > 1 ) {
    int jbin = 1-ibin;        /* other bin, 0<-->1 */
    nbins[jbin] = (nbins[ibin]+(binsize-1))/binsize;
    for ( i=0; i<nbins[jbin]; i++ ) vbins[jbin][i] = 0.0;
    for ( i=0; i<nbins[ibin]; i++ )
      vbins[jbin][i%nbins[jbin]] += vbins[ibin][i];
    ibin = jbin;              /* swap bins for next pass */
    } /* --- end-of-while(nbins[ibin]>1) --- */
  binsum = vbins[ibin][0];
  free((void *)vbins[0]);  free((void *)vbins[1]);
  return ( binsum );
  } /* --- end-of-function bintreesum() --- */

#if defined(TESTTRECU)
#include <math.h>
#define MAXN (2000000)
int main(int argc, char *argv[]) {
  int N       = (argc>1? atoi(argv[1]) : 1000000 ),
      e       = (argc>2? atoi(argv[2]) : -10 ),
      binsize = (argc>3? atoi(argv[3]) : 2 );
  double tens = pow(10.0,(double)e);
  double *vals = (double *)malloc(sizeof(double)*MAXN),
         sum = 0.0;
  /* trecu() and bintreesum() are already defined above, so no redeclaration is needed */
  int i;
  if ( N > MAXN ) N=MAXN;
  for ( i=0; i<N; i++ ) vals[i] = 1.0 + tens*(double)(i+1);
  for ( i=0; i<N; i++ ) sum += vals[i];
  printf(" N=%d, Sum_i=1^N {1.0 + i*%.1e} - N  =  %.8e,\n"
         "\t plain_sum-N  = %.8e,\n"
         "\t trecu-N      = %.8e,\n"
         "\t bintreesum-N = %.8e \n",
         N, tens, tens*((double)N)*((double)(N+1))/2.0,
          sum-(double)N,
         trecu(vals,0.0,N)-(double)N,
         bintreesum(vals,N,binsize)-(double)N );
  } /* --- end-of-function main() --- */
#endif

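So if you save that as trecu.c, compile it with

cc -DTESTTRECU trecu.c -lm -o trecu

and then run it with zero to three optional command-line arguments:

trecu #trials e binsize

The defaults are #trials = 1000000 (like the OP's program), e = -10, and binsize = 2 (so that my bintreesum() function does a binary-tree sum rather than using larger bins).

Here are some test results illustrating the problem described above: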

bash-4.3$ ./trecu              
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-10} - N  =  5.00000500e+01,
         plain_sum-N  = 5.00000500e+01,
         trecu-N      = 5.00000500e+01,
         bintreesum-N = 5.00000500e+01 
bash-4.3$ ./trecu 1000000 -15
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-15} - N  =  5.00000500e-04,
         plain_sum-N  = 5.01087168e-04,
         trecu-N      = 5.01087168e-04,
         bintreesum-N = 5.00000548e-04 
bash-4.3$ 
bash-4.3$ ./trecu 1000000 -16
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-16} - N  =  5.00000500e-05,
         plain_sum-N  = 6.67552231e-05,
         trecu-N      = 6.67552231e-05,
         bintreesum-N = 5.00001479e-05 
bash-4.3$ 
bash-4.3$ ./trecu 1000000 -17
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-17} - N  =  5.00000500e-06,
         plain_sum-N  = 0.00000000e+00,
         trecu-N      = 0.00000000e+00,
         bintreesum-N = 4.99992166e-06

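So you can see that for the default run, e = -10, everybody's doing everything right. That is, the top line that says "Sum" just does the n(n+1)/2 thing, so it presumably displays the right answer, and everybody below it agrees for the default e = -10 test case. But for the e = -15 and e = -16 cases below that, trecu() agrees exactly with plain_sum, while bintreesum stays pretty close to the right answer. And finally, for e = -17, plain_sum and trecu() have "disappeared", while bintreesum() is still hanging in there pretty well.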
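For reference, the value printed on the top "Sum" line can be checked by hand. This is just the closed form that main() evaluates as tens*N*(N+1)/2, written out for the default run:

```latex
\sum_{i=1}^{N}\bigl(1 + i\cdot 10^{e}\bigr) - N
  \;=\; 10^{e}\,\frac{N(N+1)}{2},
\qquad
N = 10^{6},\ e = -10:\quad
10^{-10}\cdot\frac{10^{6}\,(10^{6}+1)}{2} = 50.00005 .
```

That matches the 5.00000500e+01 shown above. For e = -17 the same formula gives 5.000005e-06, and the plain sum's 0 appears to come from each term's excess over 1 being smaller than half the spacing of doubles at the running sum, so it is rounded away at every step.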

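So trecu() is doing the sum correctly all right, but its recursion is apparently not doing the "binary tree" type of thing that my more straightforward iterative bintreesum() apparently does do correctly. And that does demonstrate that EOF's suggestion of "binary tree summation" achieves quite an improvement over plain_sum for these 1 + epsilon kinds of cases. So we'd really like to see his trecu() recursion work! When I originally looked at it, I thought it did work. But that double recursion (is there a special name for that?) in his default: case is apparently more confusing (at least to me :) than I thought. Like I said, it is doing the sum, but not the "binary tree" thing.

Okay, so who'd like to take on the challenge and explain what's going on in that trecu() recursion? And, maybe more importantly, fix it so it does what's intended. Thanks.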
