Question

I have some code that performs many log, tan and cos operations on doubles. I need it to be as fast as possible. Currently I use code such as

#include <stdio.h>
#include <stdlib.h>
#include "mtwist.h"
#include <math.h>


int main(void) {
   int i;
   double x;
   mt_seed();
   double u1;
   double u2;
   double w1;
   double w2;
   x = 0;
   for(i = 0; i < 100000000; ++i) {
     u1 = mt_drand();
     u2 = mt_drand();
     w1 = M_PI*(u1-1/2.0);
     w2 = -log(u2);
     x += tan(w1)*(M_PI_2-w1)+log(w2*cos(w1)/(M_PI_2-w1));
   }
   printf("%f\n",x); 

   return EXIT_SUCCESS;
}

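I am using gcc.

There are two obvious ways to speed this up. The first is to choose a faster RNG. The second is to speed up the transcendental functions. To do this I would like to know

How are tan and cos implemented in assembly on x86? My CPU is the AMD FX-8350, if that makes a difference. (Answered: fcos for cos and fptan for tan.)
How can you use a lookup table to speed up the calculations? I only need 32 bits of accuracy. Can you use a table of size 2^16, for example, to speed up the tan and cos operations?

The Intel optimization manual says

If there is no critical need to evaluate the transcendental functions using the extended precision of 80 bits, applications should consider an alternate, software-based approach, such as a look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance with these techniques by choosing the desired numeric precision and the size of the look-up table, and by taking advantage of the parallelism of the SSE and the SSE2 instructions.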

According to this very helpful table, fcos has a latency of 154 and fptan has a latency of 166-231.

You can compile my code using

gcc -O3 -Wall random.c mtwist-1.5/mtwist.c -lm -o random

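My C code uses the Mersenne Twister RNG C code from here. You should be able to run my code to test it. If you can't, please let me know.

Update: @rhashimoto has sped up my code from 20 seconds to 6 seconds!

It seems the RNG should be possible to speed up. However, in my tests http://www.math.sci.hiroshima-u.ac.jp/~%20m-mat/MT/SFMT/index.html#dSFMT takes exactly the same amount of time (does anyone see something different?). If anyone can find a faster RNG (one that passes all the diehard tests) I would be very grateful.

Please show actual timings for any improvement you suggest, as that really helps determine what does and does not work.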

Answer 1

You can rewrite

tan(w1)*(M_PI_2-w1)+log(w2*cos(w1)/(M_PI_2-w1))

as

tan(w1)*(M_PI_2-w1) + log(cos(w1)/(M_PI_2-w1)) + log(w2).

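You can probably use minimax polynomials for the part that depends on w1 here. Make up 64 or so of them, each covering 1/64 of the range, and you probably only need degree 3 or 4.

You computed w2 as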

w2 = -log(u2);

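for u2 uniform in (0,1). So you are really computing log(log(1/u2)). I bet you can use a similar trick to get piecewise polynomial approximations to log(log(1/x)) on chunks of (0,1). (The function behaves terribly near 0 and 1, so you may need to do something fancy there.)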

Answer 2

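I like @tmyklebu's suggestion of creating a minimax approximation for the overall calculation. There are some nice tools to help with this, including the Remez function approximation toolkit.

You can do better than MT for speed; see, for example, this Dr. Dobbs article: Fast, High-Quality, Parallel Random Number Generators.

Also take a look at the ACML – AMD Core Math Library to take advantage of SSE and SSE2.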

Answer 3

You can try this log(x) replacement that I wrote using SSE2 intrinsics:

#include <assert.h>
#include <immintrin.h>

static __m128i EXPONENT_MASK;
static __m128i EXPONENT_BIAS;
static __m128i EXPONENT_ZERO;
static __m128d FIXED_SCALE;
static __m128d LOG2ERECIP;
static const int EXPONENT_SHIFT = 52;

// Required to initialize constants.
void sselog_init() {
   EXPONENT_MASK = _mm_set1_epi64x(0x7ff0000000000000UL);
   EXPONENT_BIAS = _mm_set1_epi64x(0x00000000000003ffUL);
   EXPONENT_ZERO = _mm_set1_epi64x(0x3ff0000000000000UL);
   FIXED_SCALE = _mm_set1_pd(9.31322574615478515625e-10); // 2^-30
   LOG2ERECIP = _mm_set1_pd(0.693147180559945309417232121459); // 1/log2(e)
}

// Extract IEEE754 double exponent as integer.
static inline __m128i extractExponent(__m128d x) {
   return
      _mm_sub_epi64(
         _mm_srli_epi64(
            _mm_and_si128(_mm_castpd_si128(x), EXPONENT_MASK),
            EXPONENT_SHIFT),
         EXPONENT_BIAS);
}

// Set IEEE754 double exponent to zero.
static inline __m128d clearExponent(__m128d x) {
   return
      _mm_castsi128_pd(
         _mm_or_si128(
            _mm_andnot_si128(
               EXPONENT_MASK,
               _mm_castpd_si128(x)),
            EXPONENT_ZERO));
}

// Compute log(x) using SSE2 intrinsics to >= 30 bit precision, except denorms.
double sselog(double x) {
   assert(x >= 2.22507385850720138309023271733e-308); // no denormalized

   // Two independent logarithms could be computed by initializing
   // base with two different values, either with independent
   // arguments to _mm_set_pd() or from contiguous memory with
   // _mm_load_pd(). No other changes should be needed other than to
   // extract both results at the end of the function (or just return
   // the packed __m128d).

   __m128d base = _mm_set_pd(x, x);
   __m128i iLog = extractExponent(base);
   __m128i fLog = _mm_setzero_si128();

   base = clearExponent(base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   // fLog = _mm_slli_epi64(fLog, 10); // Not needed first time through.
   fLog = _mm_or_si128(extractExponent(base), fLog);

   base = clearExponent(base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   fLog = _mm_slli_epi64(fLog, 10);
   fLog = _mm_or_si128(extractExponent(base), fLog);

   base = clearExponent(base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   base = _mm_mul_pd(base, base);
   fLog = _mm_slli_epi64(fLog, 10);
   fLog = _mm_or_si128(extractExponent(base), fLog);

   // No _mm_cvtepi64_pd() exists so use _mm_cvtepi32_pd() conversion.
   iLog = _mm_shuffle_epi32(iLog, 0x8);
   fLog = _mm_shuffle_epi32(fLog, 0x8);

   __m128d result = _mm_mul_pd(_mm_cvtepi32_pd(fLog), FIXED_SCALE);
   result = _mm_add_pd(result, _mm_cvtepi32_pd(iLog));

   // Convert from base 2 logarithm and extract result.
   result = _mm_mul_pd(result, LOG2ERECIP);
   return ((double *)&result)[0]; // other output in ((double *)&result)[1]
}

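This code implements the algorithm described in this Texas Instruments brief: repeatedly square the argument and concatenate the exponent bits. It will not work with denormalized inputs. It provides at least 30 bits of precision.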

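On one of my machines this runs faster than log(), and on another it runs slower, so your mileage may vary; I don't claim this is necessarily the best approach. However, this code actually computes two logarithms in parallel using both halves of a 128-bit SSE2 word (although the function as-is returns only one result), so it could be adapted into one building block of a SIMD calculation of your entire function (and I think log is the difficult part, as cos is pretty well-behaved). Furthermore, your processor supports AVX, which can pack 4 double-precision elements into a 256-bit word, and extending this code to AVX should be straightforward.

If you choose not to go full SIMD, you could still use both logarithm slots by pipelining, i.e. computing log(w2*cos(w1)/(M_PI_2-w1)) for the current iteration together with log(u2) for the next iteration.

Even if this function benchmarks slower than log in isolation, it may still be worth testing with your actual function. This code does not stress the data cache at all, so it may be friendlier to other code that does (e.g. a cos that uses lookup tables). Also, micro-instruction dispatch could be improved (or not) depending on whether the surrounding code uses SSE.

My other advice (repeated from the comments) would be:

Try -march=native -mtune=native to get gcc to optimize for your CPU.
Avoid calling both tan and cos on the same argument - use sincos or a trig identity.
Consider using a GPU (e.g. OpenCL).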

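It seems better to compute sin instead of cos: the reason is that you can use it for tan_w1 = sin_w1/sqrt(1.0 - sin_w1*sin_w1). Using cos, as I originally suggested, loses the proper sign when computing tan. And as other answerers have said, you can get a good speedup by using a minimax polynomial over [-pi/2, pi/2]. The 7-term function at this link (make sure you get minimaxsin, not taylorsin) seems to work quite well.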

So here is your program rewritten with all SSE2 transcendental approximations:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#include "mtwist.h"

#if defined(__AVX__)
#define VECLEN 4
#elif defined(__SSE2__)
#define VECLEN 2
#else
#error // No SIMD available.
#endif

#if VECLEN == 4
#define VBROADCAST(K) { K, K, K, K };
typedef double vdouble __attribute__((vector_size(32)));
typedef long vlong __attribute__((vector_size(32)));
#elif VECLEN == 2
#define VBROADCAST(K) { K, K };
typedef double vdouble __attribute__((vector_size(16)));
typedef long vlong __attribute__((vector_size(16)));
#endif

static const vdouble FALLBACK_THRESHOLD = VBROADCAST(1.0 - 0.001);

vdouble sse_sin(vdouble x) {
   static const vdouble a0 = VBROADCAST(1.0);
   static const vdouble a1 = VBROADCAST(-1.666666666640169148537065260055e-1);
   static const vdouble a2 = VBROADCAST( 8.333333316490113523036717102793e-3);
   static const vdouble a3 = VBROADCAST(-1.984126600659171392655484413285e-4);
   static const vdouble a4 = VBROADCAST( 2.755690114917374804474016589137e-6);
   static const vdouble a5 = VBROADCAST(-2.502845227292692953118686710787e-8);
   static const vdouble a6 = VBROADCAST( 1.538730635926417598443354215485e-10);

   vdouble xx = x*x;
   return x*(a0 + xx*(a1 + xx*(a2 + xx*(a3 + xx*(a4 + xx*(a5 + xx*a6))))));
}

static inline vlong shiftRight(vlong x, int bits) {
#if VECLEN == 4
   __m128i lo = (__m128i)_mm256_extractf128_si256((__m256i)x, 0);
   __m128i hi = (__m128i)_mm256_extractf128_si256((__m256i)x, 1);
   return (vlong)
      _mm256_insertf128_si256(
         _mm256_castsi128_si256(_mm_srli_epi64(lo, bits)),
         _mm_srli_epi64(hi, bits),
         1);
#elif VECLEN == 2
   return (vlong)_mm_srli_epi64((__m128i)x, bits);
#endif
}

static inline vlong shiftLeft(vlong x, int bits) {
#if VECLEN == 4
   __m128i lo = (__m128i)_mm256_extractf128_si256((__m256i)x, 0);
   __m128i hi = (__m128i)_mm256_extractf128_si256((__m256i)x, 1);
   return (vlong)
      _mm256_insertf128_si256(
         _mm256_castsi128_si256(_mm_slli_epi64(lo, bits)),
         _mm_slli_epi64(hi, bits),
         1);
#elif VECLEN == 2
   return (vlong)_mm_slli_epi64((__m128i)x, bits);
#endif
}

static const vlong EXPONENT_MASK = VBROADCAST(0x7ff0000000000000L);
static const vlong EXPONENT_BIAS = VBROADCAST(0x00000000000003ffL);
static const vlong EXPONENT_ZERO = VBROADCAST(0x3ff0000000000000L);
static const vdouble FIXED_SCALE = VBROADCAST(9.31322574615478515625e-10); // 2^-30
static const vdouble LOG2ERECIP = VBROADCAST(0.6931471805599453094172);
static const int EXPONENT_SHIFT = 52;

// Extract IEEE754 double exponent as integer.
static inline vlong extractExponent(vdouble x) {
   return shiftRight((vlong)x & EXPONENT_MASK, EXPONENT_SHIFT) - EXPONENT_BIAS;
}

// Set IEEE754 double exponent to zero.
static inline vdouble clearExponent(vdouble x) {
   return (vdouble)(((vlong)x & ~EXPONENT_MASK) | EXPONENT_ZERO);
}

// Compute log(x) using SSE2 intrinsics to >= 30 bit precision, except
// denorms.
vdouble sse_log(vdouble base) {
   vlong iLog = extractExponent(base);
   vlong fLog = VBROADCAST(0);

   base = clearExponent(base);
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   fLog = shiftLeft(fLog, 10);
   fLog |= extractExponent(base);

   base = clearExponent(base);
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   fLog = shiftLeft(fLog, 10);
   fLog |= extractExponent(base);

   base = clearExponent(base);
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   base = base*base;
   fLog = shiftLeft(fLog, 10);
   fLog |= extractExponent(base);

   // No _mm_cvtepi64_pd() exists so use 32-bit conversion to double.
#if VECLEN == 4
   __m128i iLogLo = _mm256_extractf128_si256((__m256i)iLog, 0);
   __m128i iLogHi = _mm256_extractf128_si256((__m256i)iLog, 1);
   iLogLo = _mm_srli_si128(_mm_shuffle_epi32(iLogLo, 0x80), 8);
   iLogHi = _mm_slli_si128(_mm_shuffle_epi32(iLogHi, 0x08), 8);

   __m128i fLogLo = _mm256_extractf128_si256((__m256i)fLog, 0);
   __m128i fLogHi = _mm256_extractf128_si256((__m256i)fLog, 1);
   fLogLo = _mm_srli_si128(_mm_shuffle_epi32(fLogLo, 0x80), 8);
   fLogHi = _mm_slli_si128(_mm_shuffle_epi32(fLogHi, 0x08), 8);

   vdouble result = _mm256_cvtepi32_pd(iLogHi | iLogLo) +
      FIXED_SCALE*_mm256_cvtepi32_pd(fLogHi | fLogLo);
#elif VECLEN == 2
   iLog = (vlong)_mm_shuffle_epi32((__m128i)iLog, 0x8);
   fLog = (vlong)_mm_shuffle_epi32((__m128i)fLog, 0x8);

   vdouble result = _mm_cvtepi32_pd((__m128i)iLog) +
      FIXED_SCALE*_mm_cvtepi32_pd((__m128i)fLog);
#endif

   // Convert from base 2 logarithm and extract result.
   return LOG2ERECIP*result;
}

// Original computation.
double fallback(double u1, double u2) {
   double w1 = M_PI*(u1-1/2.0);
   double w2 = -log(u2);
   return tan(w1)*(M_PI_2-w1)+log(w2*cos(w1)/(M_PI_2-w1));
}

int main() {
   static const vdouble ZERO = VBROADCAST(0.0);
   static const vdouble ONE = VBROADCAST(1.0);
   static const vdouble ONE_HALF = VBROADCAST(0.5);
   static const vdouble PI = VBROADCAST(M_PI);
   static const vdouble PI_2 = VBROADCAST(M_PI_2);

   int i,j;
   vdouble x = ZERO;
   for(i = 0; i < 100000000; i += VECLEN) {
      vdouble u1, u2;
      for (j = 0; j < VECLEN; ++j) {
         ((double *)&u1)[j] = mt_drand();
         ((double *)&u2)[j] = mt_drand();
      }

      vdouble w1 = PI*(u1 - ONE_HALF);
      vdouble w2 = -sse_log(u2);

      vdouble sin_w1 = sse_sin(w1);
      vdouble sin2_w1 = sin_w1*sin_w1;

#if VECLEN == 4
      int nearOne = _mm256_movemask_pd(sin2_w1 >= FALLBACK_THRESHOLD);
#elif VECLEN == 2
      int nearOne = _mm_movemask_pd(sin2_w1 >= FALLBACK_THRESHOLD);
#endif
      if (!nearOne) {
#if VECLEN == 4
         vdouble cos_w1 = _mm256_sqrt_pd(ONE - sin2_w1);
#elif VECLEN == 2
         vdouble cos_w1 = _mm_sqrt_pd(ONE - sin2_w1);
#endif
         vdouble tan_w1 = sin_w1/cos_w1;

         x += tan_w1*(PI_2 - w1) + sse_log(w2*cos_w1/(PI_2 - w1));
      }
      else {
         vdouble result;
         for (j = 0; j < VECLEN; ++j)
            ((double *)&result)[j] = fallback(((double *)&u1)[j], ((double *)&u2)[j]);
         x += result;
      }
   }

   double sum = 0.0;
   for (i = 0; i < VECLEN; ++i)
      sum += ((double *)&x)[i];

   printf("%lf\n", sum);
   return 0;
}

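I ran into one annoying problem: the sin approximation error near ±pi/2 can put the value slightly outside [-1, 1], and that (1) makes the computation of tan invalid and (2) causes outsized effects when the log argument is near 0. To avoid this, the code tests whether sin(w1)^2 is "close" to 1, and if it is, it falls back to the original full double-precision path. The definition of "close" is in FALLBACK_THRESHOLD at the top of the program. I arbitrarily set it to 0.999, which still appears to return values in the range of the OP's original program but has little effect on performance.

I have edited the code to use gcc-specific syntax extensions for SIMD. If your compiler does not have these extensions, you can go back through the edit history. If AVX is enabled in the compiler, the code now uses it to process 4 doubles at a time (instead of 2 doubles with SSE2).

The results on my machine, without calling mt_seed() so that the results are repeatable, are: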

Version   Time         Result
original  14.653 secs  -1917488837.945067
SSE        7.380 secs  -1917488837.396841
AVX        6.271 secs  -1917488837.422882

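It makes sense that the SSE/AVX results differ from the original results because of the transcendental approximations. I think you should be able to tweak FALLBACK_THRESHOLD to trade off precision and speed. I am not sure why the SSE and AVX results differ slightly from each other.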

Answer 4

First, a little transformation. This is the original sum:

for(i = 0; i < 100000000; ++i) {
    u1 = mt_drand();
    u2 = mt_drand();
    w1 = M_PI*(u1-1/2.0);
    w2 = -log(u2);
    x += tan(w1)*(M_PI_2-w1)+log(w2*cos(w1)/(M_PI_2-w1));
}

This sum is mathematically equivalent:

for(i = 0; i < 100000000; ++i) {
    u1 = M_PI - mt_drand()* M_PI;
    u2 = mt_drand();
    x += u1 / tan (u1) + log (sin (u1) / u1) + log (- log (u2));
}

Because replacing mt_drand() with 1.0 - mt_drand() should be equivalent, we can let u1 = mt_drand() * M_PI.

for(i = 0; i < 100000000; ++i) {
    u1 = mt_drand()* M_PI;
    u2 = mt_drand();
    x += u1 / tan (u1) + log (sin (u1) / u1) + log (- log (u2));
}
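The equivalence of the original and the transformed summand can be checked term by term. In the derivation below, $u_1$ is the raw uniform draw and $v = \pi - \pi u_1$ is my own shorthand for the quantity that the rewritten loops store in the variable u1; $w_2 = -\log u_2$ as in the question.

```latex
\begin{aligned}
w_1 &= \pi\left(u_1 - \tfrac12\right), \qquad
v = \tfrac{\pi}{2} - w_1 = \pi - \pi u_1, \\
\tan(w_1) &= \tan\!\left(\tfrac{\pi}{2} - v\right) = \frac{1}{\tan v}, \qquad
\cos(w_1) = \cos\!\left(\tfrac{\pi}{2} - v\right) = \sin v, \\
\tan(w_1)\left(\tfrac{\pi}{2} - w_1\right)
  + \log\frac{w_2 \cos(w_1)}{\tfrac{\pi}{2} - w_1}
&= \frac{v}{\tan v} + \log\frac{\sin v}{v} + \log(-\log u_2).
\end{aligned}
```

The first two terms on the right depend only on $v$ and the last only on $u_2$, which is what allows the sum to be separated.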

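This nicely separates the sum into two functions of a random variable that can be handled separately: x += f(u1) + g(u2). Both functions are nicely smooth over long ranges. f is quite smooth for, say, u1 > 0.03, and 1/f is quite smooth for smaller values. g is quite smooth except for values close to 0 or 1. So we could use, let's say, 100 different approximations for the intervals [0 .. 0.01], [0.01 .. 0.02] and so on. Except that picking the right approximation is time consuming.

To solve this: a uniform random function over the interval [0 .. 1] will have a certain number of values in the interval [0 .. 0.01], another number of values in [0.01 .. 0.02], and so on. I think you can calculate how many of the 100,000,000 random numbers fall into the interval [0 .. 0.01] by assuming a normal distribution. Then you calculate how many of the remaining ones fall into [0.01 .. 0.02], and so on. If you calculated that, say, 999,123 numbers fall into [0.00, 0.01], then you produce that number of random numbers in that interval and use the same approximation for all the numbers in that interval.

To find an approximation for f(x) in the interval [0.33 .. 0.34], as an example, you approximate f(0.335 + x/200) for x in [-1 .. 1]. You will get reasonably good results by taking an interpolating polynomial of degree n, interpolating at the Chebyshev nodes xk = cos(pi * (2k - 1) / 2n).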

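By the way, the old x87 trigonometric and logarithmic instructions are slow. They are absolutely no match for evaluating a low-degree polynomial. And with small enough intervals, you don't need a high polynomial degree.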

Answer 5

How does C compute sin() and other math functions?
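Not really feasible. A table with 32 bits of precision (which means you want fixed-point math, not doubles, but I digress) would have to be (2^32)*4 bytes long. You might be able to shrink that somewhat if your "32 bits of precision" refers to the output rather than the input (i.e. an input range of 0 to 2PI represented in < 32 bits, which is the only way you could represent angles beyond 0 to 2PI anyway). That would exceed the address space of non-64-bit computers, and the RAM of many computers.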
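The size arithmetic is easy to reproduce. The helper below (table_bytes is a made-up name) computes the footprint of a direct-lookup table: 2^32 entries of 4 bytes come to 2^34 bytes, i.e. 16 GiB, while the 2^16-entry table of 8-byte doubles mentioned in the question comes to 512 KiB.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a direct-lookup table with 2^index_bits entries. */
uint64_t table_bytes(unsigned index_bits, unsigned bytes_per_entry) {
    assert(index_bits < 64);
    return ((uint64_t)1 << index_bits) * bytes_per_entry;
}
```

For example, table_bytes(32, 4) is 17179869184 and table_bytes(16, 8) is 524288.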

Answer 6

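Like you said, some transcendental functions such as sine, cosine and tangent are available as assembly instructions in the x86 architecture. These are probably how the C library implements sin(), cos(), tan() and friends.

However, I once did some fiddling with these instructions, reimplementing the functions as macros and removing every error check and validation to leave only the bare minimum. Testing against the C library, I recall my macro functions being quite a bit faster. Here is an example of my custom tangent function (pardon the Visual Studio assembly syntax):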

#define machine_tan_d(result, x)\
__asm {\
    fld qword ptr [x]\
    fptan\
    fstp st(0)\
    fstp qword ptr [result]\
}

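So if you are willing to make some assumptions, remove the error handling/validation and make your code platform specific, then you can squeeze out a few cycles by using macro functions like mine.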
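Since the question uses gcc rather than Visual Studio, here is a hedged sketch of what an equivalent might look like with GCC-style extended inline assembly. It relies on the "t" (top of x87 stack) and "u" (second from top) operand constraints, in the same pattern the GCC manual shows for fsincos; the function name machine_tan is my own, and on non-x86 targets the sketch simply falls back to tan(). Like the macro above, it does no argument reduction or error checking, and no speed claim is implied.

```c
#include <assert.h>
#include <math.h>

/* tan(x) through the x87 FPTAN instruction where available. */
double machine_tan(double x) {
    assert(fabs(x) < 9223372036854775808.0); /* FPTAN needs |x| < 2^63 */
#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
    double one, result;
    /* FPTAN replaces st(0) with its tangent and then pushes 1.0, so on
       exit st(0) holds 1.0 and st(1) holds the tangent. */
    __asm__("fptan" : "=t"(one), "=u"(result) : "0"(x));
    (void)one;       /* the pushed 1.0 is not needed */
    return result;
#else
    return tan(x);   /* portable fallback for other targets */
#endif
}
```

Whether this beats the library tan() depends on the CPU and libm, so it should be benchmarked inside the real loop.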

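Now, about the second topic, using a lookup table: I would not be so sure that it is faster just because you would be using integer operations. The integer table would place additional overhead on the data cache, possibly causing more frequent cache misses and a worse execution time than the floating-point operations. But this, of course, can only be concluded with careful profiling and benchmarking.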

Answer 7

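The processor probably implements tan() and cos() as native instructions (hardwired or microcoded) of the x86/87: FPTAN (x87+) and FCOS (387+) (the 87 comes from the original math coprocessor, the Intel 8087).

Ideally, your environment should generate and execute native x87 instructions, namely FCOS and FPTAN (partial tangent). You can save the generated assembly code by running gcc with the -S flag, which explicitly produces assembly language output, and search it for these instructions. If they are absent, verify the use of the flags that make gcc generate code for the correct processor submodel (or the closest one available).

I don't believe there is any SIMD instruction set (MMX, SSE, 3DNow, etc.) that handles functions such as log(), tan() and cos(), so this is not a (direct) option, but SIMD instructions are great for interpolating from previously calculated results or from a table.

Another approach is to try some of the math optimization options available in the GCC compiler, for example -ffast-math, which can be dangerous if you don't understand the implications. The rounding options may be enough if the speed issue is only related to the difference between the x87's native 80-bit extended precision and 64-bit IEEE 754 double-precision numbers.

I would not expect you to easily be able to write an approximation suited to 32-bit floating-point or fixed-point numbers and have it be faster than the native FPU instructions. It is not clear how accurately you need/want to follow your particular distribution curve; as with most things PRNG related, the devil is in the details.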

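While making sure that you are at least using the native assembly floating-point instructions for the basic elementary (transcendental) math functions is a good starting point, the best performance improvement may come from exploiting the data simplification suggested by gnasher729 and tmyklebu in their answers.

Next, creating an approximation of the non-uniform distribution function, as @tmyklebu and others stated in their answers, would be the best approach: use this distribution function to create a minimax approximation (Remez Algorithm). This is not creating approximations of the individual elementary math functions (log, cos, etc.), but creating a single polynomial approximation of the whole distribution mapping function.

Beyond that, I would recommend two books on modern floating-point approaches and algorithms, Handbook of Floating-Point Arithmetic and Elementary Functions, Algorithms and Implementation, 2nd ed., both by Jean-Michel Muller (editor of the second title). The first is more implementation-oriented, while the second is very comprehensive yet still approachable.

Using either of these books, you should be able to understand the precision versus speed trade-offs and write an adequate implementation.

Personally, I do not recommend Computer Approximations by Hart (1968, or the 1978 reprint): it is too dated and too far removed from modern computer hardware, even though it is often recommended and a used or library copy is easy to find. Nor do I recommend Math Toolkit for Real-Time Programming by Jack Crenshaw, which is really oriented toward non-precision embedded applications.

Jack Ganssle has two introductions to approximations for embedded applications, including Approximations for Roots and Exponentials (PDF). While I absolutely would not recommend the given formulas for 32(+)-bit processors, especially if they have an FPU, they are a gentle introduction to the basics.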

Answer 8

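1) "It depends"... more on the compiler than on the architecture of the chip. 2) Back in the old days, it was popular to implement trig functions using the CORDIC method. http://en.wikipedia.org/wiki/CORDIC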

How to speed up tricky random number generation
