如何有效地生成Zipf分布式数字?

时间:2012-04-02 20:18:23

标签: c++ random

我目前正在使用C ++对一些数据结构进行基准测试,我想在处理Zipf分布式数字时测试它们。

我正在使用此网站上提供的生成器:http://www.cse.usf.edu/~christen/tools/toolpage.html

我调整了实现以使用Mersenne Twister生成器。

效果很好,但确实很慢。在我的情况下,范围可以很大(大约一百万),并且生成的随机数的数量可以是几百万。

alpha参数不会随时间变化,它是固定的。

我试图预先计算所有sum_prob。它的速度要快得多,但在大范围内仍然会变慢。

是否有更快的方法来生成Zipf分布式数字?即便是不太精确的东西也会受到欢迎。

由于

6 个答案:

答案 0 :(得分:3)

对于n的每次调用,代码中的以下行都会执行zipf()次:

sum_prob = sum_prob + c / pow((double) i, alpha);

遗憾的是,有必要调用pow()函数,因为在内部,此函数不是一个而是两个泰勒序列[考虑pow(x, alpha) == exp(alpha*log(x))]。如果alpha当然是一个整数,那么你可以通过简单的乘法替换pow()来大大加快代码的速度。如果alpha是一个有理数,那么你可以通过编码Newton-Raphson迭代代替两个泰勒级数来加快代码的速度。如果最后一个条件成立,请告知。

幸运的是,您已表明alpha没有变化。您是否可以通过准备pow((double) i, alpha)的表格来快速增加代码,然后让zipf()查看数字?这样,zipf()根本不需要调用pow()。我怀疑这会节省大量时间。

还有可能进一步改进。如果您从sumprob()中计算出zipf()函数,该怎么办?您是否可以为sumprob()的使用准备一个更具侵略性的查找表?

也许其中一些想法会让你朝着正确的方向前进。看看你不能用它们做什么。

更新:我看到您现在修改的问题可能无法使用此答案。从目前来看,你的问题可能会在复杂变量理论中解决。如你所知,这些问题通常不容易。可能是一个足够聪明的数学家发现了相关的递归关系或某些技巧,如正常分布的Box-Muller技术,但如果是这样,我不熟悉这种技术。祝好运。 (这对你来说可能并不重要,但是,如果确实如此,已故的NN Lebedev 1972年出版的优秀书籍特殊功能及其应用可以从俄语中以廉价的平装版提供英文翻译。如果你真的,真的想破解这个问题,你可能会读下一个Lebedev - 但是,当然,这是一种绝望的措施,不是吗?)

答案 1 :(得分:3)

仅靠预先计算并没有多大帮助。但是很明显,sum_prob是累积的并且具有升序。因此,如果我们使用二进制搜索来找到zipf_value,我们将减少从O(n)到O(log(n))生成Zipf分布数的顺序。这在效率方面有了很大提高。

在这里,只需将zipf()中的genzipf.c函数替换为以下函数:

int zipf(double alpha, int n)
{
  static int first = TRUE;      // Static first time flag
  static double c = 0;          // Normalization constant
  static double *sum_probs;     // Pre-calculated sum of probabilities
  double z;                     // Uniform random number (0 < z < 1)
  int zipf_value;               // Computed exponential value to be returned
  int    i;                     // Loop counter
  int low, high, mid;           // Binary-search bounds

  // Compute normalization constant on first call only
  if (first == TRUE)
  {
    for (i=1; i<=n; i++)
      c = c + (1.0 / pow((double) i, alpha));
    c = 1.0 / c;

    sum_probs = malloc((n+1)*sizeof(*sum_probs));
    sum_probs[0] = 0;
    for (i=1; i<=n; i++) {
      sum_probs[i] = sum_probs[i-1] + c / pow((double) i, alpha);
    }
    first = FALSE;
  }

  // Pull a uniform random number (0 < z < 1)
  do
  {
    z = rand_val(0);
  }
  while ((z == 0) || (z == 1));

  // Map z to the value
  low = 1, high = n, mid;
  do {
    mid = floor((low+high)/2);
    if (sum_probs[mid] >= z && sum_probs[mid-1] < z) {
      zipf_value = mid;
      break;
    } else if (sum_probs[mid] >= z) {
      high = mid-1;
    } else {
      low = mid+1;
    }
  } while (low <= high);

  // Assert that zipf_value is between 1 and N
  assert((zipf_value >=1) && (zipf_value <= n));

  return(zipf_value);
}

答案 2 :(得分:2)

我能找到的唯一的C ++ 11 Zipf随机生成器明确地计算了概率并使用了std::discrete_distribution。这适用于小范围,但如果您需要生成范围非常广的Zipf值(在我的情况下用于数据库测试),则无用,因为它会耗尽内存。所以,我在C ++中实现了下面提到的算法。

我没有严格测试过这段代码,并且可能会进行一些优化,但它只需要恒定的空间并且似乎运行良好。

#include <algorithm>
#include <cmath>
#include <random>

/** Zipf-like random distribution.
 *
 * "Rejection-inversion to generate variates from monotone discrete
 * distributions", Wolfgang Hörmann and Gerhard Derflinger
 * ACM TOMACS 6.3 (1996): 169-184
 */
template<class IntType = unsigned long, class RealType = double>
class zipf_distribution
{
public:
    typedef RealType input_type;
    typedef IntType result_type;

    static_assert(std::numeric_limits<IntType>::is_integer, "");
    static_assert(!std::numeric_limits<RealType>::is_integer, "");

    zipf_distribution(const IntType n=std::numeric_limits<IntType>::max(),
                      const RealType q=1.0)
        : n(n)
        , q(q)
        , H_x1(H(1.5) - 1.0)
        , H_n(H(n + 0.5))
        , dist(H_x1, H_n)
    {}

    IntType operator()(std::mt19937& rng)
    {
        while (true) {
            const RealType u = dist(rng);
            const RealType x = H_inv(u);
            const IntType  k = clamp<IntType>(std::round(x), 1, n);
            if (u >= H(k + 0.5) - h(k)) {
                return k;
            }
        }
    }

private:
    /** Clamp x to [min, max]. */
    template<typename T>
    static constexpr T clamp(const T x, const T min, const T max)
    {
        return std::max(min, std::min(max, x));
    }

    /** exp(x) - 1 / x */
    static double
    expxm1bx(const double x)
    {
        return (std::abs(x) > epsilon)
            ? std::expm1(x) / x
            : (1.0 + x/2.0 * (1.0 + x/3.0 * (1.0 + x/4.0)));
    }

    /** H(x) = log(x) if q == 1, (x^(1-q) - 1)/(1 - q) otherwise.
     * H(x) is an integral of h(x).
     *
     * Note the numerator is one less than in the paper order to work with all
     * positive q.
     */
    const RealType H(const RealType x)
    {
        const RealType log_x = std::log(x);
        return expxm1bx((1.0 - q) * log_x) * log_x;
    }

    /** log(1 + x) / x */
    static RealType
    log1pxbx(const RealType x)
    {
        return (std::abs(x) > epsilon)
            ? std::log1p(x) / x
            : 1.0 - x * ((1/2.0) - x * ((1/3.0) - x * (1/4.0)));
    }

    /** The inverse function of H(x) */
    const RealType H_inv(const RealType x)
    {
        const RealType t = std::max(-1.0, x * (1.0 - q));
        return std::exp(log1pxbx(t) * x);
    }

    /** That hat function h(x) = 1 / (x ^ q) */
    const RealType h(const RealType x)
    {
        return std::exp(-q * std::log(x));
    }

    static constexpr RealType epsilon = 1e-8;

    IntType                                  n;     ///< Number of elements
    RealType                                 q;     ///< Exponent
    RealType                                 H_x1;  ///< H(x_1)
    RealType                                 H_n;   ///< H(n)
    std::uniform_real_distribution<RealType> dist;  ///< [H(x_1), H(n)]
};

答案 3 :(得分:1)

与此同时,基于拒绝反转采样的方法更快,请参阅代码here

答案 4 :(得分:1)

作为上面给出的非常好的拒绝-反转实现的补充,这是一个C ++类,具有相同的API,对于少数垃圾箱来说,更简单,更快捷。在我的机器上,对于N = 300,它的速度约为2.3倍。它更快,因为它执行直接表查找,而不是计算日志和能力。该表吞噬了高速缓存,但是...根据我CPU的d-cache的大小进行猜测,我想也许上面给出的正确的拒绝-反转算法对于N = 35K左右的速度会更快。另外,初始化表需要为每个bin调用std::pow(),因此仅当从中提取超过N个值时,这才有性能。否则,拒绝反转会更快。明智地选择。

(我已经设置了API,因此它看起来很像std :: c ++标准委员会可能想出的东西。)

/**
 * Example usage:
 *
 *    std::random_device rd;
 *    std::mt19937 gen(rd());
 *    zipf_table_distribution<> zipf(300);
 *
 *    for (int i = 0; i < 100; i++)
 *        printf("draw %d %d\n", i, zipf(gen));
 */
template<class IntType = unsigned long, class RealType = double>
class zipf_table_distribution
{
   public:
      typedef IntType result_type;

      static_assert(std::numeric_limits<IntType>::is_integer, "");
      static_assert(!std::numeric_limits<RealType>::is_integer, "");

      /// zipf_table_distribution(N, s)
      /// Zipf distribution for `N` items, in the range `[1,N]` inclusive.
      /// The distribution follows the power-law 1/n^s with exponent `s`.
      /// This uses a table-lookup, and thus provides values more
      /// quickly than zipf_distribution. However, the table can take
      /// up a considerable amount of RAM, and initializing this table
      /// can consume significant time.
      zipf_table_distribution(const IntType n,
                              const RealType q=1.0) :
         _n(init(n,q)),
         _q(q),
         _dist(_pdf.begin(), _pdf.end())
      {}
      void reset() {}

      IntType operator()(std::mt19937& rng)
      {
         return _dist(rng);
      }

      /// Returns the parameter the distribution was constructed with.
      RealType s() const { return _q; }
      /// Returns the minimum value potentially generated by the distribution.
      result_type min() const { return 1; }
      /// Returns the maximum value potentially generated by the distribution.
      result_type max() const { return _n; }

   private:
      std::vector<RealType>               _pdf;  ///< Prob. distribution
      IntType                             _n;    ///< Number of elements
      RealType                            _q;    ///< Exponent
      std::discrete_distribution<IntType> _dist; ///< Draw generator

      /** Initialize the probability mass function */
      IntType init(const IntType n, const RealType q)
      {
         _pdf.reserve(n+1);
         _pdf.emplace_back(0.0);
         for (IntType i=1; i<=n; i++)
            _pdf.emplace_back(std::pow((double) i, -q));
         return n;
      }
};

答案 5 :(得分:1)

这里的版本比drobilla的原始文章快2倍,并且它还支持非零变形参数q(又名Hurwicz q,q系列q或量子组变形q),并更改表示法以符合数论教科书中的标准用法。经过严格测试;请参阅https://github.com/opencog/cogutil/blob/master/tests/util/zipfUTest.cxxtest

上的单元测试

双重许可证MIT许可证或Gnu Affero,请根据需要复制到C ++标准中。

/**
 * Zipf (Zeta) random distribution.
 *
 * Implementation taken from drobilla's May 24, 2017 answer to
 * https://stackoverflow.com/questions/9983239/how-to-generate-zipf-distributed-numbers-efficiently
 *
 * That code is referenced with this:
 * "Rejection-inversion to generate variates from monotone discrete
 * distributions", Wolfgang Hörmann and Gerhard Derflinger
 * ACM TOMACS 6.3 (1996): 169-184
 *
 * Note that the Hörmann & Derflinger paper, and the stackoverflow
 * code base incorrectly names the paramater as `q`, when they mean `s`.
 * Thier `q` has nothing to do with the q-series. The names in the code
 * below conform to conventions.
 *
 * Example usage:
 *
 *    std::random_device rd;
 *    std::mt19937 gen(rd());
 *    zipf_distribution<> zipf(300);
 *
 *    for (int i = 0; i < 100; i++)
 *        printf("draw %d %d\n", i, zipf(gen));
 */

template<class IntType = unsigned long, class RealType = double>
class zipf_distribution
{
   public:
      typedef IntType result_type;

      static_assert(std::numeric_limits<IntType>::is_integer, "");
      static_assert(!std::numeric_limits<RealType>::is_integer, "");

      /// zipf_distribution(N, s, q)
      /// Zipf distribution for `N` items, in the range `[1,N]` inclusive.
      /// The distribution follows the power-law 1/(n+q)^s with exponent
      /// `s` and Hurwicz q-deformation `q`.
      zipf_distribution(const IntType n=std::numeric_limits<IntType>::max(),
                        const RealType s=1.0,
                        const RealType q=0.0)
         : n(n)
         , _s(s)
         , _q(q)
         , oms(1.0-s)
         , spole(abs(oms) < epsilon)
         , rvs(spole ? 0.0 : 1.0/oms)
         , H_x1(H(1.5) - h(1.0))
         , H_n(H(n + 0.5))
         , cut(1.0 - H_inv(H(1.5) - h(1.0)))
         , dist(H_x1, H_n)
      {
         if (-0.5 >= q)
            throw std::runtime_error("Range error: Parameter q must be greater than -0.5!");
      }
      void reset() {}

      IntType operator()(std::mt19937& rng)
      {
         while (true)
         {
            const RealType u = dist(rng);
            const RealType x = H_inv(u);
            const IntType  k = std::round(x);
            if (k - x <= cut) return k;
            if (u >= H(k + 0.5) - h(k))
               return k;
         }
      }

      /// Returns the parameter the distribution was constructed with.
      RealType s() const { return _s; }
      /// Returns the Hurwicz q-deformation parameter.
      RealType q() const { return _q; }
      /// Returns the minimum value potentially generated by the distribution.
      result_type min() const { return 1; }
      /// Returns the maximum value potentially generated by the distribution.
      result_type max() const { return n; }


   private:
      IntType    n;     ///< Number of elements
      RealType   _s;    ///< Exponent
      RealType   _q;    ///< Deformation
      RealType   oms;   ///< 1-s
      bool       spole; ///< true if s near 1.0
      RealType   rvs;   ///< 1/(1-s)
      RealType   H_x1;  ///< H(x_1)
      RealType   H_n;   ///< H(n)
      RealType   cut;   ///< rejection cut
      std::uniform_real_distribution<RealType> dist;  ///< [H(x_1), H(n)]

      // This provides 16 decimal places of precision,
      // i.e. good to (epsilon)^4 / 24 per expanions log, exp below.
      static constexpr RealType epsilon = 2e-5;

      /** (exp(x) - 1) / x */
      static double
      expxm1bx(const double x)
      {
         if (std::abs(x) > epsilon)
            return std::expm1(x) / x;
         return (1.0 + x/2.0 * (1.0 + x/3.0 * (1.0 + x/4.0)));
      }

      /** log(1 + x) / x */
      static RealType
      log1pxbx(const RealType x)
      {
         if (std::abs(x) > epsilon)
            return std::log1p(x) / x;
         return 1.0 - x * ((1/2.0) - x * ((1/3.0) - x * (1/4.0)));
      }
      /**
       * The hat function h(x) = 1/(x+q)^s
       */
      const RealType h(const RealType x)
      {
         return std::pow(x + _q, -_s);
      }

      /**
       * H(x) is an integral of h(x).
       *     H(x) = [(x+q)^(1-s) - (1+q)^(1-s)] / (1-s)
       * and if s==1 then
       *     H(x) = log(x+q) - log(1+q)
       *
       * Note that the numerator is one less than in the paper
       * order to work with all s. Unfortunately, the naive
       * implementation of the above hits numerical underflow
       * when q is larger than 10 or so, so we split into
       * different regimes.
       *
       * When q != 0, we shift back to what the paper defined:

       *    H(x) = (x+q)^{1-s} / (1-s)
       * and for q != 0 and also s==1, use
       *    H(x) = [exp{(1-s) log(x+q)} - 1] / (1-s)
       */
      const RealType H(const RealType x)
      {
         if (not spole)
            return std::pow(x + _q, oms) / oms;

         const RealType log_xpq = std::log(x + _q);
         return log_xpq * expxm1bx(oms * log_xpq);
      }

      /**
       * The inverse function of H(x).
       *    H^{-1}(y) = [(1-s)y + (1+q)^{1-s}]^{1/(1-s)} - q
       * Same convergence issues as above; two regimes.
       *
       * For s far away from 1.0 use the paper version
       *    H^{-1}(y) = -q + (y(1-s))^{1/(1-s)}
       */
      const RealType H_inv(const RealType y)
      {
         if (not spole)
            return std::pow(y * oms, rvs) - _q;

         return std::exp(y * log1pxbx(oms * y)) - _q;
      }
};