How do I randomly sample lognormal data in Python using the inverse CDF and specified target percentiles?

Time: 2017-03-21 18:53:35

Tags: python random statistics probability-density cdf

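I'm trying to generate random samples from a lognormal distribution in Python; the application is simulating network traffic. I'd like to generate samples such that:

  1. The modal sample result is 320 (~10^2.5)
  2. 80% of the samples lie within the range 100 to 1000 (10^2 to 10^3)

My strategy is to use the inverse CDF (or Smirnov transform, I believe):

  1. Use the PDF for a normal distribution centred around 2.5 to calculate the PDF for 10^x, where x ~ N(2.5, sigma).
  2. Calculate the CDF for the above distribution.
  3. Generate random uniform data along the interval 0 to 1.
  4. Use the inverse CDF to transform the random uniform data into the required range.

The problem is, when I calculate the 10th and 90th percentiles at the end, I get completely the wrong numbers.

Here is my code: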

      %matplotlib inline
      
      import matplotlib
      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
      import scipy.stats
      from scipy.stats import norm
      
      # find value of mu and sigma so that 80% of data lies within range 2 to 3
      mu=2.505
      sigma = 1/2.505
      norm.ppf(0.1, loc=mu,scale=sigma),norm.ppf(0.9, loc=mu,scale=sigma)
      # output: (1.9934025, 3.01659743)
      
      # Generate normal distribution PDF
      x = np.arange(16,128000, 16) # linearly spaced here, with extra range so that CDF is correctly scaled
      x_log = np.log10(x)
      mu=2.505
      sigma = 1/2.505
      y = norm.pdf(x_log,loc=mu,scale=sigma)
      fig, ax = plt.subplots()
      ax.plot(x_log, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
      
      x2 = (10**x_log) # x2 should be linearly spaced, so that cumsum works (later)
      fig, ax = plt.subplots()
      ax.plot(x2, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
      ax.set_xlim(0,2000)
      
      # Calculate CDF
      y_CDF = np.cumsum(y) / np.cumsum(y).max()
      fig, ax = plt.subplots()
      ax.plot(x2, y_CDF, 'r-', lw=2, alpha=0.6, label='norm pdf')
      ax.set_xlim(0,8000)
      
      # Generate random uniform data
      input = np.random.uniform(size=10000)
      
      # Use CDF as lookup table
      traffic = x2[np.abs(np.subtract.outer(y_CDF, input)).argmin(0)]
      
      # Discard highs and lows
      traffic = traffic[(traffic >= 32) & (traffic <= 8000)]
      
      # Check percentiles
      np.percentile(traffic,10),np.percentile(traffic,90)
      

This produces the output:

      (223.99999999999997, 2480.0000000000009)
      

...instead of the (100, 1000) that I was hoping to see. Any suggestions appreciated!
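One likely cause of the mismatch is that the code above evaluates a density in log10(x) on a grid that is linearly spaced in x and then sums it directly. A change of variables needs the Jacobian factor 1/(x ln 10); without it, the cumulative sum over-weights large x. The sketch below is a minimal illustration of that correction, not a definitive fix: the grid spacing of 1, the upper limit, and the use of `np.interp` as the inverse lookup are assumptions made here for clarity.

```python
import numpy as np
from scipy.stats import norm

mu = 2.505          # mean of log10(x), as in the code above
sigma = 1 / 2.505   # standard deviation of log10(x), as in the code above

# Fine, linearly spaced grid (spacing and range are illustrative assumptions)
x = np.arange(1.0, 100001.0, 1.0)

# Density of X = 10**Z with Z ~ N(mu, sigma):
#   f_X(x) = f_Z(log10(x)) / (x * ln(10))
pdf_x = norm.pdf(np.log10(x), loc=mu, scale=sigma) / (x * np.log(10))

# Numerical CDF on the linear grid, normalised to end at 1
cdf = np.cumsum(pdf_x)
cdf /= cdf[-1]

# Inverse-CDF lookup: map uniform draws through the tabulated CDF
rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
traffic = np.interp(u, cdf, x)

p10, p90 = np.percentile(traffic, [10, 90])
print(p10, p90)
```

With the Jacobian included, the 10th and 90th percentiles should land close to 100 and 1000. Note that 10^mu is the median of this distribution rather than its mode, so the mode sits below 320; matching the mode as well requires solving for both parameters.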

2 answers:

Answer 0: (score: 2)

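First, I'm not sure about "Use the PDF for a normal distribution centred around 2.5". After all, the log-normal is defined in terms of the base-e logarithm (a.k.a. the natural logarithm), which means 320 ≈ 10^2.5 ≈ e^5.77.

Second, I would approach the problem in a different way. You need m and s to sample from the Log-Normal distribution.

If you look at the wiki article linked above, you can see that it is a two-parameter distribution. And you have exactly two conditions: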

Mode = exp(m - s*s) = 320
80% samples in [100,1000] => CDF(1000,m,s) - CDF(100,m,s) = 0.8

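where the CDF is expressed via the error function (a pretty common function found in any library).

So you have two non-linear equations for two parameters. Solve them, find m and s, and plug them into any standard log-normal sampling routine.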
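As a rough sketch of that recipe, the mode condition can be used to eliminate m, leaving one equation in s. The lognormal CDF is written here through `math.erf` as CDF(x, m, s) = 0.5 * (1 + erf((ln x - m) / (s * sqrt(2)))). The helper names, the root bracket [0.1, 2.0], and the choice of `scipy.optimize.brentq` are assumptions made for illustration; any one-dimensional root finder would do.

```python
import math
import numpy as np
from scipy.optimize import brentq

MODE, LO, HI, MASS = 320.0, 100.0, 1000.0, 0.8

def lognormal_cdf(x, m, s):
    """Lognormal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf((math.log(x) - m) / (s * math.sqrt(2.0))))

def mass_residual(s):
    """Probability mass in [LO, HI] minus the target, with m fixed by the mode."""
    m = math.log(MODE) + s * s          # from Mode = exp(m - s*s)
    return lognormal_cdf(HI, m, s) - lognormal_cdf(LO, m, s) - MASS

# The residual is positive for small s and negative for large s,
# so this (assumed) bracket contains a sign change.
s = brentq(mass_residual, 0.1, 2.0)
m = math.log(MODE) + s * s

# Plug m and s into a standard lognormal sampler
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=m, sigma=s, size=100_000)
frac = np.mean((samples >= LO) & (samples <= HI))
print(f"m={m:.4f}, s={s:.4f}, fraction in [100, 1000]={frac:.3f}")
```

Here `rng.lognormal` takes the mean and standard deviation of the underlying normal distribution in natural-log units, which matches the m and s used in the two conditions.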

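Answer 1: (score: 2)

Severin's approach is much leaner than my original attempt using the Smirnov transform. This is the code that worked for me (using fsolve to find s, although it is quite trivial to do it manually):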

# Find lognormal distribution, with mode at 320 and 80% of probability mass between 100 and 1000
# Use fsolve to find the roots of the non-linear equation

%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

from scipy.optimize import fsolve
from scipy.stats import lognorm
import math

target_modal_value = 320

# Define function to find roots of
def equation(s):

    # From Wikipedia: Mode = exp(m - s*s) = 320
    m = math.log(target_modal_value) + s**2

    # Get probability mass from CDF at 100 and 1000, should equal to 0.8.
    # Rearrange the equation so that it equals 0, to find the root (value of s).
    # np.exp is used because fsolve passes s as a one-element array.
    scale = np.exp(m)
    return lognorm.cdf(1000, s=s, scale=scale) - lognorm.cdf(100, s=s, scale=scale) - 0.8

# Solve non-linear equation to find s
# fsolve returns an array, so take its first element to get a scalar root
s_initial_guess = 1
s = fsolve(equation, s_initial_guess)[0]

# From s, find m
m = math.log(target_modal_value) + s**2
print('m='+str(m)+', s='+str(s)) #(m,s))

# Plot
x = np.arange(0,2000,1)
y = lognorm.pdf(x,s=s, scale=math.exp(m))
fig, ax = plt.subplots()
ax.plot(x, y, 'r-', lw=5, alpha=0.6, label='lognorm pdf')
plt.plot((100,100), (0,1), 'k--')
plt.plot((320,320), (0,1), 'k-.')
plt.plot((1000,1000), (0,1), 'k--')
plt.ylim(0,0.0014)
plt.savefig('lognormal_100_320_1000.png')

Lognormal distribution with mode at 320
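To close the loop with the inverse-CDF idea from the question, the fitted parameters can be fed into the distribution's closed-form inverse CDF. The sketch below re-solves for s the same way as the code above and then maps uniform draws through `lognorm.ppf`; the function name `mass_error`, the seed, and the sample size are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import fsolve
from scipy.stats import lognorm

target_mode = 320.0

def mass_error(s):
    # Mode = exp(m - s*s)  =>  m = ln(mode) + s*s
    m = np.log(target_mode) + s**2
    scale = np.exp(m)
    return lognorm.cdf(1000, s=s, scale=scale) - lognorm.cdf(100, s=s, scale=scale) - 0.8

s = float(fsolve(mass_error, 1.0)[0])
m = float(np.log(target_mode) + s**2)

# Inverse transform sampling: uniform draws pushed through the inverse CDF
rng = np.random.default_rng(42)
u = rng.uniform(size=200_000)
samples = lognorm.ppf(u, s=s, scale=np.exp(m))

inside = np.mean((samples >= 100) & (samples <= 1000))
p10, p90 = np.percentile(samples, [10, 90])
print(f"s={s:.4f}, m={m:.4f}")
print(f"fraction in [100, 1000]: {inside:.3f}")
print(f"10th and 90th percentiles: {p10:.0f}, {p90:.0f}")
```

With the mode pinned at 320, the 20% of probability mass outside [100, 1000] is split unevenly: roughly 1% falls below 100 and roughly 19% above 1000. The 10th and 90th percentiles therefore both lie above 100 and 1000 respectively, even though 80% of the samples are inside the target range.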