Question

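I am trying to generate random samples from a log-normal distribution in Python; the application is simulating network traffic. I would like to generate samples such that:

The modal sample result is 320 (~10^2.5)

80% of the samples lie within the range 100 to 1000 (10^2 to 10^3)

My strategy is to use the inverse CDF (or Smirnov transform, I believe):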

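Use the PDF for a normal distribution centred around 2.5 to calculate the PDF of 10^x, where x ~ N(2.5, sigma).
Calculate the CDF for the above distribution.
Generate random uniform data along the interval 0 to 1.
Use the inverse CDF to transform the random uniform data into the required range.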
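
For reference, the last two steps (uniform draws pushed through an inverse CDF) can be sketched with SciPy's closed-form inverse CDF, `lognorm.ppf`, instead of a hand-built lookup table. The values of `s` and `m` below are hypothetical placeholders chosen only for illustration; they are not the parameters that satisfy the targets above.

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(0)

# Hypothetical parameters of the underlying normal (natural-log scale).
s = 0.5   # standard deviation of ln(X)
m = 6.0   # mean of ln(X)

# Step 3: uniform random numbers on [0, 1).
u = rng.uniform(size=100_000)

# Step 4: inverse CDF (percent-point function) maps them to log-normal samples.
samples = lognorm.ppf(u, s=s, scale=np.exp(m))

# The empirical percentiles should be close to the theoretical ones.
p10, p90 = np.percentile(samples, [10, 90])
expected_p10, expected_p90 = lognorm.ppf([0.1, 0.9], s=s, scale=np.exp(m))
print(p10, expected_p10)
print(p90, expected_p90)
```

Comparing empirical and theoretical percentiles this way is a quick check that the transform itself is behaving before worrying about the parameter values.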

The problem is that when I calculate the 10th and 90th percentiles at the end, I get completely the wrong numbers.

Here is my code:

%matplotlib inline

import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import norm

# find value of mu and sigma so that 80% of data lies within range 2 to 3
mu=2.505
sigma = 1/2.505
norm.ppf(0.1, loc=mu,scale=sigma),norm.ppf(0.9, loc=mu,scale=sigma)
# output: (1.9934025, 3.01659743)

# Generate normal distribution PDF
x = np.arange(16,128000, 16) # linearly spaced here, with extra range so that CDF is correctly scaled
x_log = np.log10(x)
mu=2.505
sigma = 1/2.505
y = norm.pdf(x_log,loc=mu,scale=sigma)
fig, ax = plt.subplots()
ax.plot(x_log, y, 'r-', lw=5, alpha=0.6, label='norm pdf')

x2 = (10**x_log) # x2 should be linearly spaced, so that cumsum works (later)
fig, ax = plt.subplots()
ax.plot(x2, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
ax.set_xlim(0,2000)

# Calculate CDF
y_CDF = np.cumsum(y) / np.cumsum(y).max()
fig, ax = plt.subplots()
ax.plot(x2, y_CDF, 'r-', lw=2, alpha=0.6, label='norm pdf')
ax.set_xlim(0,8000)

# Generate random uniform data
input = np.random.uniform(size=10000)

# Use CDF as lookup table
traffic = x2[np.abs(np.subtract.outer(y_CDF, input)).argmin(0)]

# Discard highs and lows
traffic = traffic[(traffic >= 32) & (traffic <= 8000)]

# Check percentiles
np.percentile(traffic,10),np.percentile(traffic,90)

This produces the output:

(223.99999999999997, 2480.0000000000009)

...instead of the (100, 1000) that I would like to see. Any advice is appreciated!

Answer 1

First, I'm not sure about "Use the PDF for a normal distribution centred around 2.5". After all, the log-normal is defined in terms of the base-e logarithm (a.k.a. the natural log), which means 320 = 10^2.5 = e^5.77.
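
As a small worked check of that conversion (a sketch, with variable names chosen here for illustration): a location given on the log10 scale becomes a natural-log location by multiplying by ln(10), and the same factor applies to the spread.

```python
import math

mu_log10 = 2.5                  # location on the base-10 log scale
sigma_log10 = 1 / 2.505         # spread used in the question, base-10 scale

# ln(x) = log10(x) * ln(10), so both parameters scale by ln(10).
mu_ln = mu_log10 * math.log(10)
sigma_ln = sigma_log10 * math.log(10)

print(round(mu_ln, 2))          # natural-log location
print(math.exp(mu_ln), 10 ** mu_log10)
```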

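Second, I would approach the problem in a different way. You need m and s to sample from the log-normal.

If you look at the wiki article on the log-normal distribution, you will see that it is a two-parameter distribution. And you have exactly two conditions: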

Mode = exp(m - s*s) = 320
80% samples in [100,1000] => CDF(1000,m,s) - CDF(100,m,s) = 0.8

where the CDF is expressed via the error function (a very common function, found in pretty much any library).
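
A minimal sketch of that CDF using only the standard library's `math.erf`; the function name `lognormal_cdf` and the example values of `m` and `s` are illustrative assumptions, not part of any library.

```python
import math

def lognormal_cdf(x, m, s):
    """CDF of a log-normal whose underlying normal has mean m and std s."""
    return 0.5 * (1.0 + math.erf((math.log(x) - m) / (s * math.sqrt(2.0))))

# Hypothetical parameters, only to exercise the function.
m, s = 6.0, 0.5

# The median of a log-normal is exp(m), so the CDF there is 0.5.
median_cdf = lognormal_cdf(math.exp(m), m, s)

# Probability mass between 100 and 1000 for these example parameters.
mass = lognormal_cdf(1000, m, s) - lognormal_cdf(100, m, s)
print(median_cdf, mass)
```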

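That gives two non-linear equations for two parameters. Solve them, find m and s, and plug them into any standard log-normal sampling routine.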

Answer 2

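Severin's approach is much leaner than my original attempt using the Smirnov transform. This is the code that worked for me (using fsolve to find s, although it is quite simple to do it by hand):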

# Find lognormal distribution, with mode at 320 and 80% of probability mass between 100 and 1000
# Use fsolve to find the roots of the non-linear equation

%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

from scipy.optimize import fsolve
from scipy.stats import lognorm
import math

target_modal_value = 320

# Define function to find roots of
def equation(s):

    # From Wikipedia: Mode = exp(m - s*s) = 320
    m = math.log(target_modal_value) + s**2

    # Get probability mass from CDF between 100 and 1000; it should equal 0.8.
    # Rearrange the equation so that it equals 0, to find the root (value of s).
    # np.exp is used because fsolve passes s as a one-element array.
    scale = np.exp(m)
    return lognorm.cdf(1000, s=s, scale=scale) - lognorm.cdf(100, s=s, scale=scale) - 0.8

# Solve non-linear equation to find s
s_initial_guess = 1
s = fsolve(equation, s_initial_guess)[0]  # fsolve returns an array; take the scalar root

# From s, find m
m = math.log(target_modal_value) + s**2
print('m=' + str(m) + ', s=' + str(s))

# Plot
x = np.arange(0,2000,1)
y = lognorm.pdf(x, s=s, scale=np.exp(m))
fig, ax = plt.subplots()
ax.plot(x, y, 'r-', lw=5, alpha=0.6, label='lognorm pdf')
plt.plot((100,100), (0,1), 'k--')
plt.plot((320,320), (0,1), 'k-.')
plt.plot((1000,1000), (0,1), 'k--')
plt.ylim(0,0.0014)
plt.savefig('lognormal_100_320_1000.png')
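
The code above only fits and plots the distribution. To actually draw samples and check them, something like the following sketch should work. It is self-contained, so it re-solves for s; it uses `scipy.optimize.brentq` rather than `fsolve`, on the assumption that the root lies in the bracket [0.1, 2.0], and samples with NumPy's `Generator.lognormal`. Note that the targets constrain the mode and the mass inside [100, 1000], so the 10th and 90th percentiles need not land exactly on 100 and 1000.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import lognorm

target_mode = 320.0

def mass_error(s):
    # Mode = exp(m - s^2)  =>  m = ln(mode) + s^2
    m = np.log(target_mode) + s**2
    scale = np.exp(m)
    # Zero when exactly 80% of the probability mass lies in [100, 1000].
    return lognorm.cdf(1000, s=s, scale=scale) - lognorm.cdf(100, s=s, scale=scale) - 0.8

# Assumed bracket: the expression is positive at s=0.1 and negative at s=2.0.
s = brentq(mass_error, 0.1, 2.0)
m = np.log(target_mode) + s**2

# Draw samples; mean and sigma refer to the underlying normal of ln(X).
rng = np.random.default_rng(42)
traffic = rng.lognormal(mean=m, sigma=s, size=200_000)

fraction_in_range = np.mean((traffic >= 100) & (traffic <= 1000))
print('m=%.4f, s=%.4f, fraction in [100, 1000]=%.4f' % (m, s, fraction_in_range))
```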

How to randomly sample log-normal data in Python using the inverse CDF and specify target percentiles?
