我一直在寻找使用scipy中的三次样条函数来平滑一些概率密度函数(PDF),并且我遇到问题时scipy.interpolate.InterpolatedUnivariateSpline
或scipy.interpolate.interp1d
没有通过所有结小的y值。
import math
import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.interpolate import InterpolatedUnivariateSpline, interp1d
def main():
breadth = 0.60
stepsize = 0.01
mean = 0.0
stdev = 0.02
steps = 2. * breadth / stepsize + 1
retRange = np.linspace(-breadth, breadth, num=steps)
probs = {}
for ret in retRange:
retup = ret + stepsize / 2.
retdown = ret - stepsize / 2.
probup = norm.cdf(retup, loc=mean, scale=stdev)
probdown = norm.cdf(retdown, loc=mean, scale=stdev)
prob = probup - probdown
probs[ret] = prob
probs = pd.Series(probs, name='PDF')
probs = probs / probs.sum()
spl = InterpolatedUnivariateSpline(probs.index, probs, k=3, ext='zeros')
tmp = pd.Series(index=retRange, data=retRange, name='SmoothedPDF').map(spl)
out = pd.concat([probs, tmp], axis=1)
out['diff'] = out['PDF'] - out['SmoothedPDF']
print out
if __name__ == '__main__':
main()
问题是输出DataFrame显示差异。我可以替换interp1d
代替InterpolateUnivariateSpline
并获得相同的问题(但结果不同)。问题是值得注意的是,结点处的插值y值的差异可能大于原始的y值。
PDF SmoothedPDF diff
-0.60 8.670691e-195 -2.650272e-50 2.650272e-50
-0.59 2.245130e-188 2.120217e-49 -2.120217e-49
-0.58 4.528786e-182 -9.806006e-49 9.806006e-49
-0.57 7.116703e-176 3.816392e-48 -3.816392e-48
-0.56 8.712391e-170 -1.431147e-47 1.431147e-47
-0.55 8.309255e-164 5.342948e-47 -5.342948e-47
-0.54 6.173881e-158 -1.994065e-46 1.994065e-46
-0.53 3.573809e-152 7.441963e-46 -7.441963e-46
-0.52 1.611710e-146 -2.777379e-45 2.777379e-45
-0.51 5.662799e-141 1.036532e-44 -1.036532e-44
-0.50 1.550138e-135 -3.868390e-44 3.868390e-44
-0.49 3.306066e-130 1.443703e-43 -1.443703e-43
-0.48 5.493659e-125 -5.387972e-43 5.387972e-43
-0.47 7.112603e-120 2.010818e-42 -2.010818e-42
-0.46 7.174975e-115 -7.504477e-42 7.504477e-42
-0.45 5.639566e-110 2.800709e-41 -2.800709e-41
-0.44 3.453932e-105 -1.045239e-40 1.045239e-40
-0.43 1.648294e-100 3.900884e-40 -3.900884e-40
-0.42 6.129407e-96 -1.455830e-39 1.455830e-39
-0.41 1.776139e-91 5.433231e-39 -5.433231e-39
-0.40 4.010714e-87 -2.027709e-38 2.027709e-38
-0.39 7.057745e-83 7.567514e-38 -7.567514e-38
-0.38 9.678846e-79 -2.824235e-37 2.824235e-37
-0.37 1.034450e-74 1.054019e-36 -1.054019e-36
-0.36 8.616667e-71 -3.933652e-36 3.933652e-36
-0.35 5.594107e-67 1.468059e-35 -1.468059e-35
-0.34 2.830755e-63 -5.478870e-35 5.478870e-35
-0.33 1.116539e-59 2.044742e-34 -2.044742e-34
-0.32 3.432949e-56 -7.631082e-34 7.631082e-34
-0.31 8.228222e-53 2.847958e-33 -2.847958e-33
... ... ... ...
0.31 0.000000e+00 -7.461634e-41 7.461634e-41
0.32 0.000000e+00 -3.730817e-41 3.730817e-41
0.33 0.000000e+00 -5.022254e-42 5.022254e-42
0.34 0.000000e+00 3.407958e-42 -3.407958e-42
0.35 0.000000e+00 8.968310e-43 -8.968310e-43
0.36 0.000000e+00 -2.578389e-43 2.578389e-43
0.37 0.000000e+00 -1.177091e-43 1.177091e-43
0.38 0.000000e+00 9.809089e-45 -9.809089e-45
0.39 0.000000e+00 -3.152922e-45 3.152922e-45
0.40 0.000000e+00 -4.816963e-46 4.816963e-46
0.41 0.000000e+00 -6.568587e-47 6.568587e-47
0.42 0.000000e+00 4.926440e-47 -4.926440e-47
0.43 0.000000e+00 9.579189e-48 -9.579189e-48
0.44 0.000000e+00 3.763253e-48 -3.763253e-48
0.45 0.000000e+00 5.131708e-49 -5.131708e-49
0.46 0.000000e+00 -1.924391e-49 1.924391e-49
0.47 0.000000e+00 -1.603659e-49 1.603659e-49
0.48 0.000000e+00 -2.405488e-50 2.405488e-50
0.49 0.000000e+00 -1.202744e-50 1.202744e-50
0.50 0.000000e+00 -1.586954e-51 1.586954e-51
0.51 0.000000e+00 -2.923336e-52 2.923336e-52
0.52 0.000000e+00 -3.132146e-53 3.132146e-53
0.53 0.000000e+00 2.610122e-54 -2.610122e-54
0.54 0.000000e+00 3.915183e-54 -3.915183e-54
0.55 0.000000e+00 -1.631326e-54 1.631326e-54
0.56 0.000000e+00 -1.386627e-54 1.386627e-54
0.57 0.000000e+00 -2.650905e-55 2.650905e-55
0.58 0.000000e+00 1.835242e-55 -1.835242e-55
0.59 0.000000e+00 4.078315e-56 -4.078315e-56
0.60 0.000000e+00 0.000000e+00 0.000000e+00
[121 rows x 3 columns]
是的,我正在研究一些涉及非常低概率的事情,因此规模确实(有时)很小。
如果这是一个精确的问题,我可以做一些重新调整,但在此之前,我想确认这是问题。
作为旁注,使用线性插值(k = 1)不会显示此问题。
答案 0 :(得分:0)
问题似乎是您尝试使用全局插值的样条插值器,这非常适合您的输入。由于指数截止,由于双精度,你的很多值都是零;在此范围内,样条插值器会为您提供噪声。
这里是问题的数字表示,通过在半对数刻度上绘制PDF及其样条插值(请注意,如果PDF正好为零,则会在对数图中得到缺失点):
import matplotlib.pyplot as plt
plt.figure()
plt.semilogy(out.index, out.PDF, 'r-', out.index, out.SmoothedPDF.abs(), 'bo--',
markerfacecolor='none') # note the abs for log
plt.legend(['PDF', 'SmoothedPDF (univariate spline)'])
plt.show()
首先,我需要在插值上调用.abs()
以绘制负(我们重新插入非零函数!)。其次,正如我已经指出的那样,x = 0.2以上的范围应该是缺失的,因为这里的函数值正好为零。相反,我们的噪声大约为1e-50,就像小x限制一样。总而言之,我们所看到的是,如果您的数据小于1e-30,那么您就不会有机会。
我的第一个想法是使用griddata
,但它显示了类似的错误结果(即使在这种情况下数字错误要小得多)。那么如何使用本地插值器呢?输入piecewise cubic Hermite spline。使用稍微修改的代码版本,以演示插值和数值稳定性:
import math
import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.interpolate import pchip_interpolate
import matplotlib.pyplot as plt
def main():
breadth = 0.60
stepsize = 0.01
mean = 0.0
stdev = 0.02
steps = 2. * breadth / stepsize + 1
bigRange = np.linspace(-breadth, breadth, num=5*steps)
retRange = bigRange[::5]
probs = {}
for ret in retRange:
retup = ret + stepsize / 2.
retdown = ret - stepsize / 2.
probup = norm.cdf(retup, loc=mean, scale=stdev)
probdown = norm.cdf(retdown, loc=mean, scale=stdev)
prob = probup - probdown
probs[ret] = prob
probs = pd.Series(probs, name='PDF')
probs = probs / probs.sum()
tmp_big = pd.Series(index=bigRange,
data=pchip_interpolate(probs.index.values, probs.values, bigRange),
name='SmoothedPDF_big') # denser interpolated case
tmp_check = pd.Series(index=retRange,
data=pchip_interpolate(probs.index.values, probs.values, retRange),
name='SmoothedPDF') # base points for checking stability
out = pd.concat([probs, tmp_check], axis=1)
out['diff'] = out['PDF'] - out['SmoothedPDF']
return out,tmp_big
if __name__ == "__main__":
out,pchipout = main()
plt.figure()
plt.semilogy(out.index, out.PDF, 'r-', pchipout.index, pchipout.abs().values, 'bo--',
markerfacecolor='none')
plt.legend(['PDF', 'SmoothedPDF (piecewise cubic Hermit spline)'])
plt.figure()
plt.plot(out.index, out.PDF, 'r-', pchipout.index, pchipout.values, 'bo--',
markerfacecolor='none')
plt.legend(['PDF', 'SmoothedPDF (piecewise cubic Hermit spline)'])
plt.show()
稳定性证明:
>>> out['diff'].any()
False
误差范围的对数图:
如您所见,协议几乎是完美的。请注意原始数据正好为0时明显缺少值。在您需要学习的小x范围内,低于1e-100的值仍然存在噪音。
插值结果的实际图表: