Question

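I have been trying to fit a curve in R but have run into some problems. I am working with several large data sets of x and y coordinates. When plotted with ggplot's geom_point or any other plotting function, the points tend to follow a shape resembling a square root function.

Here is the code I used to fit with geom_smooth: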

plt = ggplot(data = data2, aes(x = x, y = y)) + geom_point() +geom_smooth()

This basically gives me the following:

Plot with Curve

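Is there a way to make the curve look more like the red square root curve (y = x ^ 0.5), basically making it smoother and fitting it to a certain formula? Here is a minimal data set used as an example.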

Example Data set CSV format

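I have also tried setting the method to loess, which gives a curve close to what I want. However, loess does not seem to work for much larger data sets (about 500,000-700,000 points) or when many points are densely packed in one region. The fit tends to be skewed toward the average there, which makes sense because the large number of points in that region pulls it. But I need to fit the curve and force it to stay close to a square root curve. I have also tried playing with the span value, but that does not really affect the smoothness of the curve.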

Answer 1

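One thing that comes to mind is the following. Your best graph is probably evaluated by minimizing chi-square. You can add an extra criterion to this, namely how much the fit deviates from square root behaviour. This can be done by fitting the solution with sqrt() and adding a weighted chi-square of that fit to the overall evaluation of the fit quality. I don't know how to do this in R, but in Python you get something like this: the blue graph is the best sqrt() fit. The yellow one is the best quadratic spline with knots [0,0,.1,.2,.3,.4,.6,.9,.9,.9], i.e. weight=0 (you could additionally optimize the knot positions, which is not done here). The remaining curves weight the importance of being sqrt()-like with weights = 0.5, 1, 2, respectively.

The code is as follows: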
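Written out, the quantity minimized in this first approach can be sketched as below. The symbols are shorthand introduced here for illustration and do not appear in the answer: $S_c$ is the quadratic spline with coefficients $c$, $w$ is the weight, and $a^*(c)$ is the scale of the sqrt curve that best fits the spline itself. This is meant to mirror the residual vector returned by mixed_res, whose sum of squares leastsq minimizes.

```latex
\chi^2_{\text{total}}(c)
  = \sum_i \bigl(y_i - S_c(x_i)\bigr)^2
  + w^2 \sum_i \bigl(a^*(c)\,\sqrt{x_i} - S_c(x_i)\bigr)^2,
\qquad
a^*(c) = \arg\min_a \sum_i \bigl(a\,\sqrt{x_i} - S_c(x_i)\bigr)^2
```

For $w = 0$ the second sum drops out and the expression reduces to the ordinary spline fit.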

import matplotlib
# matplotlib.use('Qt4Agg')  # backend the author used; uncomment or change it to suit your setup

from matplotlib import pyplot as plt
import numpy as np
from scipy.optimize import leastsq,curve_fit

### B-spline basis taken from the scipy doc page, as I have scipy 0.16 and no built-in BSpline yet
def B(x, k, i, t):
    if k == 0:
        return 1.0 if t[i] <= x < t[i+1] else 0.0
    if t[i+k] == t[i]:
        c1 = 0.0
    else:
        c1 = (x - t[i])/(t[i+k] - t[i]) * B(x, k-1, i, t)
    if t[i+k+1] == t[i+1]:
        c2 = 0.0
    else:
        c2 = (t[i+k+1] - x)/(t[i+k+1] - t[i+1]) * B(x, k-1, i+1, t)
    return c1 + c2


def bspline(x, t, c, k):
    n = len(t) - k - 1
    assert (n >= k+1) and (len(c) >= n)
    return sum(c[i] * B(x, k, i, t) for i in range(n))


def mixed_res(params,points,weight):
    [xList,yList] = zip(*points)
    bSplList=[bspline(x,[0,0,.1,.2,.3,.4,.6,.9,.9,.9],params,2) for x in xList]
    ###standard chisq
    diffTrue=[y-b for y,b in zip(yList,bSplList)]
    ###how good can the spline be fitted with sqrt
    locfit,_=curve_fit(sqrtfunc,xList,bSplList)
    sqrtList=[sqrtfunc(x,locfit[0]) for x in xList]
    diffWeight=[ weight*(s-b) for s,b in zip(sqrtList,bSplList)]
    return diffTrue+diffWeight

def sqrtfunc(x,a):
    return a*np.sqrt(x)


xList,yList=np.loadtxt("PHOQSTACK.csv", unpack=True, delimiter=',')
xListSorted=sorted(xList)
zipData=list(zip(xList,yList))  # a list, because mixed_res iterates over it on every call (zip is one-shot in Python 3)

fig=plt.figure(1)
ax=fig.add_subplot(1,1,1)

knotList=[0,0,.1,.2,.3,.4,.6,.9,.9,.9]
order=2

sqrtvalues,_=curve_fit(sqrtfunc,xList,yList)
th_sqrt_y=[sqrtfunc(x,sqrtvalues[0]) for x in xListSorted]

ax.scatter(xList,yList,s=1)
ax.plot(xListSorted,th_sqrt_y)

fitVals=[.2,.3,.4,.2,.3,.4,.2]
for s in [0,.5,1,2]:
    print(s)
    fitVals,ier=leastsq(mixed_res,fitVals,args=( zipData, s ) )
    th_b_y=[bspline(x,knotList,fitVals,order) for x in xListSorted]
    ax.plot(xListSorted,th_b_y)

plt.show()

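The problem is that for larger weights the fit is busier making the shape sqrt-like than fitting the actual data, and it may run into convergence problems.

A second option is to make the sqrt part of the fit directly and include its relative contribution as part of the chi-square. The blue and yellow graphs are as before. The others have the same weights as above.

For this I changed the residual function to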
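For reference, the pure a*sqrt(x) fit (the blue curve) does not need an iterative optimizer, which may matter for the 500,000-700,000 point data sets mentioned in the question. The least-squares scale has a closed form, a = sum(y*sqrt(x)) / sum(x). The sketch below is not part of the answer: the function name fit_sqrt_scale and the synthetic demo data are assumptions for illustration only.

```python
import numpy as np

def fit_sqrt_scale(x, y):
    """Closed-form least-squares estimate of a in y ~ a*sqrt(x) (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    root = np.sqrt(x)
    # Setting d/da sum((y - a*root)**2) = 0 gives
    # a = sum(y*root) / sum(root**2) = sum(y*root) / sum(x)
    return np.dot(y, root) / np.sum(x)

# Synthetic stand-in for a large data set (not the data from the question)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0.0, 0.9, size=500_000)
y_demo = 2.0 * np.sqrt(x_demo) + rng.normal(scale=0.05, size=x_demo.size)

a_hat = fit_sqrt_scale(x_demo, y_demo)
print(a_hat)
```

This should agree with curve_fit(sqrtfunc, xList, yList) up to numerical precision, since both minimize the same sum of squares.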

def mixed_res(params,points,weight):
    a=params[0]
    coffs=params[1:]
    [xList,yList] = zip(*points)
    sqrtList=[a*np.sqrt(x) for x in xList]
    bSplList=[bspline(x,[0,0,.1,.2,.3,.4,.6,.9,.9,.9],coffs,2) for x in xList]
    diffTrue=[y-s-b for y,s,b in zip(yList,sqrtList,bSplList)]
    diffWeight=[ weight*(s-b)/(s+.001) for s,b in zip(sqrtList,bSplList)]

    return diffTrue+diffWeight

and called the fit with

fitVals=[.4]+[.2,.3,.4,.2,.3,.4,.4]
for s in [0,.5,1,2]:
    print(s)
    fitVals,ier=leastsq(mixed_res,fitVals,args=( zipData, s ) )
    th_b_y=[fitVals[0]*np.sqrt(x)+bspline(x,knotList,fitVals[1:],order) for x in xListSorted]
    ax.plot(xListSorted,th_b_y)

The remaining big question is: how do you decide which weight to take? What does "more like a square root" mean?
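One possible way to make "more like a square root" concrete is to measure how far a fitted curve is from its own best a*sqrt(x) approximation, relative to the size of the curve. The answer does not define such a metric; the function name sqrt_deviation and the choice of a relative RMS measure are assumptions for illustration only.

```python
import numpy as np

def sqrt_deviation(x, curve):
    """Return (a, relative RMS deviation of curve from a*sqrt(x)); hypothetical helper."""
    x = np.asarray(x, dtype=float)
    curve = np.asarray(curve, dtype=float)
    root = np.sqrt(x)
    # Best-fitting scale of the sqrt curve to the given curve (closed-form least squares)
    a = np.dot(curve, root) / np.dot(root, root)
    resid = curve - a * root
    # RMS of the residual divided by the RMS of the curve: 0 means exactly sqrt-shaped
    rel_rms = np.sqrt(np.mean(resid ** 2)) / np.sqrt(np.mean(curve ** 2))
    return a, rel_rms

x_demo = np.linspace(0.01, 0.9, 200)
a_sqrt, dev_sqrt = sqrt_deviation(x_demo, 3.0 * np.sqrt(x_demo))  # exactly sqrt-shaped
a_line, dev_line = sqrt_deviation(x_demo, x_demo)                 # straight line
print(dev_sqrt, dev_line)
```

With something like this, one could evaluate th_b_y for each weight and take the smallest weight whose deviation falls under a chosen tolerance. The tolerance itself remains a judgement call.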

Fitting a curve to an equation in R
