Question

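I am trying to use the bootstrap method to obtain confidence intervals on some parameters. However, I have a small problem: even when I use 3000 samples, my confidence intervals vary a lot from run to run.

Here is the situation:

I have a dataset of about 300 points, defined in the usual way as y = f(x). I know the model that fits the data, so I use curve_fit to find the parameters and I am trying to build a confidence interval for each of them. I tried to combine the methods described here: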

confidence interval with leastsq fit in scipy python

and here:

http://www.variousconsequences.com/2010/02/visualizing-confidence-intervals.html

Here is the code I use:

import numpy as np
from scipy.optimize import curve_fit

def model(t, Vs, Vi, k):

    """
    Fitting model, following a Burst kinetics.
    t is the time
    Vs is the steady velocity
    Vi is the initial velocity
    k is the Burst rate constant
    """

    y = Vs * t - ((Vs - Vi) * (1 - np.exp(-k * t)) / k)

    return y



[some code]

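# random_integers draws from the closed interval [low, high]; it is deprecated in newer NumPy releases
bootindex = np.random.random_integers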
nboot = 3000


local_t = np.array(local_t)
local_fluo = np.array(local_fluo)
concentration = np.array(concentration)

#Initializing time values in hours
local_scaled_t = [ index /3600 for index in local_t ]
local_scaled_t = np.array(local_scaled_t)

conc_produit = [ concentration[0] - value_conc for value_conc in concentration ]
conc_produit = np.array(conc_produit)

popt, pcov = curve_fit(model, local_scaled_t, conc_produit, maxfev=3000)
popt = [ popt[0] / 3600, popt[1] / 3600 , popt[2] / 3600 ]

ymod = list()
for each in local_t:
        ymod.append(model(each, popt[0], popt[1], popt[2]))
ymod = np.array(ymod)

r = conc_produit - ymod

list_para = list()

# loop over n bootstrap samples from the resids 
for i in range(nboot): 

    pc, pout = curve_fit(model, local_scaled_t, ymod + r[bootindex(0, len(r)-1, len(r))], maxfev=3000) 
    pc = [ pc[0] / 3600, pc[1] / 3600 , pc[2] / 3600 ]

    list_para.append(pc)

    ymod = list()
    for each in local_t:
            ymod.append(model(each, pc[0], pc[1], pc[2]))
    ymod = np.array(ymod)

list_para = np.array(list_para)

mean_params = np.mean(list_para,0)
std_params = np.std(list_para,0)

print(popt)
for true_para, para, std in zip(popt, mean_params, std_params):
    print("{0} between {1} and {2}".format(round(true_para, 6), round(para - std * 1.95996, 6), round(para + std * 1.95996, 6)))

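Nothing complicated here; just note that I rescale the time to normalize my data and get better parameters.

Finally, here are two outputs from the same code: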

[1.9023455671995163e-05, 0.01275941716148471, 0.026540319119773129]
1.9e-05 between 1.6e-05 and 2.1e-05
0.012759 between -0.042697 and 0.092152
0.02654 between -0.073456 and 0.159983

[1.9023455671995163e-05, 0.01275941716148471, 0.026540319119773129]
1.9e-05 between 1.5e-05 and 2.9e-05
0.012759 between -0.116499 and 0.17112
0.02654 between -0.186011 and 0.27797

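As you can see, the difference is quite large. Is that expected, or am I doing something wrong? For example, I don't really understand why I have to multiply the standard deviation by 1.95996.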

Answer 1

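Your curve_fit already gives you the covariance matrix, namely pout. The 95% confidence limits for the i-th parameter are pc[i]-1.95996*sqrt(pout[i,i]) and pc[i]+1.95996*sqrt(pout[i,i]). 1.95996 is the value x for which the cumulative distribution function of the standard normal distribution satisfies F(x) = 0.975. You can use scipy.stats.norm.ppf to get confidence intervals at other levels. See the wiki: http://en.wikipedia.org/wiki/1.96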
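
A minimal sketch of that calculation, assuming synthetic data generated from the question's model. The "true" parameter values, the noise level, the starting guess p0, and the helper name normal_ci are made up for illustration and are not part of the original post:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm


def model(t, Vs, Vi, k):
    # Burst-kinetics model from the question.
    return Vs * t - (Vs - Vi) * (1 - np.exp(-k * t)) / k


def normal_ci(popt, pcov, level=0.95):
    """Normal-approximation confidence limits from curve_fit's covariance (hypothetical helper)."""
    z = norm.ppf(0.5 + level / 2.0)   # two-sided quantile, about 1.95996 for level=0.95
    se = np.sqrt(np.diag(pcov))       # standard error of each parameter
    return popt - z * se, popt + z * se


# Synthetic data; these "true" values are assumptions for illustration only.
rng = np.random.default_rng(0)
t = np.linspace(0.01, 5.0, 300)
y = model(t, 1.0, 5.0, 2.0) + rng.normal(scale=0.05, size=t.size)

popt, pcov = curve_fit(model, t, y, p0=[1.0, 4.0, 1.5], maxfev=3000)
lower, upper = normal_ci(popt, pcov)
for name, p, lo, hi in zip(("Vs", "Vi", "k"), popt, lower, upper):
    print("{0}: {1:.4f} between {2:.4f} and {3:.4f}".format(name, p, lo, hi))
```

Passing level=0.99 (or any other level) to normal_ci changes the multiplier through norm.ppf, which is where the 1.95996 factor in the question comes from.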

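Bootstrap will not give you the same (or sometimes even a close) answer every time you run it. For your particular function, a very small number of early data points have a large influence on the fit (see: Solve equation with a set of points). I am not sure that bootstrap is the way to go here, because if those few early data points are not sampled, the fit will be very different from the fit to the original data. This also explains why your bootstrap intervals differ so much from one another.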
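
To get a feel for how much of the run-to-run spread is resampling noise, a residual bootstrap can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the helper residual_bootstrap and its arguments are hypothetical names, the fitted curve and residuals from the original fit are held fixed for every resample (the loop in the question overwrites ymod on each iteration), and percentile limits are used in place of mean ± 1.96·std.

```python
import numpy as np
from scipy.optimize import curve_fit


def model(t, Vs, Vi, k):
    # Burst-kinetics model from the question.
    return Vs * t - (Vs - Vi) * (1 - np.exp(-k * t)) / k


def residual_bootstrap(t, y, p0, nboot=200, seed=0):
    """Refit on (fitted curve + resampled residuals); hypothetical helper."""
    rng = np.random.default_rng(seed)
    popt, _ = curve_fit(model, t, y, p0=p0, maxfev=3000)
    fitted = model(t, *popt)          # kept fixed across all resamples
    resid = y - fitted
    params = []
    for _ in range(nboot):
        # Draw residual indices with replacement (upper bound is exclusive).
        idx = rng.integers(0, resid.size, size=resid.size)
        try:
            pc, _ = curve_fit(model, t, fitted + resid[idx], p0=popt, maxfev=3000)
        except RuntimeError:
            continue                  # skip resamples where the optimizer did not converge
        params.append(pc)
    return popt, np.array(params)


# Synthetic data; the "true" values below are assumptions for illustration only.
rng = np.random.default_rng(1)
t = np.linspace(0.01, 5.0, 300)
y = model(t, 1.0, 5.0, 2.0) + rng.normal(scale=0.05, size=t.size)

popt, boot = residual_bootstrap(t, y, p0=[1.0, 4.0, 1.5])
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)   # percentile limits per parameter
for name, p, lo, hi in zip(("Vs", "Vi", "k"), popt, lower, upper):
    print("{0}: {1:.4f} between {2:.4f} and {3:.4f}".format(name, p, lo, hi))
```

Rerunning with different seed values shows how stable the limits are for a given nboot. On real data where a handful of early points drive the fit, the spread may well stay large, for the reason given above.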

Bootstrap method and confidence interval