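Two-Gaussian mixture fit swaps parameter values. Global optimization

Time: 2014-10-15 12:18:42

Tags: python numpy

I have data that I want to fit with two Gaussians while keeping one mean shared (global) across all data sets. I have written a Python program using the scipy, lmfit and numpy libraries. These are the results of the fits I already have (least squares):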

mean1   sd1     A1      mean2   sd2     A2      y0
12.24   10.20   27526   25.50   20.42   30642   499.93
21.43   10.20   27529   25.51   20.39   30616   500.32
25.51   20.40   30599   30.61   10.21   27552   500.16
39.80   10.20   27536   25.52   20.42   30636   499.85
25.51   20.41   30616   48.98   10.21   27559   499.94

My fitting function:

y0 + sqrt(2/PI)*A1/w1*exp(-2*(x-xc1)^2/w1^2) + sqrt(2/PI)*A2/w2*exp(-2*(x-xc2)^2/w2^2)
Sorry, I don't know how to write this as a proper math formula. Here xc1 and xc2 are the means and w1 and w2 are the sd values in the tables.

This is a test, so the correct answer has to be:

    mean1   sd1 A1      mean2   sd2 A2      y0
1   12      10  27000   25      20  30000   500
2   21      10  27000   25      20  30000   500
3   30      10  27000   25      20  30000   500
4   39      10  27000   25      20  30000   500
5   48      10  27000   25      20  30000   500

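As you can see, the fit works well when each data set is fitted independently. The problem is that my fitting program sometimes swaps the parameter values of the first and second Gaussian. This means that if I now try to keep mean2 the same for every data set, it will go wrong, because the 3rd and 5th data sets are swapped, so mean2 will be incorrect (but I'm not sure). For this example, mean2 must always be 25. The problem is even worse with real data. Basically, as I understand it, my function is f = y0 + gauss1 + gauss2 and both Gaussians have the same form, so the fitter sees no difference between gauss1 and gauss2 and sometimes mixes them up.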
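
The label symmetry described above can be checked numerically. The following is a minimal sketch, not part of the original script: it reuses the script's `gauss` definition, while the `model` helper and the integer grid `x` are assumptions made for illustration. It evaluates the model with the expected values of data set 1, then again with the two components exchanged.

```python
import numpy as np


def gauss(x, mean, sd, A):
    """Gaussian in the same form as the script's gauss()."""
    return np.sqrt(2 / np.pi) * A / sd * np.exp(-2 * ((x - mean) / sd) ** 2)


def model(x, mean1, sd1, A1, mean2, sd2, A2, y0):
    """Hypothetical helper: offset plus two Gaussians."""
    return y0 + gauss(x, mean1, sd1, A1) + gauss(x, mean2, sd2, A2)


x = np.arange(50)  # assumed grid, for illustration only

# Expected values for data set 1, then the same values with labels exchanged.
original = model(x, 12, 10, 27000, 25, 20, 30000, 500)
swapped = model(x, 25, 20, 30000, 12, 10, 27000, 500)

# Both labelings describe the same curve, so the residuals are identical.
same = np.allclose(original, swapped)
print(same)  # True
```

Because both labelings give the same residuals, a least-squares fit has no reason to prefer one over the other unless the starting values or bounds break the tie.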

Output of the global fit:

mean1   sd1     A1      mean2   sd2     A2      y0
12.28   10.31   28483   25.90   19.77   29169   508.60
21.42   10.42   29148   25.90   20.51   28746   505.21
30.61   9.99    26045   25.90   20.26   32149   499.46
39.84   10.11   26605   25.90   21.44   33000   475.15
48.87   9.49    25000   25.90   23.00   33000   485.45

Test data used (tab separated):

321 759 568 567 567 567
322 877 587 585 585 585
323 1033    610 606 606 606
324 1231    639 632 632 632
325 1471    675 662 662 662
326 1745    721 697 697 697
327 2043    780 737 737 737
328 2346    855 782 782 782
329 2632    954 833 833 833
330 2877    1080    889 889 889
331 3061    1241    951 949 949
332 3168    1440    1017    1014    1014
333 3194    1682    1089    1083    1083
334 3142    1962    1166    1154    1154
335 3025    2275    1250    1226    1226
336 2863    2605    1341    1298    1298
337 2676    2933    1442    1369    1369
338 2485    3236    1558    1437    1437
339 2308    3488    1691    1500    1500
340 2155    3668    1848    1558    1556
341 2031    3759    2031    1608    1605
342 1936    3756    2243    1651    1644
343 1865    3662    2482    1686    1673
344 1812    3490    2739    1715    1691
345 1770    3261    3003    1740    1697
346 1734    2997    3255    1764    1691
347 1697    2722    3473    1794    1673
348 1657    2453    3633    1836    1645
349 1611    2204    3716    1896    1606
350 1560    1983    3710    1983    1560
351 1501    1791    3611    2099    1506
352 1437    1628    3425    2245    1450
353 1369    1490    3168    2418    1393
354 1298    1372    2863    2605    1341
355 1226    1269    2533    2790    1299
356 1154    1177    2202    2953    1274
357 1083    1095    1891    3071    1274
358 1014    1021    1613    3126    1306
359 949 952 1376    3103    1376
360 889 890 1180    3000    1488
361 833 833 1024    2821    1641
362 782 782 903 2582    1831
363 737 737 810 2301    2043
364 697 697 740 2003    2261
365 662 662 686 1711    2461
366 632 632 645 1440    2621
367 606 606 613 1205    2718
368 585 585 588 1011    2739
369 567 567 569 859 2679

My script (uncomment the marked section for the global fit):

import numpy as np
import matplotlib.pyplot as plt
from lmfit import minimize, Parameters, report_fit
# python 3.3
# Unofficial Windows Binaries for Python Extension Packages
# http://www.lfd.uci.edu/~gohlke/pythonlibs/
# VARIABLES
show_plot = 1
size_cols = 11
size_rows = 50
nm_start = 320
data_sets = 5
file_name = "5_testas.txt"
intens = [[[0] for i in range(size_cols)] for j in range(size_rows)]
with open(file_name) as f:
    for row in range (0, size_rows):
        datal = f.readline();
        data = datal.split();
        col = 0;
        for datab in data:
          intens[row][col] = datab;
          col = col+1;
#def gauss(x, amp, cen, sigma):
#    "basic gaussian"
def gauss(x, mean, sd, A):
    "basic gaussian"
    return np.sqrt(2/np.pi)*A/sd*np.exp(-2*np.power(((x-mean)/sd), 2))
def gauss_dataset(params, i, x):
    """calc gaussian from params for data set i
    using simple, hardwired naming convention"""
    mean1 = params['mean1_%i' % (i+1)].value
    sd1 = params['sd1_%i' % (i+1)].value
    A1 = params['A1_%i' % (i+1)].value
    mean2 = params['mean2_%i' % (i+1)].value
    sd2 = params['sd2_%i' % (i+1)].value
    A2 = params['A2_%i' % (i+1)].value
    y0 = params['y0_%i' % (i+1)].value
    return y0 + gauss(x, mean1, sd1, A1) + gauss(x, mean2, sd2, A2)
def gauss_dataset_a(params, i, x):
    """calc gaussian from params for data set i
    using simple, hardwired naming convention"""
    mean1 = params['mean1_%i' % (i+1)].value
    sd1 = params['sd1_%i' % (i+1)].value
    A1 = params['A1_%i' % (i+1)].value
    mean2 = params['mean2_%i' % (i+1)].value
    sd2 = params['sd2_%i' % (i+1)].value
    A2 = params['A2_%i' % (i+1)].value
    y0 = params['y0_%i' % (i+1)].value
    return y0 + gauss(x, mean1, sd1, A1)
def gauss_dataset_b(params, i, x):
    """calc gaussian from params for data set i
    using simple, hardwired naming convention"""
    mean1 = params['mean1_%i' % (i+1)].value
    sd1 = params['sd1_%i' % (i+1)].value
    A1 = params['A1_%i' % (i+1)].value
    mean2 = params['mean2_%i' % (i+1)].value
    sd2 = params['sd2_%i' % (i+1)].value
    A2 = params['A2_%i' % (i+1)].value
    y0 = params['y0_%i' % (i+1)].value
    return y0 + gauss(x, mean2, sd2, A2)

def objective(params, x, data):
    """ calculate total residual for fits to several data sets held
    in a 2-D array, and modeled by Gaussian functions"""
    ndata, nx = data.shape
    resid = 0.0*data[:]
    # make residual per data set
    for i in range(ndata):
        resid[i, :] = data[i, :] - gauss_dataset(params, i, x)
    # now flatten this to a 1D array, as minimize() needs
    return resid.flatten()

x  = np.linspace(0, 50, 50)
data = []
# dummy data
for i in np.arange(data_sets):
    dat   = gauss(x, 1, 1, 1)
    data.append(dat)

# data has shape
data = np.array(data)

# Rearrange data, exclude 1st set.
for col in range(0, data_sets):
    for row in range (0, size_rows):
        data[col][row] = intens[row][col+1]

# create 5 sets of parameters, one per data set
fit_params = Parameters()
for iy, y in enumerate(data):
    fit_params.add( 'mean1_%i' % (iy+1), value=26.0, min=0.0,  max=50.0)
    fit_params.add( 'mean2_%i' % (iy+1), value=26.0, min=0.0,  max=50.0)
    fit_params.add( 'A1_%i' % (iy+1), value=28500.0, min=25000.0, max=33000.0)
    fit_params.add( 'A2_%i' % (iy+1), value=28500.0, min=25000.0, max=33000.0)
    fit_params.add( 'sd1_%i' % (iy+1), value=15.0, min=7.0,  max=23.0)
    fit_params.add( 'sd2_%i' % (iy+1), value=15.0, min=7.0,  max=23.0)
    fit_params.add( 'y0_%i' % (iy+1), value=1000.0, min=300.0, max=1500.0)

# UNCOMMENT FOR GLOBAL FIT
#for iy in range(2, data_sets+1): 
    #fit_params['mean2_%i' % iy].expr='mean2_1'


# run the global fit to all the data sets
minimize(objective, fit_params, args=(x, data))

# plot the data sets and fits
plt.figure()
print('mean1\tsd1\tA1\tmean2\tsd2\tA2\ty0')
for i in range(data_sets):
    print("%0.2f" % fit_params['mean1_%i' % (i+1)].value+'\t'+"%0.2f" % fit_params['sd1_%i' % (i+1)].value+'\t'+"%0.0f" % fit_params['A1_%i' % (i+1)].value+'\t'+"%0.2f" % fit_params['mean2_%i' % (i+1)].value+'\t'+"%0.2f" % fit_params['sd2_%i' % (i+1)].value+'\t'+"%0.0f" % fit_params['A2_%i' % (i+1)].value+'\t'+"%0.2f" % fit_params['y0_%i' % (i+1)].value, end="\n")
if show_plot == 1:
    for i in range(data_sets):
        y_fit = gauss_dataset(fit_params, i, x)
        y_fit_a = gauss_dataset_a(fit_params, i, x)
        y_fit_b = gauss_dataset_b(fit_params, i, x)
        plt.plot(x, data[i, :], 'o', x, y_fit, '-')
        plt.plot(x, data[i, :], 'o', x, y_fit_a, '-')
        plt.plot(x, data[i, :], 'o', x, y_fit_b, '-')
        plt.show()

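So, how can I improve my code? Does the global fit really return the wrong mean? It is fairly close to 25, but I have no tool to check it. Also, is it normal that my values are a bit off from the true ones? For example, I don't get mean2 equal to 25; it is ~25.5 for every data set.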
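
One rough way to probe that offset, offered as a sketch rather than a confirmed diagnosis: compare the independently fitted means with the expected ones, and with the step of the script's x grid. It assumes the test data were generated at unit spacing in x (one row per nm, as the first data column suggests). `x_unit` is a hypothetical alternative grid, not something from the original script.

```python
import numpy as np

# x grid exactly as defined in the script: 50 points from 0 to 50 inclusive.
x_script = np.linspace(0, 50, 50)
step = x_script[1] - x_script[0]  # spacing is 50/49, not 1

# Means of the moving peak: expected values vs. the independent-fit results
# (taken from whichever column, mean1 or mean2, the fit placed them in).
true_mean = np.array([12, 21, 30, 39, 48])
fitted_mean = np.array([12.24, 21.43, 30.61, 39.80, 48.98])
ratio = fitted_mean / true_mean

print("grid step:", step)
print("fitted / expected:", ratio)

# Assumption: if the data really are sampled at unit spacing,
# a grid like this one would match them.
x_unit = np.arange(50)
```

If that assumption holds, the ratios all sit near 50/49 ≈ 1.0204, which is the spacing of `np.linspace(0, 50, 50)`. The same factor would carry 25 to about 25.5 and an sd of 10 to about 10.2.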

1 Answer:

Answer 0 (score: 0)

First, here is a plot of your data: [image]

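When you start with identical parameters for both Gaussian curves, it is obvious that the computer cannot know which one should correspond to which peak in the data. So, what can you do?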

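  1. Start one mean at a low value and the other at a high value. This is likely to work, but not 100% guaranteed.
  2. Simply compare the two peaks after the fit and, if the order is wrong, swap them.
  3. If you know that the first peak is always around 25, hold its mean fixed, fit all the other parameters, and then refit with the previously fixed mean free as well. This usually works because after the first fit the other parameters are already very close to their final values, so the changes during the second fit are small.
  4. I can also confirm that there is an offset of about 1, at least for column 2. It looks like the x values used in the Gaussian function differ from the x values of the data.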
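
Suggestion 2 can be sketched as a small post-fit relabeling step. This is only an illustration under stated assumptions: the `relabel` function and the `ref_mean2` argument are hypothetical, the fitted values are held in a plain dict keyed with the script's naming convention (`mean1_3`, `sd2_3`, ...), and the shared peak is taken to be the one whose mean lies closest to 25, as in the test data.

```python
def relabel(values, i, ref_mean2=25.0):
    """Swap components 1 and 2 of data set i (1-based) in place
    when component 1 is the one lying closer to ref_mean2.

    Hypothetical helper: 'values' is a plain dict of fitted numbers,
    keyed like the script's parameter names.
    """
    m1 = values['mean1_%i' % i]
    m2 = values['mean2_%i' % i]
    if abs(m1 - ref_mean2) < abs(m2 - ref_mean2):
        # Exchange mean, sd and A together so each peak stays consistent.
        for name in ('mean', 'sd', 'A'):
            k1 = '%s1_%i' % (name, i)
            k2 = '%s2_%i' % (name, i)
            values[k1], values[k2] = values[k2], values[k1]
    return values


# Independent-fit result for data set 3, where the labels came out swapped.
fitted = {
    'mean1_3': 25.51, 'sd1_3': 20.40, 'A1_3': 30599,
    'mean2_3': 30.61, 'sd2_3': 10.21, 'A2_3': 27552,
    'y0_3': 500.16,
}
relabel(fitted, 3)
print(fitted['mean1_3'], fitted['mean2_3'])  # 30.61 25.51
```

The relabeled values could then serve as starting values for the global fit, so that `mean2` refers to the same peak in every data set before it is tied together. Comparing against a reference mean only makes sense when the shared peak position is roughly known in advance; comparing the sd values would be another possible criterion.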