The correct way to fit a log-normal distribution with weights in Python

Date: 2018-07-18 20:02:48

Tags: python numpy scipy

Currently, I have code that fits a log-normal distribution.

shape,  loc,  scale  = sm.lognorm.fit(dataToLearn, floc = 0)

for b in bounds:
    toPlot.append((b, currCount+sm.lognorm.ppf(b, s = shape, loc = loc, scale = scale)))

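I would like to be able to pass a vector of weights to the fit. Currently, I have a workaround where I round all the weights to 2 decimal places and then repeat each value 100 * w times so that it is weighted correctly.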

for i, d in enumerate(dataToLearn):
    dataToLearn2 += int(w[i] * 100) * [d]

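This is too slow to run on my computer, so I am hoping there is a more correct solution.

Please advise whether SciPy or NumPy can be used to make my workaround faster and more efficient.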

2 Answers:

Answer 0: (score: 2)

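The SciPy distributions do not implement a weighted fit. However, for the log-normal distribution (with the location fixed at 0, as in the question), there are explicit formulas for the (unweighted) maximum likelihood estimation, and these are easily generalized to weighted data. The explicit formulas are all (in effect) averages, and the generalization to weighted data is to use weighted averages in those formulas.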
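Written out, the weighted estimates described above take the following form. This is a sketch under the assumptions that the location is fixed at 0, the weights $w_i$ are nonnegative with a positive sum, and $\hat{\mu}$, $\hat{\sigma}$ denote the mean and standard deviation of $\ln x$ (these symbols are introduced here for illustration):

$$
\hat{\mu} = \frac{\sum_i w_i \ln x_i}{\sum_i w_i}, \qquad
\hat{\sigma}^2 = \frac{\sum_i w_i \left(\ln x_i - \hat{\mu}\right)^2}{\sum_i w_i}, \qquad
\text{shape} = \hat{\sigma}, \quad \text{scale} = e^{\hat{\mu}}
$$

Setting every $w_i = 1$ recovers the unweighted formulas, and multiplying all weights by the same positive constant leaves both estimates unchanged.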

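Here is a script that demonstrates the calculation using a small data set with integer weights, so we know what the exact values of the fitted parameters should be.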

import numpy as np
from scipy.stats import lognorm


# Sample data and weights.  To enable an exact comparison with
# the method of generating an array with the values repeated
# according to their weight, I use an array of weights that is
# all integers.
x = np.array([2.5, 8.4, 9.3, 10.8, 6.8, 1.9, 2.0])
w = np.array([  1,   1,   2,    1,   3,   3,   1])


#-----------------------------------------------------------------------------
# Fit the log-normal distribution by creating an array containing the values
# repeated according to their weight.
xx = np.repeat(x, w)

# Use the explicit formulas for the MLE of the log-normal distribution.
lnxx = np.log(xx)
muhat = np.mean(lnxx)
varhat = np.var(lnxx)

shape = np.sqrt(varhat)
scale = np.exp(muhat)

print("MLE using repeated array: shape=%7.5f   scale=%7.5f" % (shape, scale))


#-----------------------------------------------------------------------------
# Use the explicit formulas for the weighted MLE of the log-normal
# distribution.

lnx = np.log(x)
muhat = np.average(lnx, weights=w)
# varhat is the weighted variance of ln(x).  There isn't a function in
# numpy for the weighted variance, so we compute it using np.average.
varhat = np.average((lnx - muhat)**2, weights=w)

shape = np.sqrt(varhat)
scale = np.exp(muhat)

print("MLE using weights:        shape=%7.5f   scale=%7.5f" % (shape, scale))


#-----------------------------------------------------------------------------
# Might as well check that we get the same result from lognorm.fit() using the
# repeated array

shape, loc, scale = lognorm.fit(xx, floc=0)

print("MLE using lognorm.fit:    shape=%7.5f   scale=%7.5f" % (shape, scale))

The output is

MLE using repeated array:  shape=0.70423   scale=4.57740
MLE using weights:         shape=0.70423   scale=4.57740
MLE using lognorm.fit:     shape=0.70423   scale=4.57740
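The same formulas also work when the weights are not integers, which is the case the question asks about. The sketch below wraps them in a helper; the name `fit_lognorm_weighted` and the fractional weights are made up for illustration and are not part of SciPy. Because the weights here are the integer weights above scaled by 0.25, the fit should reproduce the same shape and scale.

```python
import numpy as np
from scipy.stats import lognorm


def fit_lognorm_weighted(x, w):
    """Weighted MLE of a log-normal distribution with loc fixed at 0.

    Hypothetical helper (not a SciPy function). Returns (shape, scale)
    in the parameterization used by scipy.stats.lognorm.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    lnx = np.log(x)
    # Weighted mean and weighted variance of ln(x).
    muhat = np.average(lnx, weights=w)
    varhat = np.average((lnx - muhat) ** 2, weights=w)
    return np.sqrt(varhat), np.exp(muhat)


x = np.array([2.5, 8.4, 9.3, 10.8, 6.8, 1.9, 2.0])
# Illustrative non-integer weights: the integer weights above times 0.25.
w = np.array([0.25, 0.25, 0.5, 0.25, 0.75, 0.75, 0.25])

shape, scale = fit_lognorm_weighted(x, w)
print("Weighted MLE (float weights): shape=%7.5f   scale=%7.5f" % (shape, scale))

# The fitted parameters can be passed to lognorm.ppf, as in the question.
quantiles = lognorm.ppf([0.1, 0.5, 0.9], s=shape, loc=0, scale=scale)
print(quantiles)
```

No repeated array is built, so the cost is linear in the number of distinct observations regardless of how finely the weights are resolved.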

Answer 1: (score: 1)

You can use numpy.repeat to make your workaround more efficient:

import numpy as np

dataToLearn = np.array([1,2,3,4,5])
weights = np.array([1,2,1,1,3])

print(np.repeat(dataToLearn, weights))
# Output: [1 2 2 3 4 5 5 5]

I ran a very basic test of the performance of numpy.repeat:

import timeit

code_before = """
weights = np.array([1,2,1,1,3] * 1000)
dataToLearn = np.array([1,2,3,4,5] * 1000)
dataToLearn2 = []
for i, d in enumerate(dataToLearn):
    dataToLearn2 += int(weights[i]) * [d]
"""

code_after = """
weights = np.array([1,2,1,1,3] * 1000)
dataToLearn = np.array([1,2,3,4,5] * 1000)
np.repeat(dataToLearn, weights)
"""

print(timeit.timeit(code_before, setup="import numpy as np", number=1000))
print(timeit.timeit(code_after, setup="import numpy as np", number=1000))

As a result, the current approach (the loop) takes about 3.38 seconds, while numpy.repeat takes about 0.75 seconds.
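numpy.repeat needs integer repeat counts, so the two-decimal weights from the question still have to be converted first. The sketch below uses made-up data and weights, and rounds with np.rint rather than truncating, since truncation (as with int(w[i] * 100)) can lose a count to floating-point error:

```python
import numpy as np

# Illustrative data and two-decimal weights (not from the question).
dataToLearn = np.array([1.5, 2.0, 3.2])
w = np.array([0.29, 1.0, 0.57])

# Truncation: 0.29 * 100 evaluates to 28.999999999999996, so it becomes 28.
counts_truncated = (w * 100).astype(int)

# Rounding to the nearest integer gives the intended count of 29.
counts_rounded = np.rint(w * 100).astype(int)

dataToLearn2 = np.repeat(dataToLearn, counts_rounded)
print(counts_truncated[0], counts_rounded[0], dataToLearn2.size)
```

The repeated array still grows with the resolution of the weights, so this only speeds up the workaround; it does not remove the cost of building the array.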