Question

我有一个包含日期的Python数组，表示特定年份中出现现象的次数。该向量包含200个不同的日期，每个日期重复一定次数。重复是该现象的发生次数。我设法使用以下代码片段计算并使用matplotlib绘制累积总和：

counts = arange(0, len(list_of_dates))
# Add the cumulative sum to the plot (list_of_dates contains repetitions)
plt.plot(list_of_dates, counts, linewidth=3.0)

Cumulative sum (in blue) per date

在蓝色中，您可以看到描绘累积总和的曲线，在其他颜色中可以看到我想要获得的参数。但是，我需要蓝色曲线的数学表示才能获得这些参数。我知道这种类型的曲线可以使用逻辑回归进行调整，但是，我不明白如何在Python中这样做。

首先我尝试使用Scikit-learn中的LogisticRegression，但后来我意识到他们似乎正在使用这个模型进行机器学习classification（和其他类似的东西），这不是我想要的。

然后我想我可以直接去逻辑函数的定义并尝试自己构建它。我找到了this thread，建议使用scipy.special.expit来计算曲线。看来这个功能已经实现了，所以我决定使用它。所以我这样做了：

target_vector = dictionary.values() Y = expit(target_vector) plt.plot(list_of_dates, y, linewidth=3.0)

我得到了一个带有209个元素（与target_vector相同）的向量，如下所示：[ 1. 0.98201379 0.95257413 0.73105858 ... 0.98201379 1. ]。然而，图形输出看起来像是一个孩子在刮纸，而不是像图片中那样漂亮的S形曲线。

我还检查了其他Stack Overflow线程（this，this），但我想我需要做的只是一个与之相比的玩具示例。我只需要数学公式来计算一些快速简单的参数。

有没有办法做到这一点并获得S形函数的数学表示？

非常感谢！

Answer 1

您提到的情节可能看起来很糟糕有几个原因。

第一个是因为dictionary.values()以未排序的顺序返回值。如果你做了类似的事情会发生什么（未经测试，因为我没有你的字典）：

target_pairs = sorted(dictionary.iteritems()) #should be a sorted list of (date, count)
target_vector = [count for (date, count) in target_pairs]

并查看结果target_vector？它现在应该增加。

从那里到逻辑函数需要更多的工作：你需要规范化target_vector以使值位于[0,1]，然后应用scipy.special.logit（这将变为sigmoid on [0,1]成直线），然后你可以找到最合适的线。然后，您可以恢复逻辑模型的参数：

y = C * sigmoid(m*x + b)

m和b是转换数据的线性回归的斜率和截距，C是您对数据进行标准化时除以的事项。

Answer 2

使用this post和昨天发布的评论，我想出了以下代码：

from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import normalize # Added this new line

# This is how I normalized the vector. "ydata" looked like this:
# original_ ydata = [ 1, 3, 8, 14, 12, 27, 33, 36, 87, 136, 77, 57, 32, 31, 28, 24, 12, 2 ]
# The curve was NOT fitting using this values, so I found a function in 
# scikit-learn that normalizes (multidim) arrays: [normalize][2]

# m = []
# m.append(original_ydata)
# ydata = normalize(m, norm='l2') * 10

# Why 10? This function is converting my original values in a range 
# going from [0.00014, ..., 0.002 ] or something similar. So "curve_fit" 
# couldn't find anything but a horizontal line crossing y = 1. 
# I tried multiplying by 5, 6, ..., 12, and I realized that 10 is 
# the maximum value that lets the maximum value of my array below 1.00, like 0.97599. 

# Length of both arrays is 209
# Y-axis data has been normalized BUT then multiplied by 10
ydata = array([  5.09124776e-04,   1.01824955e-03, ... , 9.75992196e-01])
xdata = array(range(0,len(ydata),1))

def sigmoid(x, x0, k):
    y = 1 / (1+ np.exp(-k*(x-x0)))
    return y

popt, pcov = curve_fit(sigmoid, xdata, ydata)

x = np.linspace(0, 250, 250)
y = sigmoid(x, *popt)

plt.plot(xdata, ydata, 'o', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.ylim(0, 1.25)
plt.legend(loc='best')

# This (m, b, C) parameters not sure on where they are... popt, pcov? 
# y = C * sigmoid(m*x + b)

此程序会创建您可以在下面看到的图表。你可以看到这是一个公平的调整，但我想如果我在sigmoid函数中改变Y的定义，通过添加C乘以前1，可能我会得到更好的调整。还在那。

Sigmoid curve fitting

似乎规范化数据（如Ben Kuhn在评论中所建议的）是必需的步骤，否则曲线不会被创建。但是，如果将值标准化为非常低的值（接近零），则也不会绘制曲线。所以我将每10的归一化向量相乘，将其缩放到更大的单位。然后程序只是找到了曲线。我无法解释原因，因为我对此完全是新手。请注意，这只是我个人的经历，我不是说这是一个规则。

如果我打印popt和pcov，我会获得：

#> print popt
[  8.56332788e+01   6.53678132e-02]

#> print pcov
[[  1.65450283e-01   1.27146184e-07]
 [  1.27146184e-07   2.34426866e-06]]

并且documentation on curve_fit表示这些参数包含＆＃34;参数的最佳值，以便最小化平方误差的总和＆＃34; 和协方差上一个参数。

这6个值中的任何一个都是用于表征S形曲线的参数吗？因为如果是这样，那么问题就很难解决了！： - ）

非常感谢！

Python + scipy中sigmoid回归的参数

2 个答案: