Question

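Use case: choosing the "best threshold" for a logistic model built with statsmodels' Logit, in order to predict a binary class (or a multinomial one, but with integer classes).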

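Is there something built in for choosing the threshold of a (say, logistic) model in Python? For small data sets, I remember optimizing the "threshold" by picking the largest buckets of correctly predicted labels (true "0" and true "1"), which is best seen from the graph here: http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test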

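I also know intuitively that if I set an alpha value, it should give me a "threshold" that I can use below. How should I compute the threshold for a reduced model whose variables are all at 95% confidence? Clearly, setting threshold > 0.5 -> "1" would be too naive, because at 95% confidence the threshold should be "smaller", meaning p > 0.2 or something like that.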

This would then mean something like a "range of critical values", within which the label should be "1", and "0" otherwise.

What I want is something like this:

import numpy as np
import statsmodels.formula.api as smf
from scipy.stats import ks_2samp

test_scores = smf.Logit(y_train, x_train, missing='drop').fit()
threshold = 0.2
# test_scores.predict(x_train, transform=False) gives the continuous predicted probability, so to turn it into labels I need to compare it against a threshold (or use x_test if I am testing the model)
y_predicted_train = np.array(test_scores.predict(x_train, transform=False) > threshold, dtype=float)
table = np.histogram2d(y_train, y_predicted_train, bins=2)[0]
# will do the same on the "test" data


# crude way of selecting an optimal threshold
ks_2samp(y_train, y_predicted_train)
# (0.39963996399639962, 0.958989)
# must get < 95% here & keep modifying the threshold as above until I fail to reject the null at 95%

# where y_train holds the REAL values & y_predicted is computed back on the TRAIN data set. Note that to get y_predicted as binary labels, I have already thresholded it as above.
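One point about the snippet above: once the probabilities have been thresholded, both arrays passed to ks_2samp contain only 0/1 labels, so the test compares little more than the share of ones. A minimal sketch of the more usual K-S reading follows, where ks_2samp is applied to the predicted probabilities split by true class. It uses synthetic data in place of the real model output; the names `y_true`, `p_hat`, and `ks_threshold` are hypothetical and not part of the question.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical stand-ins for y_train and test_scores.predict(x_train)
y_true = rng.integers(0, 2, size=500)
# Scores of true "1" rows are shifted upward so the classes are separable
p_hat = np.clip(rng.normal(0.35 + 0.3 * y_true, 0.15), 0.0, 1.0)

# K-S statistic: largest gap between the score CDFs of the two classes
result = ks_2samp(p_hat[y_true == 1], p_hat[y_true == 0])
ks_stat, p_value = result.statistic, result.pvalue

# Locate the cutoff where that gap occurs by sweeping candidate thresholds
candidates = np.unique(p_hat)
cdf_neg = np.array([(p_hat[y_true == 0] <= t).mean() for t in candidates])
cdf_pos = np.array([(p_hat[y_true == 1] <= t).mean() for t in candidates])
ks_threshold = candidates[np.argmax(np.abs(cdf_neg - cdf_pos))]
```

The cutoff found this way maximizes the difference between the true positive rate and the false positive rate, which treats both kinds of error as equally costly.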

Questions:

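1. How do I choose the threshold in an objective way, i.e. reduce the percentage of misclassified labels? Say I care more about missing a "1" (a false negative) than about wrongly predicting a "0" as "1" (a false positive), and I try to reduce that error. This is what I get from the ROC curve. The ROC curve in statsmodels (roc_curve) assumes that I have already labelled the y_predicted classes and that I am only revalidating this on the test set (please point it out if my understanding is incorrect). I also think that using a confusion matrix will not solve the threshold-picking problem either.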

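2. Which brings me to this: how should I consume the output of these built-in functions (oob, confusion_matrix) so as to choose the best threshold, first on the training sample and then fine-tuning it on the test and cross-validation samples?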
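As one possible way to approach both questions, the sketch below sweeps candidate thresholds over the predicted probabilities and picks the one with the lowest weighted misclassification cost, where a missed "1" is costed more heavily than a false alarm. Everything in it is an assumption for illustration: the data are synthetic, the 5:1 cost ratio is arbitrary, and the names `confusion_counts`, `pick_threshold`, `cost_fn`, `cost_fp`, `p_train`, and `p_test` are hypothetical. The threshold is chosen on the training half and only checked on the held-out half.

```python
import numpy as np

def confusion_counts(y_true, p_hat, threshold):
    """Return (tn, fp, fn, tp) when rows with p_hat > threshold are labelled 1."""
    y_pred = p_hat > threshold
    pos = y_true == 1
    tp = int(np.sum(y_pred & pos))
    fn = int(np.sum(~y_pred & pos))
    fp = int(np.sum(y_pred & ~pos))
    tn = int(np.sum(~y_pred & ~pos))
    return tn, fp, fn, tp

def pick_threshold(y_true, p_hat, cost_fn=5.0, cost_fp=1.0):
    """Return the candidate threshold with the lowest weighted error cost."""
    candidates = np.linspace(0.01, 0.99, 99)
    costs = []
    for t in candidates:
        _, fp, fn, _ = confusion_counts(y_true, p_hat, t)
        costs.append(cost_fn * fn + cost_fp * fp)
    return float(candidates[int(np.argmin(costs))])

# Synthetic labels and probabilities standing in for real model output
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=1000)
p = np.clip(rng.normal(0.35 + 0.3 * y, 0.15), 0.0, 1.0)

# Choose on the first half ("train"), check on the second half ("test")
y_train, p_train = y[:500], p[:500]
y_test, p_test = y[500:], p[500:]

threshold = pick_threshold(y_train, p_train)
tn, fp, fn, tp = confusion_counts(y_test, p_test, threshold)
recall = tp / (tp + fn)  # share of true "1" rows that were caught
```

Because missed "1" rows cost more here, the chosen threshold lands below 0.5, which matches the intuition in the question. For comparison, `sklearn.metrics.roc_curve` takes the true labels together with the continuous scores (not thresholded labels) and returns the false positive rates, true positive rates, and the thresholds that produce them, so it performs a similar sweep.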

I have looked at the official scipy documentation for the K-S test here: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest

Related: Statistics Tests (Kolmogorov and T-test) with Python and Rpy2

Choosing a threshold for a model with binary class labels in Python

0 answers: