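I am trying to run a logit regression on the German credit data (www4.stat.ncsu.edu/~boos/var.select/german.credit.html). To test the code, I kept only the numeric variables and tried to regress the outcome on them with the following code.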
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
df = pd.read_csv("germandata.txt",delimiter=' ')
df.columns = ["chk_acc","duration","history","purpose","amount","savings_acc","employ_since","install_rate","pers_status","debtors","residence_since","property","age","other_plans","housing","existing_credit","job","no_people_liab","telephone","foreign_worker","admit"]
#pls note that I am only retaining numeric variables
cols_to_keep = ['admit','duration', 'amount', 'install_rate','residence_since','age','existing_credit','no_people_liab']
# rank of the cols_to_keep matrix is 8
print(np.linalg.matrix_rank(df[cols_to_keep].values))
data = df[cols_to_keep].copy()  # copy so that adding a column does not trigger SettingWithCopyWarning
data['intercept'] = 1.0
train_cols = data.columns[1:]
# check the rank of the train_cols matrix, which in this case is 8
print(np.linalg.matrix_rank(data[train_cols].values))
#fit logit model
logit = sm.Logit(data['admit'], data[train_cols])
result = logit.fit()
When I check the data, all 8 columns are linearly independent. Even so, I get a singular matrix error. Can you help?
Thanks
Answer 0 (score: 9)
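The endog (y) variable needs to be coded as zero/one. In this dataset it takes the values 1 and 2. If we subtract 1, the model fits and produces results.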
>>> logit = sm.Logit(data['admit'] - 1, data[train_cols])
>>> result = logit.fit()
>>> print(result.summary())
                           Logit Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  999
Model:                          Logit   Df Residuals:                      991
Method:                           MLE   Df Model:                            7
Date:                Fri, 19 Sep 2014   Pseudo R-squ.:                 0.05146
Time:                        10:06:06   Log-Likelihood:                -579.09
converged:                       True   LL-Null:                       -610.51
                                        LLR p-value:                 4.103e-11
===================================================================================
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
duration            0.0261      0.008      3.392      0.001         0.011     0.041
amount           7.062e-05    3.4e-05      2.075      0.038      3.92e-06     0.000
install_rate        0.2039      0.073      2.812      0.005         0.062     0.346
residence_since     0.0411      0.067      0.614      0.539        -0.090     0.172
age                -0.0213      0.007     -2.997      0.003        -0.035    -0.007
existing_credit    -0.1560      0.130     -1.196      0.232        -0.412     0.100
no_people_liab      0.1264      0.201      0.628      0.530        -0.268     0.521
intercept          -1.5746      0.430     -3.661      0.000        -2.418    -0.732
===================================================================================
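Subtracting 1 works here because the outcome happens to be coded 1/2. A slightly more defensive way to do the same recoding is sketched below. The helper name `to_binary` and the toy series are assumptions made for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'admit' column, coded 1/2 as in the question.
admit = pd.Series([1, 2, 1, 1, 2, 2, 1], name="admit")

def to_binary(y):
    """Map a two-valued series onto 0/1 (lower value -> 0, higher value -> 1)."""
    levels = np.sort(y.unique())
    if len(levels) != 2:
        raise ValueError("expected exactly two distinct values, got %d" % len(levels))
    return (y == levels[1]).astype(int)

y01 = to_binary(admit)
print(y01.tolist())
```

For a 1/2 coded column this gives the same values as `admit - 1`, but it raises an error instead of silently passing an invalid response to `sm.Logit` when the column has more than two levels.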
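In other cases, however, the Hessian may not be positive definite when it is evaluated far from the optimum, for example at poor starting values. In those cases, switching to an optimizer that does not use the Hessian often succeeds. For example, scipy's 'bfgs' is a good optimizer that works in many cases: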
result = logit.fit(method='bfgs')
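To illustrate what "not using the Hessian" means, here is a minimal sketch that fits a logistic regression by passing only the negative log-likelihood and its gradient to scipy's BFGS routine. The synthetic data, the coefficient values, and the function names are assumptions for illustration; this is not how statsmodels implements `fit` internally.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Synthetic data (assumed for illustration): two features plus an intercept column.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([rng.normal(size=(n, 2)), np.ones(n)])
true_beta = np.array([1.0, -0.5, 0.25])
y = (rng.random(n) < expit(X @ true_beta)).astype(float)

def neg_loglike(beta):
    """Negative logit log-likelihood: sum(log(1 + exp(z)) - y * z)."""
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)

def neg_score(beta):
    """Gradient of the negative log-likelihood: X' (sigmoid(z) - y)."""
    return X.T @ (expit(X @ beta) - y)

# BFGS builds its own curvature approximation from successive gradients,
# so the analytic Hessian is never evaluated or inverted.
res = minimize(neg_loglike, x0=np.zeros(3), jac=neg_score, method="BFGS")
print(res.x)
```

Because only gradients are needed, a Hessian that is not positive definite at the starting values cannot stop the iteration the way it can in a Newton step.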
Answer 1 (score: 0)
I managed to solve this by removing low-variance columns:
from sklearn.feature_selection import VarianceThreshold
def variance_threshold_selector(data, threshold=0.5):
    # https://stackoverflow.com/a/39813304/1956309
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]
# min_variance = .9 * (1 - .9) # You can play here with different values.
min_variance = 0.0001
low_variance = variance_threshold_selector(df, min_variance)
print('columns removed:')
print(df.columns.symmetric_difference(low_variance.columns))
print(df.shape, low_variance.shape)  # shape before and after filtering
X = low_variance
# (Logit(y_train, X), logit.fit()... etc)
To give a bit more context: before this step I had one-hot encoded some categorical data, and some of the resulting columns contained very few 1s.
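The link between rare categories and low variance can be checked directly: a 0/1 column whose share of 1s is p has variance p(1 - p), which is what the commented-out `.9 * (1 - .9)` value above refers to. The sketch below uses made-up toy data and an assumed cutoff of 0.02 to flag such columns with pandas alone.

```python
import pandas as pd

# Hypothetical toy data: the category 'boat' appears only once in 100 rows.
df_toy = pd.DataFrame({"purpose": ["car"] * 60 + ["tv"] * 39 + ["boat"] * 1})
dummies = pd.get_dummies(df_toy["purpose"], prefix="purpose", dtype=float)

p = dummies.mean()          # share of 1s in each dummy column
variance = p * (1.0 - p)    # population variance of a 0/1 column
threshold = 0.02            # assumed cutoff; tune it for your data
rare = variance[variance < threshold].index.tolist()
print(rare)
```

Listing the flagged columns before dropping them makes it easier to decide whether a rare category should be removed or merged into another level.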