使用scikit Learn在Logistic回归中所有系数均变为零

时间:2018-07-18 09:02:17

标签: python scikit-learn logistic-regression

我正在使用scikit在python中进行逻辑回归研究。 我有可以通过以下链接下载的数据文件。

link for data

下面是我的机器学习代码部分。

from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()

data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X     = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values

X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)
lasso = Lasso(alpha=.3)
lasso.fit(X_train,y_train)
print("MC learning completed")
print(lasso.score(X_train,y_train))
print(lasso.score(X_test,y_test))
print(lasso.coef_)

当我打印系数时,结果全为零。 有人可以建议我吗?

让我解释一下我的目标。这个问题似乎是分类问题,因为我们在Ytrain和Ytest中只能看到0或1。如果我们举一个简单的例子,可以将0视为缺失,将1视为得分。我想做的是计算发生射击时每个事件的概率得分。

谢谢。

此致

Zep

3 个答案:

答案 0 :(得分:1)

我只是在套索中更改alpha: my result

答案 1 :(得分:1)

您的Y变量仅包含01个。如果您仍然想对该数据进行回归分析,请对不同的alpha参数使用GridSearch。

from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()

data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X     = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values

X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)
lasso = Lasso(alpha=.0009)
lasso.fit(X_train,y_train)
print("MC learning completed")
print(lasso.score(X_train,y_train))
print(lasso.score(X_test,y_test))
print(lasso.coef_)

结果

MC learning completed
0.37884924358295613
0.3806187071242917
[ 0.00078099  0.13397938 -0.00554932  0.00194722  0.00232949 -0.01100195
 -0.01363906  0.13031317 -0.00146605]

GridSearchCV

from sklearn.model_selection import GridSearchCV
import numpy as np

# Define the grid for the alpha parameter
parameters = {'alpha':[0.01, 0.001, 0.0005]}

# Fit it on X, Y and define the cv parameter for cross-validation
clf = GridSearchCV(lasso, parameters, cv = 3)
clf.fit(X, Y)

# Get the best parameters and model
print(clf.best_estimator_)

注意:要定义特定的参数空间,请使用:parameters = {'alpha': np.arange(0.001,1,0.02)}


编辑1:在考虑了您刚刚在问题中添加的最后一段之后,请使用以下内容:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()

data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X     = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values

X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)

# Logistic Regression (aka logit, MaxEnt) classifier.
lr = LogisticRegression()
lr.fit(X_train,y_train)

# Predict the probability of the testing samples to belong to 0 or 1 class
predicted_probs = lr.predict_proba(X_test)
print(predicted_probs[0:3])

# The proba of the first testing sample to belong to class 0 is 0.8704 and to class 1 0.1295
[[0.87046267 0.12953733]
 [0.87797594 0.12202406]
 [0.80046704 0.19953296]]

答案 2 :(得分:0)

Y中的数据看起来像类。它们是0或1。因此,您应该使用分类算法,然后使用coeff来获得概率。

大多数scikit分类器都有一个predict_proba(),您可以直接使用获取概率。

如果需要绝对使用回归模型,则可以尝试使用LinearRegression(将使用普通最小二乘法)或LassoCV(将自动调整alpha以适应需要)。< / p>