我正在使用scikit在python中进行逻辑回归研究。 我有可以通过以下链接下载的数据文件。
下面是我的机器学习代码部分。
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()
data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values
X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)
lasso = Lasso(alpha=.3)
lasso.fit(X_train,y_train)
print("MC learning completed")
print(lasso.score(X_train,y_train))
print(lasso.score(X_test,y_test))
print(lasso.coef_)
当我打印系数时,结果全为零。 有人可以建议我吗?
让我解释一下我的目标。这个问题似乎是分类问题,因为我们在Ytrain和Ytest中只能看到0或1。如果我们举一个简单的例子,可以将0视为缺失,将1视为得分。我想做的是计算发生射击时每个事件的概率得分。
谢谢。
此致
Zep
答案 0 :(得分:1)
我只是在套索中更改alpha: my result
答案 1 :(得分:1)
您的Y
变量仅包含0
和1
个。如果您仍然想对该数据进行回归分析,请对不同的alpha参数使用GridSearch。
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()
data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values
X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)
lasso = Lasso(alpha=.0009)
lasso.fit(X_train,y_train)
print("MC learning completed")
print(lasso.score(X_train,y_train))
print(lasso.score(X_test,y_test))
print(lasso.coef_)
结果
MC learning completed
0.37884924358295613
0.3806187071242917
[ 0.00078099 0.13397938 -0.00554932 0.00194722 0.00232949 -0.01100195
-0.01363906 0.13031317 -0.00146605]
GridSearchCV
from sklearn.model_selection import GridSearchCV
import numpy as np
# Define the grid for the alpha parameter
parameters = {'alpha':[0.01, 0.001, 0.0005]}
# Fit it on X, Y and define the cv parameter for cross-validation
clf = GridSearchCV(lasso, parameters, cv = 3)
clf.fit(X, Y)
# Get the best parameters and model
print(clf.best_estimator_)
注意:要定义特定的参数空间,请使用:parameters = {'alpha': np.arange(0.001,1,0.02)}
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()
data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values
X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)
# Logistic Regression (aka logit, MaxEnt) classifier.
lr = LogisticRegression()
lr.fit(X_train,y_train)
# Predict the probability of the testing samples to belong to 0 or 1 class
predicted_probs = lr.predict_proba(X_test)
print(predicted_probs[0:3])
# The proba of the first testing sample to belong to class 0 is 0.8704 and to class 1 0.1295
[[0.87046267 0.12953733]
[0.87797594 0.12202406]
[0.80046704 0.19953296]]
答案 2 :(得分:0)
Y中的数据看起来像类。它们是0或1。因此,您应该使用分类算法,然后使用coeff来获得概率。
大多数scikit分类器都有一个predict_proba()
,您可以直接使用获取概率。
如果需要绝对使用回归模型,则可以尝试使用LinearRegression(将使用普通最小二乘法)或LassoCV
(将自动调整alpha以适应需要)。< / p>