python sklearn logistic regression predicts all 0s

Time: 2018-04-02 13:38:48

Tags: python regression

I have built a logistic regression for car loans with "is the loan in default, yes or no" as the binary dependent variable. I am using about 20 independent variables, and the dataset contains 3,327 records.

I split the underlying data into a training set and a test set. However, after fitting the model on the training data and asking it to predict the test data, the output is all '0's, when there should be some '1's: in the training set the binary default variable is '1' (default) rather than '0' (no default) about 12% of the time.
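
For illustration, here is a minimal sketch on synthetic data (not my loan dataset) of how a ~12% positive class interacts with LogisticRegression's defaults, and of the class_weight='balanced' option I have seen suggested:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

## synthetic stand-in for the loan data: ~12% positives, 20 features
X_s, y_s = make_classification(n_samples=3327, n_features=20, weights=[0.88, 0.12], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_s, y_s, test_size=0.3, random_state=0)

plain = LogisticRegression().fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight='balanced').fit(X_tr, y_tr)
print(np.bincount(plain.predict(X_te)))     ## counts of predicted 0s and 1s
print(np.bincount(balanced.predict(X_te)))  ## 'balanced' reweights the rare class upward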

I have looked at the training and test sets, and both look fine before the split (no missing values, categorical variables encoded as dummies, and the train/test subsets select records at random), so there is no fault there that I can see.

Interestingly, the predict_proba function says the predicted probability of getting a '0' is always high for every output element (0.7-0.9). I am unsure how best to correct this, since I would rather keep the default threshold at 0.5, but I am not sure how to clean up this mess.
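
If I did decide to move off the 0.5 cutoff, my understanding is that it would be applied by hand on top of predict_proba rather than inside the model. A minimal sketch on synthetic data, with a hypothetical 0.3 cutoff that would need tuning on validation data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_s, y_s = make_classification(n_samples=500, weights=[0.88, 0.12], random_state=0)
clf = LogisticRegression().fit(X_s, y_s)

probs_s = clf.predict_proba(X_s)[:, 1]  ## column 1 is P(class 1), i.e. P(default)
threshold = 0.3                         ## hypothetical cutoff, not a recommended value
y_pred_custom = (probs_s >= threshold).astype(int)
print(np.bincount(y_pred_custom))       ## more 1s appear as the cutoff drops below 0.5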

Is it simply that I need more data given the number of independent variables, or am I missing something / doing something wrong?
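
One check that might separate "not enough data/signal" from a pure threshold effect is scoring with ROC AUC, which ignores the 0.5 cutoff and only asks whether the model ranks defaults above non-defaults. A sketch, again on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_s, y_s = make_classification(n_samples=3327, n_features=20, weights=[0.88, 0.12], random_state=0)
## AUC well above 0.5 would mean the model does rank defaults higher,
## even when predict() still returns all 0s at the default cutoff
print(cross_val_score(LogisticRegression(), X_s, y_s, cv=5, scoring='roc_auc'))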

Thanks!

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt 
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split ## sklearn.cross_validation was deprecated and later removed
import statsmodels.api as sm
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)


#open the file
data = pd.read_csv(r"log reg test Lending club 2007-2011 and 2014 car only no dummy trap.csv")
print(data.shape)
##print(list(data.columns))

print(data['Distressed'].value_counts()) ## check counts of defaulted vs non-defaulted car loans

sns.countplot(x='Distressed', data=data, palette='hls')
plt.show() ## confirm dependent variable is binary


##basic numerical analysis of variables to check feasibility for model
## we will need to create dummy variables for strings
#print(data.groupby('Distressed').mean()) ##numerical variable means
#print(data.groupby('grade').mean()) ## string variable means
#print(data.groupby('sub_grade').mean())
#print(data.groupby('emp_length').mean())
#print(data.groupby('home_ownership').mean())

##testing for nulls in dataset
print('Missing data points per column', data.isnull().sum())
scrub_data=data.drop(['mths_since_last_delinq'], axis=1) ## this variable is not statistically significant

print('Here is the sample showing no missing data')
print(scrub_data.isnull().sum()) ## dropped the column with missing info, sample still sufficiently large
#scrub_data['intercept']=0 
print(list(scrub_data.columns))
print(scrub_data.head())

##categorical variables were already converted to dummies in the csv file
## Agrade and Own dummies removed to avoid dummy variable trap and are treated as the base case here

X=scrub_data.iloc[:,[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,22]].values ## .ix was removed from pandas; .iloc selects columns by position
y=scrub_data.iloc[:,0].values




X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.3, random_state=0) 


print('Here are the X components', X) 
print('Here are the y components', y) 
print('Here are the X values of the training', X_train) 
print('Here are the y train values', y_train)
print('Here are the y test values', y_test) 

model=LogisticRegression()
model.fit(X_train,y_train) ##Model is learning the relationship between X_train and y_train
y1_pred=model.predict(X_train)
print('y predict of train data', y1_pred)

print('Here is the Model Score', model.score(X_train,y_train)) ##check accuracy of training set
print('What percentage defaulted', y_train.mean()) ##what percentage defaulted
print('What percentage of test set defaulted', y_test.mean()) ##what percentage defaulted

print('X test values', X_test) ## check test subset values
y_pred=model.predict(X_test) 
probs=model.predict_proba(X_test)
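
For reference, a few diagnostics that could be appended to the script above (using the y_test, y_pred and probs already defined) to see where the all-'0' predictions come from:

from sklearn.metrics import confusion_matrix, classification_report

print('Confusion matrix (rows = actual, columns = predicted)')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred)) ## precision/recall for each class
print('Highest predicted P(default):', probs[:, 1].max()) ## if below 0.5, predict() will never output 1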

0 Answers:
