I have built a logistic regression for car loans, with "loan defaulted yes/no" as the binary dependent variable. I use roughly 20 independent variables, and the dataset contains 3,327 records.
I split the underlying data into a training set and a test set. However, when I fit the model on the training data and then ask it to predict the test data, I get an output of all '0's, even though there should be some '1's: about 12% of the training set are defaults. The binary variable is '1' for default and '0' for no default.
I have looked at the test and training sets, and both look fine before they go into the model (no missing values, categorical variables are dummy-coded, and the train/test split selects records at random correctly, so no fault there that I can see).
Interestingly, predict_proba shows that for every output element the predicted probability of getting a '0' is always high (0.7-0.9). I am not sure how best to correct this; I would rather keep the decision threshold at the default of 0.5, but I am not sure how to clean up this mess.
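(For reference, my understanding is that predict() simply applies a 0.5 cutoff to the second column of predict_proba, so the all-'0' output and the high P(y=0) values are two views of the same thing. A minimal sketch of that, reusing model and X_test from the script below; the 0.2 cutoff is purely a hypothetical illustration, not something I have run:)

import numpy as np
probs = model.predict_proba(X_test)                 ## column 0 = P(y=0), column 1 = P(y=1)
y_pred_default = (probs[:, 1] >= 0.5).astype(int)   ## effectively what predict() returns
y_pred_lower = (probs[:, 1] >= 0.2).astype(int)     ## a hypothetical lower cutoff would force some '1's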
Is it simply that I need more data given the number of independent variables, or am I missing something / doing something wrong?
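(For context on the sample-size question: ~12% of 3,327 records is roughly 0.12 × 3327 ≈ 400 default events, i.e. about 400 / 20 = 20 events per independent variable, which is above the common 10-events-per-variable rule of thumb, if that rule even applies here.)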
Thanks!
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split ## cross_validation was removed in newer sklearn
import statsmodels.api as sm
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
#open the file
data = pd.read_csv(r"log reg test Lending club 2007-2011 and 2014 car only no dummy trap.csv")
print(data.shape)
##print(list(data.columns))
print(data['Distressed'].value_counts()) ## check the counts of defaulted vs non-defaulted car loans
sns.countplot(x='Distressed', data=data, palette='hls')
plt.show() ## confirm dependent variable is binary
##basic numerical analysis of variables to check feasibility for model
## we will need to create dummy variables for strings
#print(data.groupby('Distressed').mean()) ##numerical variable means
#print(data.groupby('grade').mean()) ## string variable means
#print(data.groupby('sub_grade').mean())
#print(data.groupby('emp_length').mean())
#print(data.groupby('home_ownership').mean())
##testing for nulls in dataset
print('Table showing number of missing data points per column', data.isnull().sum())
scrub_data = data.drop(['mths_since_last_delinq'], axis=1) ## this variable is not statistically significant
print('Here is the sample showing no missing data')
print(scrub_data.isnull().sum()) ## column with missing info removed, sample still sufficiently large
#scrub_data['intercept']=0
print(list(scrub_data.columns))
print(scrub_data.head())
## categorical variables were already converted to dummies in the csv file
## Agrade and Own dummies removed to avoid dummy variable trap and are treated as the base case here
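## (alternative I considered, instead of encoding in the csv: build the dummies in pandas;
## drop_first=True drops one level per variable to avoid the dummy trap --
## the column names here are illustrative)
#scrub_data = pd.get_dummies(scrub_data, columns=['grade', 'home_ownership'], drop_first=True)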
X = scrub_data.iloc[:, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 22]].values ## .ix was removed in newer pandas
y = scrub_data.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.3, random_state=0)
print('Here are the X components', X)
print('Here are the y components', y)
print('Here are the X values of the training', X_train)
print('Here are the y train values', y_train)
print('Here are the y test values', y_test)
model=LogisticRegression()
model.fit(X_train,y_train) ##Model is learning the relationship between X_train and y_train
y1_pred=model.predict(X_train)
print('y predict of train data', y1_pred)
print('Here is the Model Score', model.score(X_train,y_train)) ##check accuracy of training set
print('What percentage defaulted', y_train.mean()) ##what percentage defaulted
print('What percentage of test set defaulted', y_test.mean()) ##what percentage defaulted
print('X test values', X_test) ## check test subset values
y_pred=model.predict(X_test)
probs=model.predict_proba(X_test)
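(A standard check along these lines would make the all-zeros behaviour explicit; confusion_matrix and classification_report are the usual sklearn.metrics calls, nothing exotic:)

from sklearn.metrics import confusion_matrix, classification_report
print('Confusion matrix (rows = actual 0/1, columns = predicted 0/1):')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred)) ## per-class precision and recall
print('Largest P(y=1) on the test set:', probs[:, 1].max())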