I am trying to implement a Keras regression model on a dataset for learning purposes.
I took the data from the Kaggle Loan Default Prediction challenge, and I am trying to predict whether a person will default on a loan.
The target column is imbalanced: most observations have a value of "0".
I have tried the following to overcome this imbalance: (a) downsampling the majority class, (b) upsampling the minority class, and (c) using the SMOTE algorithm. None of these seems to help, and the model's predictions still lean entirely towards "0", since "0" is the majority class in the dataset. I used the resample method from sklearn to perform the downsampling and upsampling.
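For reference, the resampling I tried looked roughly like the sketch below (assuming a pandas DataFrame df with the target in its loss column, sklearn's resample, and imbalanced-learn's SMOTE; the variable names are illustrative, not my exact script):

import pandas as pd
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

# df stands for the raw training DataFrame with the 'loss' target column
majority = df[df.loss == 0]
minority = df[df.loss != 0]

# (a) Downsample the majority class to the size of the minority class
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
train_down = pd.concat([majority_down, minority])

# (b) Upsample the minority class to the size of the majority class
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
train_up = pd.concat([majority, minority_up])

# (c) SMOTE on the (already imputed) feature matrix with a binary target
X = df[['f527', 'f528', 'f271']]
y = (df.loss != 0).astype(int)
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)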
What different approaches can I try to overcome this problem, get good accuracy from my model on this data, and obtain realistic predictions from it? I am sharing my code below.
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import L1L2
import pandas
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
import statsmodels.api as sm
from sklearn import preprocessing as pre
train = pandas.read_csv('/train_v2.csv/train_v2.csv')
# Defining the target column
train_loss = train.loss
# Defining the features for the model
train = train[['f527','f528','f271']]
# Defining the imputer function
imp = SimpleImputer()
# Fitting the imputation function to the training dataset
imp.fit(train)
train = imp.transform(train)
# Standardizing the features
train = pre.StandardScaler().fit_transform(train)
# Splitting the data into Training and Testing samples
X_train, X_test, y_train, y_test = train_test_split(
    train, train_loss, test_size=0.3, random_state=42)
# Neural network with L1 and L2 weight regularization on the first layer
reg = L1L2(l1=0.01, l2=0.01)
model = Sequential()
model.add(Dense(13, kernel_initializer='normal', activation='relu',
                kernel_regularizer=reg, input_dim=X_train.shape[1]))
model.add(Dense(6, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))
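And this is roughly how I check that the predictions all lean towards "0" (a sketch; thresholding the continuous output at 0.5 and binarizing the target for ROC AUC are my own simplifications):

# Inspect the test-set predictions (the 0.5 threshold is illustrative)
preds = model.predict(X_test).ravel()
print("Share of predictions above 0.5:", np.mean(preds > 0.5))
# ROC AUC against a binary "did default" indicator
print("ROC AUC:", roc_auc_score((y_test != 0).astype(int), preds))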