Handling overfitting on an imbalanced dataset

Time: 2019-08-16 15:05:05

Tags: python machine-learning keras neural-network

I have an imbalanced dataset (only 0.06% of the samples are labelled 1, the rest are labelled 0). From what I have read, I need to resample the data, so I used the imblearn package to randomly undersample (RandomUnderSampler) my dataset. I then built a neural network with Keras Sequential. During training the F1 score climbs to about 75% (results at epoch 1000: loss: 0.5691 - acc: 0.7543 - f1_m: 0.7525 - precision_m: 0.7582 - recall_m: 0.7472), but on the test set the results are disappointing (loss: 55.35181%, acc: 79.25248%, f1_m: 0.39789%, precision_m: 0.23259%, recall_m: 1.54982%).

My assumption is that because the training set contains equal numbers of 1s and 0s after undersampling, the class weights are both set to 1, so the network pays hardly any extra cost for misclassified 1s.
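
To illustrate this, here is a minimal sketch with made-up label counts (not my real data) showing what compute_class_weight('balanced') returns before and after undersampling:

import numpy as np
from sklearn.utils import class_weight

# hypothetical label arrays, roughly mimicking a 0.06% positive rate
y_imbalanced = np.array([0] * 9994 + [1] * 6)
y_balanced = np.array([0] * 985 + [1] * 985)   # after RandomUnderSampler

# on the raw data the minority class gets a very large weight...
print(class_weight.compute_class_weight('balanced', classes=np.array([0, 1]), y=y_imbalanced))
# -> approximately [0.5, 833.33]

# ...but on the undersampled data both classes get weight 1, as in the training log below
print(class_weight.compute_class_weight('balanced', classes=np.array([0, 1]), y=y_balanced))
# -> [1. 1.]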

I have tried techniques such as reducing the number of layers, reducing the number of neurons, and adding regularization and dropout, but the F1 score on the test set never gets above 0.5%. What should I do? Thanks.

My neural network:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import regularizers
from sklearn.utils import class_weight

def neural_network(X, y, epochs_count=3, handle_overfit=False):
    # create model
    model = Sequential()
    # input_dim is taken from the X_test DataFrame defined outside this function,
    # since the undersampled X passed in is a plain numpy array without .columns
    model.add(Dense(12, input_dim=len(X_test.columns), activation='relu'))
    if (handle_overfit):
        model.add(Dropout(rate = 0.5))
    model.add(Dense(8, activation='relu', kernel_regularizer=regularizers.l1(0.1)))
    if (handle_overfit):
        model.add(Dropout(rate = 0.1))
    model.add(Dense(1, activation='sigmoid'))

    # compile the model
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['acc', f1_m, precision_m, recall_m])

    # compute balanced weights for classes '0' and '1' automatically
    class_weights = class_weight.compute_class_weight('balanced', [0, 1], y)
    print("---------------------- \n chosen class_weights are: ", class_weights, " \n ---------------------")

    # Fit the model
    model.fit(X, y, epochs=epochs_count, batch_size=512, class_weight=class_weights)

    return model
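
The custom metrics f1_m, precision_m and recall_m passed to metrics=[...] are batch-wise metrics built on the Keras backend. Their exact definitions are not shown here, but a minimal sketch of the usual implementation looks like this:

from keras import backend as K

def recall_m(y_true, y_pred):
    # true positives / actual positives, computed per batch
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())

def precision_m(y_true, y_pred):
    # true positives / predicted positives, computed per batch
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1_m(y_true, y_pred):
    # harmonic mean of the batch-wise precision and recall
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2 * (precision * recall) / (precision + recall + K.epsilon())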

Defining the train and test sets:

from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

train_set, test_set = train_test_split(data, test_size=0.35, random_state=0)

X_train = train_set[['..... some columns ....']]
y_train = train_set[['success']]

print('Initial dataset shape: ', X_train.shape)
rus = RandomUnderSampler(random_state=42)
X_undersampled, y_undersampled = rus.fit_sample(X_train, y_train) 
print('undersampled dataset shape: ', X_undersampled.shape)

The output is:

Initial dataset shape:  (1625843, 11)
undersampled dataset shape:  (1970, 11)
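
X_test and y_test are not shown above; they come from test_set in the same way (columns elided as before), roughly:

X_test = test_set[['..... some columns ....']]
y_test = test_set[['success']]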

And finally, the call to the neural network:

print (X_undersampled.shape, y_undersampled.shape)
print (X_test.shape, y_test.shape)

model = neural_network(X_undersampled, y_undersampled, 1000, handle_overfit=True)

# evaluate the model
print("\n---------------\nEvaluated on test set:")

scores = model.evaluate(X_test, y_test)
# note: this loop scales every value, including the loss, by 100 and prints it as a percentage
for i in range(len(model.metrics_names)):
    print("%s: %.5f%%" % (model.metrics_names[i], scores[i]*100))

The output is:

(1970, 11) (1970,)
(875454, 11) (875454, 1)
---------------------- 
 chosen class_weights are:  [1. 1.]  
 ---------------------
Epoch 1/1000
1970/1970 [==============================] - 4s 2ms/step - loss: 4.5034 - acc: 0.5147 - f1_m: 0.3703 - precision_m: 0.5291 - recall_m: 0.2859

.
.
.
.
Epoch 999/1000
1970/1970 [==============================] - 0s 6us/step - loss: 0.5705 - acc: 0.7538 - f1_m: 0.7471 - precision_m: 0.7668 - recall_m: 0.7296
Epoch 1000/1000
1970/1970 [==============================] - 0s 6us/step - loss: 0.5691 - acc: 0.7543 - f1_m: 0.7525 - precision_m: 0.7582 - recall_m: 0.7472

---------------
Evaluated on test set:
875454/875454 [==============================] - 49s 56us/step
loss: 55.35181%
acc: 79.25248%
f1_m: 0.39789%
precision_m: 0.23259%
recall_m: 1.54982%
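
For reference, the test-set numbers can be cross-checked with scikit-learn; a minimal sketch (assuming the model, X_test and y_test above, and a 0.5 decision threshold):

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# threshold the sigmoid outputs at 0.5 to get hard class predictions
y_pred = (model.predict(X_test, batch_size=512) > 0.5).astype(int).ravel()
y_true = np.ravel(y_test)

print("f1:       ", f1_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))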

0 Answers

No answers yet.