Question

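I have an imbalanced dataset (only 0.06% of the data is labeled 1; the rest is labeled 0). From what I have read, I need to resample the data, so I used the imblearn package's RandomUnderSampler to undersample my dataset. I then built a neural network with Keras Sequential. During training, the F1 score rises to about 75% (result at epoch 1000: loss: 0.5691 - acc: 0.7543 - f1_m: 0.7525 - precision_m: 0.7582 - recall_m: 0.7472), but the results on the test set are disappointing (loss: 55.35181%, acc: 79.25248%, f1_m: 0.39789%, precision_m: 0.23259%, recall_m: 1.54982%).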
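As a rough sanity check on these numbers, the sketch below estimates what precision to expect when a classifier with the training-set recall and false-positive rate is applied to data with only 0.06% positives. It assumes, hypothetically, that both rates carry over unchanged to the test set, and it derives the false-positive rate from the reported accuracy and recall using the fact that the undersampled training set is balanced. The function name `precision_at_prevalence` is made up for illustration.

```python
def precision_at_prevalence(recall, false_positive_rate, prevalence):
    """Precision implied by a given recall, false-positive rate and share of positives."""
    true_positives = recall * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)

# Figures reported for the last training epoch on the balanced (undersampled) set.
train_acc = 0.7543
train_recall = 0.7472

# On a balanced set, accuracy is the mean of recall and specificity.
specificity = 2 * train_acc - train_recall
false_positive_rate = 1.0 - specificity

# Precision on balanced data versus data with the original 0.06% positive rate.
balanced_precision = precision_at_prevalence(train_recall, false_positive_rate, 0.5)
imbalanced_precision = precision_at_prevalence(train_recall, false_positive_rate, 0.0006)
print(balanced_precision, imbalanced_precision)
```

Under these assumptions the balanced figure lands close to the reported training precision, while the imbalanced one drops below 1%, so a large part of the gap may come from the change in class proportions rather than from overfitting alone.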

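My assumption is that, because the training set contains the same number of 1s and 0s, the class_weights are both set to 1, so the network is not penalized much for wrongly predicting 1.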
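The effect described above can be reproduced without Keras. The sketch below reimplements the 'balanced' heuristic that scikit-learn documents for compute_class_weight, namely n_samples / (n_classes * count of each class), in plain Python. The helper name `balanced_class_weights` and the 10,000-row sample are hypothetical; the 985 samples per class are assumed from the 1970 undersampled rows being split evenly.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight of class c = n_samples / (n_classes * number of samples in c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Undersampled training set: assumed 985 rows per class (1970 rows in total).
undersampled_labels = [0] * 985 + [1] * 985
undersampled_weights = balanced_class_weights(undersampled_labels)
print(undersampled_weights)  # both classes get weight 1.0

# Hypothetical sample with the original 0.06% positive rate.
original_like_labels = [0] * 9994 + [1] * 6
original_like_weights = balanced_class_weights(original_like_labels)
print(original_like_weights)  # class 1 is weighted far more heavily than class 0
```

So after undersampling, the 'balanced' setting no longer changes the loss at all; it only has an effect when it is computed on labels that are still imbalanced.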

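I tried several techniques, such as reducing the number of layers, reducing the number of neurons, and using regularization and dropout, but the test-set F1 score never exceeds 0.5%. What should I do? Thanks.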

My neural network:

import numpy as np
from keras import regularizers
from keras.layers import Dense, Dropout
from keras.models import Sequential
from sklearn.utils import class_weight

# f1_m, precision_m and recall_m are custom metric functions defined elsewhere (not shown here).

def neural_network(X, y, epochs_count=3, handle_overfit=False):
    # create model; the input width is taken from the data passed in, not from the global X_test
    model = Sequential()
    model.add(Dense(12, input_dim=X.shape[1], activation='relu'))
    if handle_overfit:
        model.add(Dropout(rate=0.5))
    model.add(Dense(8, activation='relu', kernel_regularizer=regularizers.l1(0.1)))
    if handle_overfit:
        model.add(Dropout(rate=0.1))
    model.add(Dense(1, activation='sigmoid'))

    # compile the model
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['acc', f1_m, precision_m, recall_m])

    # set the weights of classes 0 and 1 automatically from the label frequencies
    class_weights = class_weight.compute_class_weight(
        'balanced', classes=np.array([0, 1]), y=np.ravel(y))
    print("---------------------- \n chosen class_wieghts are: ", class_weights, " \n ---------------------")

    # Fit the model; Keras expects class_weight as a {class_index: weight} dict
    model.fit(X, y, epochs=epochs_count, batch_size=512,
              class_weight=dict(enumerate(class_weights)))

    return model

Defining the training and test sets:

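from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(data, test_size=0.35, random_state=0)

X_train = train_set[['..... some columns ....']]
y_train = train_set[['success']]
# assumption: the test split is built from the same columns (its definition was not shown)
X_test = test_set[['..... some columns ....']]
y_test = test_set[['success']]

print('Initial dataset shape: ', X_train.shape)
rus = RandomUnderSampler(random_state=42)
# fit_resample is the current name of fit_sample, which newer imbalanced-learn releases removed
X_undersampled, y_undersampled = rus.fit_resample(X_train, y_train)
print('undersampled dataset shape: ', X_undersampled.shape)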

The result is:

Initial dataset shape:  (1625843, 11)
undersampled dataset shape:  (1970, 11)

Finally, the call to the neural network:

print (X_undersampled.shape, y_undersampled.shape)
print (X_test.shape, y_test.shape)

model = neural_network(X_undersampled, y_undersampled, 1000, handle_overfit=True)

# evaluate the model
print("\n---------------\nEvaluated on test set:")

scores = model.evaluate(X_test, y_test)
for i in range(len(model.metrics_names)):
    print("%s: %.5f%%" % (model.metrics_names[i], scores[i]*100))

The result is:

(1970, 11) (1970,)
(875454, 11) (875454, 1)
---------------------- 
 chosen class_wieghts are:  [1. 1.]  
 ---------------------
Epoch 1/1000
1970/1970 [==============================] - 4s 2ms/step - loss: 4.5034 - acc: 0.5147 - f1_m: 0.3703 - precision_m: 0.5291 - recall_m: 0.2859

.
.
.
.
Epoch 999/1000
1970/1970 [==============================] - 0s 6us/step - loss: 0.5705 - acc: 0.7538 - f1_m: 0.7471 - precision_m: 0.7668 - recall_m: 0.7296
Epoch 1000/1000
1970/1970 [==============================] - 0s 6us/step - loss: 0.5691 - acc: 0.7543 - f1_m: 0.7525 - precision_m: 0.7582 - recall_m: 0.7472

---------------
Evaluated on test set:
875454/875454 [==============================] - 49s 56us/step
loss: 55.35181%
acc: 79.25248%
f1_m: 0.39789%
precision_m: 0.23259%
recall_m: 1.54982%
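
One thing that may be worth ruling out when reading these test figures: the definitions of f1_m, precision_m and recall_m are not shown, and if they are computed per batch and then averaged (an assumption), batches that contain no positive samples would each contribute a score of 0. With only 0.06% positives, most evaluation batches would fall into that case. The toy sketch below, in plain Python with made-up data and hypothetical helper names, shows how far a batch-averaged recall can drift from the recall over the whole set.

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN); defined as 0.0 when there are no positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def batch_averaged_recall(y_true, y_pred, batch_size):
    """Plain mean of per-batch recall values (assumed behaviour of a batch-wise metric)."""
    scores = [
        recall(y_true[i:i + batch_size], y_pred[i:i + batch_size])
        for i in range(0, len(y_true), batch_size)
    ]
    return sum(scores) / len(scores)

# Toy data: 1000 samples, 2 positives, and a classifier that gets every label right.
y_true = [0] * 1000
y_true[5] = 1
y_true[505] = 1
y_pred = list(y_true)

whole_set_recall = recall(y_true, y_pred)
per_batch_recall = batch_averaged_recall(y_true, y_pred, batch_size=100)
print(whole_set_recall, per_batch_recall)
```

Here a perfect classifier scores 1.0 over the whole set but only 0.2 when averaged over ten batches, eight of which hold no positives. For the real model, thresholding the output of model.predict(X_test) and passing it to sklearn.metrics.precision_recall_fscore_support would give whole-set figures to compare against.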

Handling overfitting on an imbalanced dataset

0 answers: