我有一个不平衡的数据集(只有0.06%的数据标记为1,其余的标记为0)。正如我研究的那样,我必须对数据进行重新采样,因此我使用imblearn
包来randomUnserSample
我的数据集。然后,我使用Keras Sequential
创建了一个神经网络。在训练期间,F1Score
会增加到75%左右(第1000个时期的结果是:损失:0.5691-acc:0.7543-f1_m:0.7525-precision_m:0.7582-召回率m:0.7472),但是对于测试集,结果令人失望(损失:55.35181%,acc:79.25248%,f1_m:0.39789%,precision_m:0.23259%,recall_m:1.54982%)。
我假设是在火车上,因为1和0的数目相同,因此class_wights都设置为1,因此对于错误的1s预测,网络不会花费太多。
我使用了一些技术,例如减少层数,减少神经元数量,使用正则化和辍学,但是测试集f1Score
从未超过0.5%。我该怎么办。谢谢
我的神经网络:
def neural_network(X, y, epochs_count=3, handle_overfit=False):
# create model
model = Sequential()
model.add(Dense(12, input_dim=len(X_test.columns), activation='relu'))
if (handle_overfit):
model.add(Dropout(rate = 0.5))
model.add(Dense(8, activation='relu', kernel_regularizer=regularizers.l1(0.1)))
if (handle_overfit):
model.add(Dropout(rate = 0.1))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['acc', f1_m, precision_m, recall_m])
# change weights of the classes '0' and '1' and set weights automatically
class_weights = class_weight.compute_class_weight('balanced', [0, 1], y)
print("---------------------- \n chosen class_wieghts are: ", class_weights, " \n ---------------------")
# Fit the model
model.fit(X, y, epochs=epochs_count, batch_size=512, class_weight=class_weights)
return model
定义训练和测试集:
vtrain_set, test_set = train_test_split(data, test_size=0.35, random_state=0)
X_train = train_set[['..... some columns ....']]
y_train = train_set[['success']]
print('Initial dataset shape: ', X_train.shape)
rus = RandomUnderSampler(random_state=42)
X_undersampled, y_undersampled = rus.fit_sample(X_train, y_train)
print('undersampled dataset shape: ', X_undersampled.shape)
结果是:
Initial dataset shape: (1625843, 11)
undersampled dataset shape: (1970, 11)
最后是神经网络调用:
print (X_undersampled.shape, y_undersampled.shape)
print (X_test.shape, y_test.shape)
model = neural_network(X_undersampled, y_undersampled, 1000, handle_overfit=True)
# evaluate the model
print("\n---------------\nEvaluated on test set:")
scores = model.evaluate(X_test, y_test)
for i in range(len(model.metrics_names)):
print("%s: %.5f%%" % (model.metrics_names[i], scores[i]*100))
结果是:
(1970, 11) (1970,)
(875454, 11) (875454, 1)
----------------------
chosen class_wieghts are: [1. 1.]
---------------------
Epoch 1/1000
1970/1970 [==============================] - 4s 2ms/step - loss: 4.5034 - acc: 0.5147 - f1_m: 0.3703 - precision_m: 0.5291 - recall_m: 0.2859
.
.
.
.
Epoch 999/1000
1970/1970 [==============================] - 0s 6us/step - loss: 0.5705 - acc: 0.7538 - f1_m: 0.7471 - precision_m: 0.7668 - recall_m: 0.7296
Epoch 1000/1000
1970/1970 [==============================] - 0s 6us/step - loss: 0.5691 - acc: 0.7543 - f1_m: 0.7525 - precision_m: 0.7582 - recall_m: 0.7472
---------------
Evaluated on test set:
875454/875454 [==============================] - 49s 56us/step
loss: 55.35181%
acc: 79.25248%
f1_m: 0.39789%
precision_m: 0.23259%
recall_m: 1.54982%