I'm trying to train a DNN with Keras. I have a dataset of 4000 rows, and each row belongs to one programmer.
The features include: skills, study hours, certificates, number of coworkers, ..., and there is a salary column, which is my target.
I've trained multiple DNNs, and in most cases the training accuracy gets close to 95% or above, but validation accuracy is the trouble: it never goes beyond ~40%, which I think is a problem for my project.
To raise the validation accuracy and reduce overfitting, I tried shrinking the input by cutting off some irrelevant features (from 400 down to ~50), and I also trained some DNNs with dropout layers. These measures improved validation accuracy a bit (47% at best), but the results are still not satisfying.
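For reference, the feature pruning was along these lines (a simplified sketch, not my exact pipeline; SelectKBest with mutual_info_classif and k=50 stand in for my actual selection step):

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# train_x: (4000, 400) feature matrix, train_y: (4000, 8) one-hot targets
labels = train_y.argmax(axis=1)              # SelectKBest needs integer class labels
selector = SelectKBest(mutual_info_classif, k=50)
train_x_small = selector.fit_transform(train_x, labels)
test_x_small = selector.transform(test_x)    # keep the same columns for the test set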
The input is a numpy array that looks like this:
array([4, 1, 3, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0])
The first five indices are: "age", "job_status", "education", "collaborators", "experience"; these are label-encoded data.
The rest are: "project_language", "work_field", "workplace_type"; these can take multiple values, so I one-hot encoded them (one-hot is for categorical data, right?).
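For example (a minimal sketch; the category values here are made up):

from keras.utils import to_categorical

# hypothetical values for "project_language": 0=python, 1=java, 2=c++
project_language = [0, 2, 1, 0]
encoded = to_categorical(project_language, num_classes=3)
# array([[1., 0., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.],
#        [1., 0., 0.]], dtype=float32)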
The target shape is (4000, 8); each row is a one-hot vector:
array([0, 0, 0, 0, 1, 0, 0, 0])
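The eight classes come from bucketing the salary column, roughly like this (a sketch with hypothetical bucket edges, not my real ones):

import numpy as np
from keras.utils import to_categorical

edges = [30000, 40000, 50000, 60000, 70000, 85000, 100000]  # 7 edges -> 8 buckets
salary = np.array([28000, 52000, 120000])
classes = np.digitize(salary, edges)         # integers in 0..7
targets = to_categorical(classes, num_classes=8)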
So I trained about 144 DNNs (2 activations × 1 loss × 2 optimizers × 3 dropout settings × 12 architectures = 144 combinations), and the best result was 47% validation accuracy. These are the hyperparameters I searched over:
activation_list = ['relu', 'tanh']
loss_list = ['categorical_crossentropy']
optimizer_list = ['rmsprop', 'adam']
drop_list = [None, 0.25, 0.5]
nn_arc_list = [
    (32, 16),
    (128, 64),
    (256, 64),
    (128, 32, 16),
    (1024, 256, 64),
    (1024, 256, 64, 16),
    (1024, 512, 128, 64),
    (2048, 1024, 512, 256, 128, 64, 32, 16),
    (4096, 1024, 256, 128, 64, 32, 16, 8),
    (128, 64, 32, 32, 16, 16, 8),
    (1024, 1024, 512, 512, 64, 64),
    (128, 128, 64, 64, 64, 64),
]
What can I do to improve the validation accuracy?
My train.py code:
import pickle
from itertools import product

from keras.models import Sequential
from keras.layers import Dense, Dropout

# load training and test data
with open('./ml_dataset.pickle', 'rb') as my_file:
    (train_x, train_y), (test_x, test_y) = pickle.load(my_file)


def create_nn(neurons_architecture, activation, optimizer, loss, dropout):
    model = Sequential()
    input_shape = train_x.shape[1:]

    # add the first layer separately because we want to define
    # the input shape on it
    model.add(Dense(neurons_architecture[0],
                    activation=activation,
                    input_shape=input_shape))
    if dropout is not None:
        model.add(Dropout(dropout))

    # add the rest of the layers
    for neurons in neurons_architecture[1:]:
        model.add(Dense(neurons, activation=activation))
        if dropout is not None:
            model.add(Dropout(dropout))

    # add the last layer and compile
    model.add(Dense(8, activation='softmax'))
    model.compile(optimizer=optimizer,
                  loss=loss,
                  metrics=['accuracy'])
    return model
activation_list = ['relu', 'tanh']
loss_list = ['categorical_crossentropy']
optimizer_list = ['rmsprop', 'adam']
drop_list = [None, 0.25, 0.5]
nn_arc_list = [
    (32, 16),
    (128, 64),
    (256, 64),
    (128, 32, 16),
    (1024, 256, 64),
    (1024, 256, 64, 16),
    (1024, 512, 128, 64),
    (2048, 1024, 512, 256, 128, 64, 32, 16),
    (4096, 1024, 256, 128, 64, 32, 16, 8),
    (128, 64, 32, 32, 16, 16, 8),
    (1024, 1024, 512, 512, 64, 64),
    (128, 128, 64, 64, 64, 64),
]
all_states = product(nn_arc_list, activation_list, loss_list, optimizer_list, drop_list)

for state in all_states:
    neuron_arc, activation, loss = state[0], state[1], state[2]
    optimizer, dropout = state[3], state[4]
    model = create_nn(neuron_arc, activation, optimizer, loss, dropout)
    history = model.fit(train_x, train_y, epochs=number_of_epochs,
                        batch_size=256, validation_split=0.2,
                        shuffle=True, verbose=False)

    # I removed the code that generates file_name to reduce the code size
    with open(file_name, 'wb') as my_file:
        # dump history.history (a plain dict); the History object itself
        # holds a reference to the model and may not pickle cleanly
        pickle.dump(history.history, my_file)