How can I increase validation accuracy on a specific dataset?

Time: 2019-07-04 14:50:37

Tags: python keras deep-learning

I am trying to train a DNN with Keras. I have a dataset of 4000 rows, where each row belongs to one programmer.

The features include: skills, learning time, certificates, number of co-workers, ..., and there is a salary column, which is my target.

I have tried training several DNNs, and in most cases the training accuracy gets close to 95% or above, but the validation accuracy is the problem: it never goes above ~40%, which I think is far too low for my project.

To increase the validation accuracy and reduce overfitting, I tried shrinking the input by cutting out some irrelevant features (from 400 down to ~50). I also trained some DNNs with dropout layers. These measures improved the validation accuracy somewhat (best: 47%), but the results are still not satisfying.
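
For illustration, here is a minimal sketch of one way such a feature cut could be automated with scikit-learn's univariate selection; SelectKBest with f_classif and k=50 are my assumptions, not something stated in the post, and train_labels stands for the integer (pre-one-hot) class labels:

# illustrative only: keep the 50 features that score highest against the target
from sklearn.feature_selection import SelectKBest, f_classif

# train_x: (n_samples, 400) feature matrix
# train_labels: integer class labels 0..7 (not the one-hot vectors)
selector = SelectKBest(score_func=f_classif, k=50)
train_x_reduced = selector.fit_transform(train_x, train_labels)  # (n_samples, 50)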

The input is a numpy array that looks like this:

array([4, 1, 3, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0])

The first five indices are 'age', 'work status', 'education level', 'collaborators' and 'experience'; all of these are encoded data.

The rest are 'project_language', 'work_field' and 'workplace_type'; these take multiple values each, so I one-hot encoded them (one-hot categorical).
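
(For reference, a minimal sketch of such an encoding with pandas; only the column names come from the post, the values are made up:)

import pandas as pd

# hypothetical raw values; only the column names appear in the post
df = pd.DataFrame({'project_language': ['python', 'java', 'python'],
                   'work_field': ['web', 'ml', 'web'],
                   'workplace_type': ['remote', 'office', 'office']})

# one 0/1 column per distinct value (one-hot / dummy encoding)
encoded = pd.get_dummies(df, columns=['project_language', 'work_field', 'workplace_type'])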

The target shape is (4000, 8), and it is a one-hot vector:

array([0, 0, 0, 0, 1, 0, 0, 0])
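
(Such a target can be produced with Keras's to_categorical, assuming the salary column was first binned into 8 integer classes; the binning step itself is not shown here:)

from keras.utils import to_categorical

# salary_class: one integer in 0..7 per row, e.g. from binning the salaries
train_y = to_categorical(salary_class, num_classes=8)  # shape: (4000, 8)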

So I trained about 144 DNNs (12 architectures × 2 activations × 1 loss × 2 optimizers × 3 dropout settings; the full hyperparameter grid is listed in the code below), and the best result was 47% validation accuracy.

What can I do to increase the validation accuracy?

My train.py code:

# imports needed to run this script
import pickle
from itertools import product

from keras.models import Sequential
from keras.layers import Dense, Dropout

# load training and test data
with open('./ml_dataset.pickle', 'rb') as my_file:
    (train_x, train_y), (test_x, test_y) = pickle.load(my_file)

# placeholder: the actual value is not shown in the question
number_of_epochs = 100


def create_nn(neurons_architecture, activation, optimizer, loss, dropout):
    model = Sequential()
    input_shape = train_x.shape[1:]

    # the first layer is added separately because it needs the input shape
    model.add(Dense(neurons_architecture[0],
                    activation=activation,
                    input_shape=input_shape))
    if dropout is not None:
        model.add(Dropout(dropout))

    # add the rest of layers
    for neurons in neurons_architecture[1:]:
        model.add(Dense(neurons, activation=activation))
        if dropout is not None:
            model.add(Dropout(dropout))

    # add the last layer and compile
    model.add(Dense(8, activation='softmax'))
    model.compile(optimizer=optimizer,
                  loss=loss,
                  metrics=['accuracy'])
    return model


activation_list = ['relu', 'tanh']
loss_list = ['categorical_crossentropy']
optimizer_list = ['rmsprop', 'adam']
drop_list = [None, 0.25, 0.5]
nn_arc_list = [
    (32, 16),
    (128, 64),
    (256, 64),
    (128, 32, 16),
    (1024, 256, 64),
    (1024, 256, 64, 16),
    (1024, 512, 128, 64),
    (2048, 1024, 512, 256, 128, 64, 32, 16),
    (4096, 1024, 256, 128, 64, 32, 16, 8),
    (128, 64, 32, 32, 16, 16, 8),
    (1024, 1024, 512, 512, 64, 64),
    (128, 128, 64, 64, 64, 64),
    ]

all_states = product(nn_arc_list, activation_list, loss_list, optimizer_list, drop_list)

for neuron_arc, activation, loss, optimizer, dropout in all_states:
    model = create_nn(neuron_arc, activation, optimizer, loss, dropout)

    history = model.fit(train_x, train_y, epochs=number_of_epochs,
                        batch_size=256, validation_split=0.2,
                        shuffle=True, verbose=False)

    # I removed the code that generates file_name to reduce the code size
    with open(file_name, 'wb') as my_file:
        # pickle the plain history dict; the History object itself holds a
        # reference to the model and typically cannot be pickled
        pickle.dump(history.history, my_file)
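
As a usage note, the pickled history dicts can later be compared to find the best run. The sketch below assumes the files written by the loop above are collected in history_files (the file_name logic was omitted from the question):

import pickle

history_files = []  # fill with the files written by the training loop above

best_file, best_val_acc = None, 0.0
for fname in history_files:
    with open(fname, 'rb') as f:
        hist = pickle.load(f)  # dict of lists, e.g. hist['val_acc']
    run_best = max(hist['val_acc'])  # key is 'val_accuracy' in newer Keras
    if run_best > best_val_acc:
        best_file, best_val_acc = fname, run_best

print(best_file, best_val_acc)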

0 Answers