Why won't my trivial LSTM overfit?

Time: 2017-12-19 06:10:37

Tags: machine-learning tensorflow lstm rnn

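I built a very trivial LSTM to try to predict a short sequence, but it won't overfit and approach zero loss the way I expect.

Instead, the loss only converges to around 1.5, even though the network certainly has enough degrees of freedom to learn this sequence verbatim.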

import tensorflow as tf
import time

tf.logging.set_verbosity(tf.logging.DEBUG)

#
# Training data, just a single sequence
#
train_input = [[0, 1, 2, 3, 4, 5, 0, 6, 7, 0]]
train_output = [[1, 2, 3, 4, 5, 0, 6, 7, 8, 0]]

#
# Training metadata
#
batch_size = 1
sequence_length = 10
n_classes = 9

# Network size
rnn_cell_size = 10
rnn_layers = 2
embedding_rank = 3

#
# Training hyperparameters
#
epochs = 100
n_batches = 100
learning_rate = 0.01

#
# Model
#
features = tf.placeholder(tf.int32, [None, sequence_length], name="features")
embeddings = tf.Variable(tf.random_uniform([n_classes, embedding_rank], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, features)
cell = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.LSTMCell(rnn_cell_size) for i in range(rnn_layers)])
initial_state = cell.zero_state(batch_size, tf.float32)
cell, _ = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
# Convert sequences x batches x outputs to (sequences * batches) x outputs
flat_lstm_output = tf.reshape(cell, [-1, rnn_cell_size])
output = tf.contrib.layers.fully_connected(inputs=flat_lstm_output, num_outputs=n_classes)
softmax = tf.nn.softmax(output)

#
# Training
#
targets = tf.placeholder(tf.int32, [None, sequence_length])
# Convert sequences x batches x targets to (sequences * batches) x targets
flat_targets = tf.reshape(targets, [-1])
loss = tf.losses.sparse_softmax_cross_entropy(flat_targets, softmax)
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(epochs):
        loss_sum = 0
        epoch_start = time.time()
        for j in range(n_batches):
            _, step_loss = sess.run([train_op, loss], {
                    features: train_input,
                    targets: train_output,
            })
            loss_sum = loss_sum + step_loss
        print('avg_loss', loss_sum / n_batches, 'avg_time', (time.time() - epoch_start) / n_batches)

I feel like I'm missing something very basic here. What am I doing wrong?

Edit

I tried to simplify it further and am now down to the following simpler example (which does not converge either):

import tensorflow as tf
import time

tf.logging.set_verbosity(tf.logging.DEBUG)

#
# Training data, just a single sequence
#
train_input = [0, 1, 2, 3, 4]
train_output = [1, 2, 3, 4, 5]

#
# Training metadata
#
batch_size = 1
sequence_length = 5
n_classes = 6

#
# Training hyperparameters
#
epochs = 100
n_batches = 100
learning_rate = 0.01

#
# Model
#
features = tf.placeholder(tf.int32, [None])
one_hot = tf.contrib.layers.one_hot_encoding(features, n_classes)
output = tf.contrib.layers.fully_connected(inputs=one_hot, num_outputs=10)
output = tf.contrib.layers.fully_connected(inputs=output, num_outputs=n_classes)

#
# Training
#
targets = tf.placeholder(tf.int32, [None])
one_hot_targets = tf.one_hot(targets, depth=n_classes)
loss = tf.losses.softmax_cross_entropy(one_hot_targets, output)
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(epochs):
        loss_sum = 0
        epoch_start = time.time()
        for j in range(n_batches):
            _, step_loss = sess.run([train_op, loss], {
                    features: train_input,
                    targets: train_output,
            })
            loss_sum = loss_sum + step_loss
        print('avg_loss', loss_sum / n_batches, 'avg_time', (time.time() - epoch_start) / n_batches)

3 answers:

Answer 0 (score: 0)

Have you tried lower values for the learning rate (for example, 0.001 or 0.0001)?

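Answer 1 (score: 0)

Your network is not fitting (let alone overfitting) because you don't have enough data. The LSTM has a single sequence, and the MLP has 5 data points.

Compare that with the number of parameters you need to estimate: your MLP has 120 weights (if I counted correctly), or 136 parameters once the biases are included. Unless you are very lucky, you cannot estimate all of them from 5 data points. (Splitting your sequence into smaller batches makes convergence somewhat more likely, but even then it will not converge often.)

In short, neural networks need a fair amount of data to be useful.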
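The parameter count in this answer can be checked with a few lines of plain Python. This is only a sketch: the helper name `dense_param_count` is made up for illustration, and the layer sizes (6 one-hot inputs, 10 hidden units, 6 outputs) are read off the simplified model in the question, assuming each fully connected layer also carries a bias vector.

```python
def dense_param_count(n_inputs, n_outputs, use_bias=True):
    """Number of trainable parameters in one fully connected layer."""
    weights = n_inputs * n_outputs
    biases = n_outputs if use_bias else 0
    return weights + biases


# Layer sizes of the simplified model in the question:
# one-hot input (6) -> hidden layer (10) -> output layer (6).
layer_sizes = [6, 10, 6]
layer_pairs = list(zip(layer_sizes, layer_sizes[1:]))

weights_only = sum(dense_param_count(a, b, use_bias=False) for a, b in layer_pairs)
with_biases = sum(dense_param_count(a, b) for a, b in layer_pairs)

print("weights only:", weights_only)  # 60 + 60 = 120
print("with biases:", with_biases)    # 120 + 10 + 6 = 136
```

Either way, the model has far more parameters than the 5 training points, which is the comparison this answer is making.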

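Answer 2 (score: 0)

The answer is threefold.

1) The example without the RNN converges if I replace the default activation of the fully connected layers (relu) with tanh.

This seems to be because relu ignores a lot of the input (everything below zero) and provides no gradient for it at all. With more inputs it would probably work.

2) The example with the RNN needs the activation on the final fully connected layer (the one before the softmax) removed entirely by passing activation_fn=None. With an activation on the fully connected layer in front of the softmax, it converges poorly, or in most configurations not at all.

3) The RNN example also needs the explicit softmax removed, because sparse_softmax_cross_entropy already applies softmax.

Final working code: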
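Point 3 can be illustrated without TensorFlow. The sketch below uses hand-written `softmax` and `cross_entropy_from_logits` helpers (illustrative stand-ins, not the TensorFlow functions) to show what happens when probabilities are passed where logits are expected. Because probabilities lie in [0, 1], the second softmax can never become confident, so the loss has a floor of log(1 + 8/e) for 9 classes. That floor is at least consistent with the loss stalling around 1.5 in the question; the sketch ignores the relu on the final layer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target):
    """What a softmax cross-entropy loss computes: -log(softmax(logits)[target])."""
    return -math.log(softmax(logits)[target])


n_classes = 9
target = 0

# A very confident, correct prediction expressed as raw logits.
logits = [20.0] + [0.0] * (n_classes - 1)

# Correct usage: raw logits go into the loss.
loss_correct = cross_entropy_from_logits(logits, target)

# Bug from the question: softmax output is passed in as if it were logits.
loss_double = cross_entropy_from_logits(softmax(logits), target)

# Best case for the buggy version: a perfect one-hot probability vector.
floor = cross_entropy_from_logits([1.0] + [0.0] * (n_classes - 1), target)

print("logits into loss:       ", loss_correct)  # very close to 0
print("probabilities into loss:", loss_double)   # stuck just above the floor
print("floor, log(1 + 8/e):    ", floor)         # about 1.37
```

However confident the network becomes, the doubled softmax keeps the loss above this floor, which is why removing the explicit softmax matters.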

import tensorflow as tf
import time

tf.logging.set_verbosity(tf.logging.DEBUG)

#
# Training data, just a single sequence
#
train_input = [[0, 1, 2, 3, 4, 5, 0, 6, 7, 0]]
train_output = [[1, 2, 3, 4, 5, 0, 6, 7, 8, 0]]

#
# Training metadata
#
batch_size = 1
sequence_length = 10
n_classes = 9

# Network size
rnn_cell_size = 10
rnn_layers = 2
embedding_rank = 3

#
# Training hyperparameters
#
epochs = 100
n_batches = 100
learning_rate = 0.01

#
# Model
#
features = tf.placeholder(tf.int32, [None, sequence_length], name="features")
embeddings = tf.Variable(tf.random_uniform([n_classes, embedding_rank], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, features)
cell = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.LSTMCell(rnn_cell_size) for i in range(rnn_layers)])
initial_state = cell.zero_state(batch_size, tf.float32)
cell, _ = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
# Convert [batch_size, sequence_length, rnn_cell_size] to [(batch_size * sequence_length), rnn_cell_size]
flat_lstm_output = tf.reshape(cell, [-1, rnn_cell_size])
output = tf.contrib.layers.fully_connected(inputs=flat_lstm_output, num_outputs=n_classes, activation_fn=None)

#
# Training
#
targets = tf.placeholder(tf.int32, [None, sequence_length])
# Convert [batch_size, sequence_length] to [batch_size * sequence_length]
flat_targets = tf.reshape(targets, [-1])
loss = tf.losses.sparse_softmax_cross_entropy(flat_targets, output)
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(epochs):
        loss_sum = 0
        epoch_start = time.time()
        for j in range(n_batches):
            _, step_loss = sess.run([train_op, loss], {
                    features: train_input,
                    targets: train_output,
            })
            loss_sum = loss_sum + step_loss
        print('avg_loss', loss_sum / n_batches, 'avg_time', (time.time() - epoch_start) / n_batches)
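As a side note on point 1 of this answer, the difference between relu and tanh comes down to their derivatives. The following is a minimal sketch in plain Python with made-up pre-activation values; `relu_grad` and `tanh_grad` are illustrative helpers, not TensorFlow calls.

```python
import math


def relu_grad(x):
    """Derivative of relu: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which is positive for every finite x."""
    return 1.0 - math.tanh(x) ** 2


# Hypothetical pre-activation values of a hidden unit.
pre_activations = [-2.0, -0.5, 0.3, 1.5]

for x in pre_activations:
    print(f"x={x:+.1f}  relu'={relu_grad(x):.1f}  tanh'={tanh_grad(x):.4f}")
```

A relu unit whose pre-activation is negative passes back no gradient at all, so with only a handful of training inputs a unit can easily stay switched off for all of them. A tanh unit always passes back some gradient, which fits the observation that the small example converges once relu is replaced.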