Keras LSTM model for text generation

Date: 2019-01-21 11:25:35

Tags: python keras neural-network lstm

I'm a beginner with Keras and with writing neural network models in general. I'm trying to write an LSTM for text-generation purposes, without success. What am I doing wrong?

I read this question: here, and other articles, but there is something I can't get; sorry if I sound dumb.

The aim

My purpose is to generate English articles of a fixed length (1500 tokens for now).

Suppose I have a dataset of 20k records in sequences (basically articles). I set a fixed length for all of them (MAX_SEQUENCE_LENGTH = 1500) and tokenized them, getting a matrix (X, my training data) like:

[[   0    0    0 ...   88  664  206]
 [   0    0    0 ...    1   93  140]
 [   0    0    0 ...    3  173 2283]
 ...
 [  50 2761    4 ...  167  148  156]
 [   0    0    0 ...   10   77  206]
 [   0    0    0 ...  167  148  156]]

with a shape of 20000 x 1500.
The output of my LSTM should be a 1 x MAX_SEQUENCE_LENGTH array of tokens.
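For context, this kind of zero-padded token matrix is typically produced with the Keras tokenizer and pad_sequences; here is a minimal sketch (the texts list and the MAX_NB_WORDS value are assumptions for illustration, not taken from the original post):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_NB_WORDS = 10000       # assumed vocabulary cap
MAX_SEQUENCE_LENGTH = 1500

texts = ["first article ...", "second article ..."]  # hypothetical list of the 20k articles

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# pad_sequences pads with zeros on the left by default,
# which produces the leading 0s visible in the matrix above
X = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
word_index = tokenizer.word_index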

My model goes as follows:

def generator_model(sequence_input, embedded_sequences, output_shape):
    layer = LSTM(16,return_sequences = True)(embedded_sequences)
    layer = LSTM(32,return_sequences = True)(layer)
    layer = Flatten()(layer)
    output = Dense(output_shape, activation='softmax')(layer)
    generator = Model(sequence_input, output)
    return generator

with:
sequence_input = Input(batch_shape=(1, 1, 1500), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
output_shape = MAX_SEQUENCE_LENGTH

The LSTM should be trained with model.fit() on the training set X of shape 20k x MAX_SEQUENCE_LENGTH.

And it should give an array of tokens of shape 1 x MAX_SEQUENCE_LENGTH as output when I call model.predict(seed), with seed being a random noise array.

Compile, fit and predict

Some comments on the section below:

- generator.compile works; the model is given in the edit section at the tail of the post.
- generator.fit compiles; the epochs=1 parameter is just for testing and will become BATCH_NUM.
- Now I have some doubts about the y I give to generator.fit. For now I'm passing a matrix of 0s as the target output; it raises an error if I generate it with a shape different from X.shape[0], which means it wants one label for each record in X. But if I give it a matrix of 0s as the target, won't it only be able to predict arrays of 0s?
- The error is always the same whether I use generate_noise() or generate_integer_noise(), so I believe it simply doesn't like the y argument I'm giving to fit.

But actually I'm getting the following error:

ValueError: Error when checking input: expected input_1 to have shape (1500,) but got array with shape (1,)

when I call:

embedding_layer = load_embeddings(word_index)
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,))
embedded_sequences = embedding_layer(sequence_input)
generator = generator_model(sequence_input, embedded_sequences, X.shape[1])
print(generator.summary())
generator.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
Xnoise = generate_integer_noise(MAX_SEQUENCE_LENGTH)
y_shape = np.zeros((X.shape[0],), dtype=int)
generator.fit(X, y_shape, epochs=1)
acc = generator.predict(Xnoise, verbose=1)

The noise I give is built with Xnoise = generate_noise(samples_number=MAX_SEQUENCE_LENGTH) and passed to generator.predict(Xnoise, verbose=1); it should be a 1 x 1500 array, but the model seems to receive it as 1500 samples of shape (1,), so there must be some kind of error in the shape settings.
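For what it's worth, the error message suggests the noise reaches the model as a flat (1500,) vector, which Keras splits into 1500 samples of shape (1,). A minimal sketch of adding the missing batch dimension (my own suggestion, not from the original post; MAX_NB_WORDS is an assumed vocabulary size):

import numpy as np

MAX_SEQUENCE_LENGTH = 1500
MAX_NB_WORDS = 10000  # assumed vocabulary size

# a flat vector of 1500 token ids is read as 1500 samples of shape (1,);
# adding a batch dimension makes it one sample of shape (1500,)
Xnoise = np.random.randint(1, MAX_NB_WORDS, size=MAX_SEQUENCE_LENGTH)
Xnoise = Xnoise.reshape(1, MAX_SEQUENCE_LENGTH)  # shape (1, 1500)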

Is my model right for my purpose, or have I written something really dumb that I can't see?

Thanks in advance for any help you can give me!

EDIT

Changelog:


v1.
###
- Changed model structure, now return_sequences = True and using shape instead of batch_shape
###
- Changed 
sequence_input = Input(batch_shape=(1,1,1500), dtype='int32')
to
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,))
###
- Changed the error the model is giving

v2.
###
- Changed generate_noise() code
###
- Added generate_integer_noise() code
###
- Added full sequence with the model compile, fit and predict
###
- Added model.fit summary under the model summary, in the tail of the post

generate_noise() code:

def generate_noise(samples_number, mean=0.5, stdev=0.1):
    noise = np.random.normal(mean, stdev, (samples_number, MAX_SEQUENCE_LENGTH))
    print(noise.shape)
    return noise

which prints the shape of the generated noise, (samples_number, MAX_SEQUENCE_LENGTH).

My generate_integer_noise() function goes as follows:

def generate_integer_noise(samples_number):
    noise = []
    for _ in range(0, samples_number):
        noise.append(np.random.randint(1, MAX_NB_WORDS))
    Xnoise = np.asarray(noise)
    return Xnoise

load_embeddings() code:

def load_embeddings(word_index, embeddingsfile='Embeddings/glove.6B.%id.txt' % EMBEDDING_DIM):
    embeddings_index = {}
    f = open(embeddingsfile, 'r', encoding='utf8')
    for line in f:
        values = line.split(' ')                         # split the line by spaces
        word = values[0]                                 # each line starts with the word
        coefs = np.asarray(values[1:], dtype='float32')  # the rest of the line is the vector
        embeddings_index[word] = coefs                   # put into embedding dictionary
    f.close()
    print('Found %s word vectors.' % len(embeddings_index))

    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in the embedding index will be all-zeros
            embedding_matrix[i] = embedding_vector

    embedding_layer = Embedding(len(word_index) + 1,
                                EMBEDDING_DIM,
                                weights=[embedding_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=False)
    return embedding_layer

Model summary (tested with a 999-record dataset; the real fit will use the 20k set):

Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 1500)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1500, 300)         9751200   
_________________________________________________________________
lstm_1 (LSTM)                (None, 1500, 16)          20288     
_________________________________________________________________
lstm_2 (LSTM)                (None, 1500, 32)          6272      
_________________________________________________________________
flatten_1 (Flatten)          (None, 48000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1500)              72001500  
=================================================================
Total params: 81,779,260
Trainable params: 72,028,060
Non-trainable params: 9,751,200
_________________________________________________________________
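As a sanity check, the parameter counts above match the standard LSTM formula 4 * ((input_dim + units + 1) * units) and the usual dense-layer count:

def lstm_params(input_dim, units):
    # four gates, each with a kernel, a recurrent kernel and a bias
    return 4 * ((input_dim + units + 1) * units)

print(lstm_params(300, 16))    # 20288    -> lstm_1
print(lstm_params(16, 32))     # 6272     -> lstm_2
print(48000 * 1500 + 1500)     # 72001500 -> dense_1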

1 Answer:

Answer 0 (score: 0):

I rewrote the full answer, and now it works (at least it compiles and runs; I can't say anything about convergence).

First of all, I don't know why you're using sparse_categorical_crossentropy instead of categorical_crossentropy; it could be important. I changed the model a bit so that it compiles with categorical_crossentropy; if you need the sparse version, change the shape of the target instead.
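To make the distinction concrete (a sketch, not part of the original answer): categorical_crossentropy expects one-hot targets, while sparse_categorical_crossentropy expects integer class indices:

import numpy as np
from keras.utils import to_categorical

num_classes = 10
y_int = np.array([3, 7, 1])                    # integer labels, for sparse_categorical_crossentropy
y_onehot = to_categorical(y_int, num_classes)  # shape (3, 10), for categorical_crossentropy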

I also changed batch_shape to the shape argument, because it allows batches of different sizes and is easier to use.

And a last edit: you should change generate_noise, because the Embedding layer expects integers in (0, max_features), not normally distributed floats (see the comment in the function).

EDIT
Following the last comment, I removed generate_noise and posted a modified generate_integer_noise function:

from keras.layers import Input, Embedding, LSTM
from keras.models import Model
import numpy as np


def generate_integer_noise(samples_number):
    """
    samples_number is a number of samples, i.e. first dimension in (some, 1500)
    """
    return np.random.randint(1, MAX_NB_WORDS, size=(samples_number, MAX_SEQUENCE_LENGTH))

MAX_SEQUENCE_LENGTH = 1500
"""
Tou can use your definition of embedding layer, 
I post to make a reproducible example
"""
max_features, embed_dim = 10, 300
embedding_matrix = np.zeros((max_features, embed_dim))
output_shape = MAX_SEQUENCE_LENGTH

embedded_layer = Embedding(
    max_features,
    embed_dim,
    weights=[embedding_matrix],
    trainable=False
)


def generator_model(embedded_layer, output_shape):
    """
    embedded_layer: Embedding keras layer
    output_shape: shape of the target
    """
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH, ))
    embedded_sequences = embedded_layer(sequence_input)   # set trainable=True above if you wish to train the embeddings

    layer = LSTM(32, return_sequences=True)(embedded_sequences)
    layer = LSTM(64, return_sequences=True)(layer)
    output = LSTM(output_shape)(layer)

    generator = Model(sequence_input, output)
    return generator


generator = generator_model(embedded_layer, output_shape)

noise = generate_integer_noise(32)

# generator.predict(noise)
generator.compile(loss='categorical_crossentropy', optimizer='adam')
generator.fit(noise, noise)
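For completeness, a quick shape check of inference with the compiled model (this only verifies shapes, not the quality of the generated sequences):

preds = generator.predict(noise)
print(preds.shape)  # (32, 1500): one 1500-long output vector per noise sample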