Masking for Keras BLSTM

Asked: 2016-06-14 16:33:38

Tags: python machine-learning keras lstm

I am running a BLSTM based on the IMDB example, but my version is not classification; rather, it is sequence prediction of labels. For simplicity, you can treat it as a POS tagging model: the inputs are sentences of words and the outputs are labels. The syntax used in that example differs slightly from most other Keras examples in that it does not use model.add, but instead builds the model from an Input tensor (the functional API). I can't figure out how to add a masking layer with this slightly different syntax.

I run the model and test it, and it runs fine, but it is predicting and scoring accuracy on the 0s, which are my padding. Here is the code:

from __future__ import print_function
import numpy as np
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers.core import Masking
from keras.layers import TimeDistributed, Dense
from keras.layers import Dropout, Embedding, LSTM, Input, merge
from prep_nn import prep_scan
from keras.utils import np_utils, generic_utils


np.random.seed(1337)  # for reproducibility
nb_words = 20000  # max. size of vocab
nb_classes = 10  # number of labels
hidden = 500  # 500 gives best results so far
batch_size = 10  # create and update net after 10 lines
val_split = .1
epochs = 15

# input for X is multi-dimensional numpy array with IDs,
# one line per array. input y is multi-dimensional numpy array with
# binary arrays for each value of each label.
# maxlen is length of longest line
print('Loading data...')
(X_train, y_train), (X_test, y_test) = prep_scan(
    nb_words=nb_words, test_len=75)

print(len(X_train), 'train sequences')
print(int(len(X_train)*val_split), 'validation sequences')
print(len(X_test), 'heldout sequences')

# this is the placeholder tensor for the input sequences
sequence = Input(shape=(maxlen,), dtype='int32')

# this embedding layer will transform the sequences of integers
# into vectors
embedded = Embedding(nb_words, output_dim=hidden,
                     input_length=maxlen)(sequence)

# apply forwards LSTM
forwards = LSTM(output_dim=hidden, return_sequences=True)(embedded)
# apply backwards LSTM
backwards = LSTM(output_dim=hidden, return_sequences=True,
                 go_backwards=True)(embedded)

# concatenate the outputs of the 2 LSTMs
merged = merge([forwards, backwards], mode='concat', concat_axis=-1)
after_dp = Dropout(0.15)(merged)

# TimeDistributed for sequence
# change activation to sigmoid?
output = TimeDistributed(
    Dense(output_dim=nb_classes,
          activation='softmax'))(after_dp)

model = Model(input=sequence, output=output)

# try using different optimizers and different optimizer configs
# loss=binary_crossentropy, optimizer=rmsprop
model.compile(loss='categorical_crossentropy',
              metrics=['accuracy'], optimizer='adam')

print('Train...')
model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=epochs,
          shuffle=True,
          validation_split=val_split)

Update:

I merged this PR and it runs with mask_zero=True in the embedding layer. But after seeing the model's terrible performance, I now realize I also need to mask the output; others have suggested using sample_weight in the model.fit line instead. How can I do this so that the 0s are ignored?
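One way I picture this (a rough sketch, not verified; it assumes the padding ID in X_train is 0): build a weight matrix with one row per sentence and one column per timestep, 1 for real tokens and 0 for padding, compile with sample_weight_mode='temporal', and hand the matrix to fit:

import numpy as np

# 1.0 where the input token is real, 0.0 where it is padding;
# shape (num_samples, maxlen), which is what sample_weight_mode='temporal' expects
weights = (X_train != 0).astype('float32')

model.compile(loss='categorical_crossentropy',
              metrics=['accuracy'], optimizer='adam',
              sample_weight_mode='temporal')

model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=epochs,
          shuffle=True,
          validation_split=val_split,
          sample_weight=weights)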

Update 2:

So I read this and figured out sample_weight as a matrix of 1s and 0s. I thought it might have been working, but my accuracy stalls around 50%, and I just found that it is still trying to predict the padded portions, only it no longer predicts them as 0, which was the problem before using sample_weight. (A diagnostic sketch follows the current code below.)

Current code:

from __future__ import print_function
import numpy as np
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers.core import Masking
from keras.layers import TimeDistributed, Dense
from keras.layers import Dropout, Embedding, LSTM, Input, merge
from prep_nn import prep_scan
from keras.utils import np_utils, generic_utils
import itertools
from itertools import chain
from sklearn.preprocessing import LabelBinarizer
import sklearn
import pandas as pd


np.random.seed(1337)  # for reproducibility
nb_words = 20000  # max. size of vocab
nb_classes = 10  # number of labels
hidden = 500  # 500 gives best results so far
batch_size = 10  # create and update net after 10 lines
val_split = .1
epochs = 10

# input for X is multi-dimensional numpy array with syll IDs,
# one line per array. input y is multi-dimensional numpy array with
# binary arrays for each value of each label.
# maxlen is length of longest line
print('Loading data...')
(X_train, y_train), (X_test, y_test), maxlen, sylls_ids, tags_ids, weights = prep_scan(nb_words=nb_words, test_len=75)

print(len(X_train), 'train sequences')
print(int(len(X_train) * val_split), 'validation sequences')
print(len(X_test), 'heldout sequences')

# this is the placeholder tensor for the input sequences
sequence = Input(shape=(maxlen,), dtype='int32')

# this embedding layer will transform the sequences of integers
# into dense vectors of size `hidden`
embedded = Embedding(nb_words, output_dim=hidden,
                     input_length=maxlen, mask_zero=True)(sequence)

# apply forwards LSTM
forwards = LSTM(output_dim=hidden, return_sequences=True)(embedded)
# apply backwards LSTM
backwards = LSTM(output_dim=hidden, return_sequences=True,
                 go_backwards=True)(embedded)

# concatenate the outputs of the 2 LSTMs
merged = merge([forwards, backwards], mode='concat', concat_axis=-1)
# after_dp = Dropout(0.)(merged)

# TimeDistributed for sequence
# change activation to sigmoid?
output = TimeDistributed(
    Dense(output_dim=nb_classes,
          activation='softmax'))(merged)

model = Model(input=sequence, output=output)

# try using different optimizers and different optimizer configs
# loss=binary_crossentropy, optimizer=rmsprop
model.compile(loss='categorical_crossentropy',
              metrics=['accuracy'], optimizer='adam',
              sample_weight_mode='temporal')

print('Train...')
model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=epochs,
          shuffle=True,
          validation_split=val_split,
          sample_weight=weights)
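To check whether that ~50% figure is just the padded timesteps dragging the metric around, one diagnostic I could run is accuracy restricted to real tokens (a sketch; it assumes X_test uses 0 for padding and y_test is one-hot per timestep):

preds = model.predict(X_test, batch_size=batch_size)   # shape (n_samples, maxlen, nb_classes)
pred_labels = preds.argmax(axis=-1)                     # predicted tag ID per timestep
true_labels = y_test.argmax(axis=-1)                    # gold tag ID per timestep
real_tokens = (X_test != 0)                             # True only where the input is not padding
masked_acc = (pred_labels == true_labels)[real_tokens].mean()
print('accuracy excluding padding:', masked_acc)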

1 Answer:

Answer 0 (score: 1)

Did you ever solve this? It is not clear to me how your code handles the padding value and the word indexes. How about letting the word indexes start from 1 and defining

embedded = Embedding(nb_words + 1, output_dim=hidden,
                     input_length=maxlen, mask_zero=True)(sequence)

instead of

embedded = Embedding(nb_words, output_dim=hidden,
                     input_length=maxlen, mask_zero=True)(sequence)

according to https://keras.io/layers/embeddings/
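A rough sketch of that index shift (lines_train below is a hypothetical name for the unpadded ID sequences; the only point is that every real ID moves up by one so that 0 is left free for padding):

from keras.preprocessing import sequence

# shift every real token ID up by 1 so that ID 0 is reserved for padding
shifted = [[token_id + 1 for token_id in line] for line in lines_train]

# pad with 0, which now cannot collide with a real word ID
X_train = sequence.pad_sequences(shifted, maxlen=maxlen, value=0)

With that shift, nb_words + 1 in the Embedding covers IDs 1..nb_words plus the reserved 0, which is what the documentation above describes for mask_zero=True.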