Shape error with mask_zero=True in the Embedding layer

Time: 2018-08-23 15:34:51

Tags: python tensorflow keras embedding ner

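I am trying to build a bidirectional LSTM with a CRF layer on top for NER. I am training on the CoNLL 2003 dataset. Without the mask_zero=True argument in the Embedding layer, the model trains fine. With mask_zero=True, I get this error: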

> c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [500,65] vs. [500]
 [[Node: metrics/acc/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](metrics/acc/Cast, metrics/acc/Cast_1)]]
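
Read literally, the message says that an element-wise multiplication inside the accuracy metric received one operand of shape [500, 65] (batch size 500, maxlen 65) and another of shape [500]. TensorFlow follows NumPy-style broadcasting, which aligns shapes starting from the last dimension, so these two shapes cannot be combined. The following is a minimal pure-Python sketch of that rule; the helper name `broadcastable` is hypothetical and is not a TensorFlow function.

```python
from itertools import zip_longest


def broadcastable(shape_a, shape_b):
    """Return True if two shapes are compatible under NumPy-style broadcasting."""
    # Compare dimensions from the last axis backwards; a missing axis counts as 1.
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    for dim_a, dim_b in pairs:
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False
    return True


# Shapes from the error message: the trailing 65 cannot be matched with 500.
print(broadcastable((500, 65), (500,)))    # False
# A trailing axis of equal length, or an explicit axis of size 1, is accepted.
print(broadcastable((500, 65), (65,)))     # True
print(broadcastable((500, 65), (500, 1)))  # True
```

This only illustrates why the two shapes clash; it does not show which part of the model produced each operand.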

The training data (x_train) is a list of lists of words (sentences), for example:

x_train = [['Sentence', 'number', 'one'], ['Sentence', 'number', 'two']]

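This data is then converted to a numeric representation with Tokenizer. The labels (y_train) are a list containing the one-hot encoded tags of the words. Both x_train (with 0 values) and y_train (with [1. 0. 0. 0. 0. 0. 0. 0. 0.] values, i.e. the O tag) are padded to maxlen.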
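
The padding described above can be sketched in plain Python as follows. This is an illustration only: the token ids and tag indices are made up, the helper names are hypothetical, and padding on the right is an assumption (the question does not say on which side the padding is added).

```python
MAXLEN = 5      # the question uses maxlen = 65; shortened here for readability
LABELS_CNT = 9  # number of tags, as in the question
O_TAG = 0       # index of the 'O' tag, used to pad the labels


def pad_tokens(seq, maxlen, pad_value=0):
    """Right-pad (or truncate) a list of token ids to exactly maxlen items."""
    return (seq + [pad_value] * maxlen)[:maxlen]


def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1.0 at position index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def pad_labels(tags, maxlen, size):
    """Pad tag indices with the 'O' tag, then one-hot encode every position."""
    padded = (tags + [O_TAG] * maxlen)[:maxlen]
    return [one_hot(tag, size) for tag in padded]


# Hypothetical token ids and tag indices for two sentences.
x = [pad_tokens(s, MAXLEN) for s in [[4, 7, 2], [4, 7, 9, 3]]]
y = [pad_labels(t, MAXLEN, LABELS_CNT) for t in [[0, 0, 0], [3, 4, 0, 5]]]

print(x[0])                             # [4, 7, 2, 0, 0]
print(len(y), len(y[0]), len(y[0][0]))  # 2 5 9
```

With 4995 sentences and maxlen = 65, the same layout gives the shapes (4995, 65) for the inputs and (4995, 65, 9) for the targets.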

These are the shapes of the training data and labels fed into the model.

> Input shape: (4995, 65)
Target shape: (4995, 65, 9)

Here is the code snippet:

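import os

import numpy as np
from keras.layers import LSTM, Bidirectional, Dense, Dropout, Embedding
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras_contrib.layers import CRF  # assumption: the CRF layer comes from keras-contrib

max_features = 20000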
maxlen = 65
MAX_WORDS = 20000
GLOVE_DIR = os.path.join(os.path.dirname(__file__), './glove')
EMBEDDING_DIM = 100
LABELS_CNT = 9

# Labels/tags
labels_index = {}
labels_index['O'] = 0
labels_index['B-ORG'] = 1
labels_index['I-ORG'] = 2
labels_index['B-PER'] = 3
labels_index['I-PER'] = 4
labels_index['B-LOC'] = 5
labels_index['I-LOC'] = 6
labels_index['B-MISC'] = 7
labels_index['I-MISC'] = 8

# Tokenizer
tokenizer = Tokenizer(num_words=MAX_WORDS, filters=[])
tokenizer.fit_on_texts(train_words)
word_index = tokenizer.word_index

def create_embeddings_layer():
    # Load pre-trained GloVe embedding
    embeddings_index = {}
    with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as glove_file:
        for line in glove_file:
            values = line.split()
            word = values[0]
            coefficients = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefficients

    embedding_matrix = np.zeros((len(word_index) + 2, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

    return Embedding(len(word_index) + 2,
                     EMBEDDING_DIM,
                     weights=[embedding_matrix],
                     input_length=maxlen,
                     mask_zero=True,
                     trainable=False)

def build_model(use_crf):
    """
    Builds Bidirectional LSTM-CRF model.
    """

    # Model
    model = Sequential()

    # Embeddings
    embeddings = create_embeddings_layer()
    model.add(embeddings)
    model.add(Dropout(0.5))

    # Bidirectional LSTM
    model.add(Bidirectional(LSTM(units=100, return_sequences=True)))
    model.add(Dense(100, activation='tanh'))

    # CRF
    if use_crf:
        crf = CRF(LABELS_CNT, sparse_target=False)
        model.add(crf)
    else:
        model.add(Dense(LABELS_CNT, activation='sigmoid'))

    loss = 'categorical_crossentropy'

    return model, loss

# Create model
model, loss = build_model(True)
model.compile('adam', loss, metrics=['accuracy'])

# Training
model.fit(x_train, y_train, batch_size=500, epochs=15)
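
For context, mask_zero=True makes the Embedding layer treat index 0 as padding and pass a mask marking those timesteps on to the following layers. That mask has one entry per token, so its shape is (batch, maxlen). Below is a rough pure-Python sketch of what such a mask looks like and how a per-token accuracy could take it into account. The function names and the data are hypothetical, and this is not Keras's internal implementation.

```python
def compute_mask(batch):
    """True where the token id is a real word, False where it is padding (0)."""
    return [[tok != 0 for tok in seq] for seq in batch]


def masked_accuracy(y_true, y_pred, mask):
    """Fraction of non-padded positions whose predicted tag equals the true tag."""
    correct = 0
    total = 0
    for true_seq, pred_seq, mask_seq in zip(y_true, y_pred, mask):
        for true_tag, pred_tag, keep in zip(true_seq, pred_seq, mask_seq):
            if keep:
                total += 1
                correct += int(true_tag == pred_tag)
    return correct / total if total else 0.0


# Hypothetical padded token ids and tag indices (0 is padding / the 'O' tag).
x = [[4, 7, 2, 0, 0], [4, 7, 9, 3, 0]]
y_true = [[0, 0, 0, 0, 0], [3, 4, 0, 5, 0]]
y_pred = [[0, 0, 1, 0, 0], [3, 4, 0, 0, 0]]

mask = compute_mask(x)
print(mask[0])                                # [True, True, True, False, False]
print(masked_accuracy(y_true, y_pred, mask))  # 5 of the 7 real tokens are correct
```

Padded positions are ignored entirely here, which is why an accuracy computed with a mask can differ from one computed over all maxlen positions.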

Any suggestions as to what is wrong?

0 answers:

No answers