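I am trying to build a bidirectional LSTM with a CRF layer on top for NER. I am training on the CoNLL 2003 dataset. Without the mask_zero=True parameter in the Embedding layer, I can train the model. With mask_zero=True, however, I get this error: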
> c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [500,65] vs. [500]
[[Node: metrics/acc/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](metrics/acc/Cast, metrics/acc/Cast_1)]]
The training data (x_train) is a list of lists of words (sentences). For example:
x_train = [['Sentence', 'number', 'one'], ['Sentence', 'number', 'two']]
This data is then converted to a numeric representation using Tokenizer.
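To make the conversion step concrete, here is a minimal pure-Python sketch of roughly what it amounts to. It is not the Keras Tokenizer itself: build_word_index and to_sequences are hypothetical stand-ins, and the only behavior assumed is that words are numbered from 1 by descending frequency, leaving 0 free for padding.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer, most frequent first, starting at 1.

    Index 0 is deliberately left unused so it can serve as the padding value.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(sentences, word_index):
    """Replace every known word by its integer index; unknown words are dropped."""
    return [[word_index[w] for w in sentence if w in word_index]
            for sentence in sentences]

x_train = [['Sentence', 'number', 'one'], ['Sentence', 'number', 'two']]
word_index = build_word_index(x_train)
sequences = to_sequences(x_train, word_index)

print(word_index)  # {'Sentence': 1, 'number': 2, 'one': 3, 'two': 4}
print(sequences)   # [[1, 2, 3], [1, 2, 4]]
```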
The labels (y_train) are a list containing the one-hot encoded tags of the words. Both x_train (with the value 0) and y_train (with the value [1. 0. 0. 0. 0. 0. 0. 0. 0.], the O tag) are padded to maxlen.
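The padding described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code actually used: the helper names (one_hot, pad_tokens, pad_labels) and the sample ids are made up, MAXLEN is shortened to 5 for readability, and padding on the right is an assumption (Keras pad_sequences pads on the left by default).

```python
MAXLEN = 5        # shortened for display; the post uses maxlen = 65
LABELS_CNT = 9    # number of tags, as in the post
O_INDEX = 0       # index of the 'O' tag in labels_index

def one_hot(index, size=LABELS_CNT):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector

def pad_tokens(token_ids, maxlen=MAXLEN):
    """Pad with 0 on the right (or truncate) to exactly `maxlen` ids."""
    return (token_ids + [0] * maxlen)[:maxlen]

def pad_labels(label_ids, maxlen=MAXLEN):
    """One-hot encode the tags, then pad with the one-hot 'O' vector."""
    encoded = [one_hot(i) for i in label_ids]
    padding = [one_hot(O_INDEX) for _ in range(maxlen)]
    return (encoded + padding)[:maxlen]

token_ids = [[1, 2, 3], [1, 2, 4]]   # made-up word indices
label_ids = [[0, 0, 0], [3, 4, 0]]   # made-up tags: O O O / B-PER I-PER O

padded_x = [pad_tokens(s) for s in token_ids]
padded_y = [pad_labels(s) for s in label_ids]

print(padded_x)  # [[1, 2, 3, 0, 0], [1, 2, 4, 0, 0]]
print(len(padded_y), len(padded_y[0]), len(padded_y[0][0]))  # 2 5 9
```

Note that a padded position in the labels is indistinguishable from a real O tag, while a padded position in the input is marked by the reserved value 0.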
These are the shapes of the training data and labels fed into the model:
> Input shape: (4995, 65)
Target shape: (4995, 65, 9)
Here is the code:
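import os

import numpy as np
from keras.layers import Bidirectional, Dense, Dropout, Embedding, LSTM
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras_contrib.layers import CRF  # assuming the keras-contrib CRF layer

max_features = 20000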
maxlen = 65
MAX_WORDS = 20000
GLOVE_DIR = os.path.join(os.path.dirname(__file__), './glove')
EMBEDDING_DIM = 100
LABELS_CNT = 9
# Labels/tags
labels_index = {}
labels_index['O'] = 0
labels_index['B-ORG'] = 1
labels_index['I-ORG'] = 2
labels_index['B-PER'] = 3
labels_index['I-PER'] = 4
labels_index['B-LOC'] = 5
labels_index['I-LOC'] = 6
labels_index['B-MISC'] = 7
labels_index['I-MISC'] = 8
# Tokenizer
tokenizer = Tokenizer(num_words=MAX_WORDS, filters='')  # filters expects a string; '' disables filtering
tokenizer.fit_on_texts(train_words)
word_index = tokenizer.word_index
def create_embeddings_layer():
    # Load pre-trained GloVe embeddings into a word -> vector dict
    embeddings_index = {}
    with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as glove_file:
        for line in glove_file:
            values = line.split()
            word = values[0]
            coefficients = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefficients

    # Row i holds the vector for the word with index i; row 0 is the padding index
    embedding_matrix = np.zeros((len(word_index) + 2, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

    return Embedding(len(word_index) + 2,
                     EMBEDDING_DIM,
                     weights=[embedding_matrix],
                     input_length=maxlen,
                     mask_zero=True,
                     trainable=False)
def build_model(use_crf):
    """
    Builds Bidirectional LSTM-CRF model.
    """
    # Model
    model = Sequential()
    # Embeddings
    embeddings = create_embeddings_layer()
    model.add(embeddings)
    model.add(Dropout(0.5))
    # Bidirectional LSTM
    model.add(Bidirectional(LSTM(units=100, return_sequences=True)))
    model.add(Dense(100, activation='tanh'))
    # CRF
    if use_crf:
        crf = CRF(LABELS_CNT, sparse_target=False)
        model.add(crf)
    else:
        model.add(Dense(LABELS_CNT, activation='sigmoid'))
    # Set outside the if/else so that loss is defined for both variants
    loss = 'categorical_crossentropy'
    return model, loss
# Create model
model, loss = build_model(True)
model.compile('adam', loss, metrics=['accuracy'])
# Training
model.fit(x_train, y_train, batch_size=500, epochs=15)
Any suggestions as to what is wrong?