I am currently using this code, taken from a discussion on GitHub. This is the code for the attention mechanism:
_input = Input(shape=[max_length], dtype='int32')
# get the embedding layer
embedded = Embedding(
input_dim=vocab_size,
output_dim=embedding_size,
input_length=max_length,
trainable=False,
mask_zero=False
)(_input)
activations = LSTM(units, return_sequences=True)(embedded)
# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)
sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)
probabilities = Dense(3, activation='softmax')(sent_representation)
Is this the correct approach? I was somewhat expecting a TimeDistributed layer to be present, since the attention mechanism is distributed over every time step of the RNN. I need someone to confirm that this implementation (the code above) is a correct implementation of an attention mechanism. Thanks.
Answer 0 (score: 15)
If you want attention over the time dimension, then this part of your code seems correct to me:
activations = LSTM(units, return_sequences=True)(embedded)
# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)
sent_representation = merge([activations, attention], mode='mul')
At this point you have computed an attention vector of shape (batch_size, max_length):
attention = Activation('softmax')(attention)
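To make the shapes explicit, here is the same block annotated with the output shape of each layer (my own annotations, assuming an LSTM with units hidden units and inputs of length max_length):
activations = LSTM(units, return_sequences=True)(embedded)          # (batch_size, max_length, units)
attention = Dense(1, activation='tanh')(activations)                # (batch_size, max_length, 1)
attention = Flatten()(attention)                                    # (batch_size, max_length)
attention = Activation('softmax')(attention)                        # (batch_size, max_length), sums to 1 over the time steps
attention = RepeatVector(units)(attention)                          # (batch_size, units, max_length)
attention = Permute([2, 1])(attention)                              # (batch_size, max_length, units)
sent_representation = merge([activations, attention], mode='mul')   # (batch_size, max_length, units)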
I have never seen this code before, so I can't say whether this part is actually correct:
K.sum(xin, axis=-2)
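For what it is worth, my reading of that line is that it sums the weighted activations over the time axis, collapsing (batch_size, max_length, units) into (batch_size, units), i.e. a weighted combination of the hidden states. A tiny NumPy sketch of the same reduction (my own illustration, not from the original post):
import numpy as np

x = np.random.rand(2, 5, 4)        # (batch_size=2, max_length=5, units=4)
print(x.sum(axis=-2).shape)        # (2, 4): one units-sized vector per sample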
Further reading (you may want to take a look):
Answer 1 (score: 1)
Recently I was applying an attention mechanism on top of a dense layer. Here is one example implementation:
from keras.layers import Input, Dense, multiply
from keras.models import Model
from keras import regularizers

def baseline_model():
    input_dims = train_data_X.shape[1]
    inputs = Input(shape=(input_dims,))
    dense1800 = Dense(1800, activation='relu', kernel_regularizer=regularizers.l2(0.01))(inputs)
    # attention weights over the 1800 dense features
    attention_probs = Dense(1800, activation='sigmoid', name='attention_probs')(dense1800)
    # element-wise weighting of the dense features by their attention scores
    attention_mul = multiply([dense1800, attention_probs], name='attention_mul')
    dense7 = Dense(7, kernel_regularizer=regularizers.l2(0.01), activation='softmax')(attention_mul)
    model = Model(inputs=[inputs], outputs=dense7)
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = baseline_model()
model.summary()
model.fit(train_data_X, train_data_Y_, epochs=20, validation_split=0.2, batch_size=600, shuffle=True, verbose=1)
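If you want to look at the learned attention weights, one option (my own sketch, reusing the attention_probs layer name defined above) is to read them out through an intermediate model:
attention_model = Model(inputs=model.input, outputs=model.get_layer('attention_probs').output)
attention_weights = attention_model.predict(train_data_X)   # shape: (num_samples, 1800)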
Answer 2 (score: 0)
The attention mechanism pays attention to different parts of the sentence:
activations = LSTM(units, return_sequences=True)(embedded)
It then determines the contribution of each hidden state of that sentence via
attention = Dense(1, activation='tanh')(activations)
attention = Activation('softmax')(attention)
and finally pays attention to the different states:
sent_representation = merge([activations, attention], mode='mul')
I don't understand this part very well: sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)
Answer 3 (score: 0)
I think you can try the following code to add a Keras self-attention mechanism to an LSTM network:
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, Flatten
from keras.optimizers import Adam
from keras_self_attention import SeqSelfAttention   # pip install keras-self-attention

inputs = Input(shape=(length,))
embedding = Embedding(vocab_size, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False)(inputs)
lstm = LSTM(num_lstm, return_sequences=True)(embedding)
# self-attention over the LSTM outputs, keeping the time dimension
attn = SeqSelfAttention(attention_activation='sigmoid')(lstm)
flat = Flatten()(attn)
dense = Dense(32, activation='relu')(flat)
outputs = Dense(3, activation='sigmoid')(dense)
model = Model(inputs=[inputs], outputs=outputs)
model.compile(loss='binary_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val), shuffle=True)
Answer 4 (score: 0)
While many good alternatives have already been given, I tried to modify the code you shared so that it works. I have also answered your other questions, which have not been addressed so far:
Q1. Is this the correct approach? The attention layer itself looks fine; no changes are needed there. You can, however, slightly simplify and modernize the way you consume the attention layer's output to incorporate some recent framework upgrades:
sent_representation = merge.Multiply()([activations, attention])
sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)
You are good to go now!
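Putting it all together, here is a minimal end-to-end sketch of the updated model (my own assembly, assuming the same vocab_size, embedding_size, max_length and units variables as in the question, and using the Multiply layer imported directly from keras.layers, which is equivalent to merge.Multiply):
from keras.layers import Input, Embedding, LSTM, Dense, Flatten, Activation, RepeatVector, Permute, Multiply, Lambda
from keras.models import Model
from keras import backend as K

_input = Input(shape=[max_length], dtype='int32')
embedded = Embedding(input_dim=vocab_size, output_dim=embedding_size,
                     input_length=max_length, trainable=False, mask_zero=False)(_input)
activations = LSTM(units, return_sequences=True)(embedded)

# attention weights over the time steps
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

# weighted sum of the LSTM states over time: (batch_size, max_length, units) -> (batch_size, units)
sent_representation = Multiply()([activations, attention])
sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)
model = Model(inputs=_input, outputs=probabilities)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])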
Q2. I was somewhat expecting a TimeDistributed layer to be present, since the attention mechanism is applied at every time step of the RNN.
No, you do not need a TimeDistributed layer; otherwise the weights would be shared across time steps, which is not what you want.
You can refer to https://towardsdatascience.com/create-your-own-custom-attention-layer-understand-all-flavours-2201b5e8be9e for additional details.