Differentiability problem in the predictive alignment of an attention implementation

Date: 2019-06-19 14:16:40

Tags: python tensorflow recurrent-neural-network tf.keras

I am trying to implement local-p attention based on this paper: https://arxiv.org/pdf/1508.04025.pdf. Specifically, equation (9) derives the aligned position as p_t = S * sigmoid(v_p^T tanh(W_p h_t)), i.e. the sigmoid of a non-linear transformation of the current hidden state, multiplied by the number of timesteps S. Since sigmoid returns a value between 0 and 1, this multiplication yields a valid index between 0 and the number of timesteps. I can round this to infer the predicted position; however, I couldn't find a way to convert the result to an integer for use in slicing/indexing operations, because tf.cast() is not differentiable. Another problem is that the derived positions have shape (B, 1), i.e. one aligned position per example in the batch. See below for these operations:

"""B = batch size, S = sequence length (num. timesteps), V = vocabulary size, H = number of hidden dimensions"""
class LocalAttention(Layer):
    def __init__(self, size, window_width=None, **kwargs):
        super(LocalAttention, self).__init__(**kwargs)
        self.size = size
        self.window_width = window_width # 2*D

    def build(self, input_shape): 
        self.W_p = Dense(units=input_shape[2], use_bias=False)
        self.W_p.build(input_shape=(None, None, input_shape[2])) # (B, 1, H)
        self._trainable_weights += self.W_p.trainable_weights

        self.v_p = Dense(units=1, use_bias=False)
        self.v_p.build(input_shape=(None, None, input_shape[2])) # (B, 1, H)
        self._trainable_weights += self.v_p.trainable_weights

        super(LocalAttention, self).build(input_shape)

    def call(self, inputs):
        sequence_length = inputs.shape[1]
        ## Get h_t, the current (target) hidden state ##
        target_hidden_state = Lambda(function=lambda x: x[:, -1, :])(inputs) # (B, H)
        ## Predict the aligned position p_t, eq. (9): p_t = S * sigmoid(v_p^T tanh(W_p h_t)) ##
        aligned_position = self.W_p(target_hidden_state) # (B, H)
        aligned_position = Activation('tanh')(aligned_position) # (B, H)
        aligned_position = self.v_p(aligned_position) # (B, 1)
        aligned_position = Activation('sigmoid')(aligned_position) # (B, 1)
        aligned_position = aligned_position * sequence_length # (B, 1)

Let's say the aligned_position tensor has elements [24.2, 15.1, 12.3] for a batch size B = 3, for simplification. Then the source hidden states are derived from the input hidden states (B=3, S, H) such that, for the first example, we take the timesteps starting from 24, something along the lines of first_batch_states = Lambda(function=lambda x: x[:, 24:, :])(inputs), and so on. Note that the implementation of local-p attention is more complicated than this, but I simplified it here. Hence, the main challenge is converting 24.2 to 24 without losing differentiability, or using some sort of mask operation to get the indices through a dot product. The mask operation is the preferred method, as we will have to do this for every example in the batch, and having a loop inside a custom Keras layer is not neat. Do you have any ideas on how to accomplish this task? I will appreciate any answers and comments!
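
To make the problem concrete, here is a minimal sketch of my own (not part of the original post; it assumes eager execution and the made-up values above) showing that the integer cast severs the gradient:

import tensorflow as tf

aligned_position = tf.Variable([[24.2], [15.1], [12.3]])  # (B=3, 1)
with tf.GradientTape() as tape:
    # Hard rounding via an integer cast, as one might attempt for slicing
    hard_index = tf.cast(aligned_position, tf.int32)      # (B, 1), int32
    loss = tf.reduce_sum(tf.cast(hard_index, tf.float32))
# No gradient flows back through the integer cast
print(tape.gradient(loss, aligned_position))              # None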

1 Answer:

Answer 0 (score: 1):

There are two ways I found to get around this issue:

  • Applying a Gaussian distribution to adjust the attention weights based on the aligned position derived in the original question, which keeps the process differentiable, as suggested by @Siddhant:
# Gaussian factor following eq. (10) of the paper, centered at the predicted position;
# the standard deviation here is window_width / 2 = D
gaussian_estimation = lambda s: tf.exp(-tf.square(s - aligned_position) /
                                       (2 * tf.square(self.window_width / 2)))
gaussian_factor = gaussian_estimation(0) # (B, 1)
for i in range(1, sequence_length):
    gaussian_factor = Concatenate()([gaussian_factor, gaussian_estimation(i)]) # (B, i+1)
# Adjust weights via gaussian_factor: (B, S*) to allow differentiability
attention_weights = attention_weights * gaussian_factor # (B, S*)
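
As a side note of my own (not from the original answer): assuming aligned_position has shape (B, 1) and sequence_length is known, the Python loop above can be replaced by a single broadcasted computation, along the lines of this sketch:

# Positions 0..S-1 as a float row vector: (1, S)
positions = tf.range(sequence_length, dtype=tf.float32)[tf.newaxis, :]
# Broadcasting (1, S) against aligned_position (B, 1) gives the full factor at once: (B, S)
gaussian_factor = tf.exp(-tf.square(positions - aligned_position) /
                         (2 * tf.square(self.window_width / 2)))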

It should be noted that there is no hard slicing operation here, only a smooth down-weighting of each timestep according to its distance from the predicted position.

  • Computing the sigmoid activation for every timestep, keeping only the top window_width values, and zeroing out the rest, so that the windowing is done with a differentiable mask instead of a slice:

aligned_position = self.W_p(inputs) # (B, S, H)
aligned_position = Activation('tanh')(aligned_position) # (B, S, H)
aligned_position = self.v_p(aligned_position) # (B, S, 1)
aligned_position = Activation('sigmoid')(aligned_position) # (B, S, 1)
## Only keep top D values out of the sigmoid activation, and zero-out the rest ##
aligned_position = tf.squeeze(aligned_position, axis=-1) # (B, S)
top_probabilities = tf.nn.top_k(input=aligned_position,
                                k=self.window_width,
                                sorted=False) # (values:(B, D), indices:(B, D))
onehot_vector = tf.one_hot(indices=top_probabilities.indices,
                           depth=sequence_length) # (B, D, S)
onehot_vector = tf.reduce_sum(onehot_vector, axis=1) # (B, S)
aligned_position = Multiply()([aligned_position, onehot_vector]) # (B, S)
aligned_position = tf.expand_dims(aligned_position, axis=-1) # (B, S, 1)
source_hidden_states = Multiply()([inputs, aligned_position]) # (B, S*=S(D), H)
## Scale back-to approximately original hidden state values ##
aligned_position += 1 # (B, S, 1)
source_hidden_states /= aligned_position # (B, S*=S(D), H)

It should be noted that here we apply the dense layers to all source hidden states to get a shape of (B, S, 1) for aligned_position, rather than (B, 1). I believe this is as close as one can get to what the paper suggests.
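
To see what the top-k masking above does, here is a small self-contained example of my own (the scores and k are made-up values):

import tensorflow as tf

scores = tf.constant([[0.9, 0.1, 0.8, 0.3]])                   # (B=1, S=4)
top = tf.nn.top_k(input=scores, k=2, sorted=False)             # indices of the two largest scores
mask = tf.reduce_sum(tf.one_hot(top.indices, depth=4), axis=1) # (1, 4)
print(mask.numpy())  # [[1. 0. 1. 0.]] -> keeps timesteps 0 and 2, zeroes out the rest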

Anyone trying to implement attention mechanisms can check out my repository: https://github.com/ongunuzaymacar/attention-mechanisms. The layers there are designed for many-to-one sequence tasks, but can be adapted to other settings with minor tweaks.