Question

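I am building a model that uses a recurrent layer (GRU) to convert one string into another. I have tried both Dense and TimeDistributed(Dense) as the last layer, but I don't understand the difference between the two when return_sequences=True is used, especially since they appear to have the same number of parameters.

My simplified model is the following: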

import keras

InputSize = 15   # features per time step
MaxLen = 64      # sequence length
HiddenSize = 16  # GRU units

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.GRU(HiddenSize, return_sequences=True)(inputs)
# Dense wrapped in TimeDistributed: the same Dense is applied at every time step
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)

The network summary is:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536      
_________________________________________________________________
time_distributed_1 (TimeDist (None, 64, 15)            255       
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0         
=================================================================

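This makes sense to me, since my understanding of TimeDistributed is that it applies the same layer at every time step, so the Dense layer has 16 * 15 + 15 = 255 parameters (weights + biases).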
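The arithmetic behind both parameter counts in the summary can be checked by hand. The sketch below is only a sanity check, not Keras code: the variable names are made up for illustration, and the GRU line assumes the classic formulation (three gates, each with an input kernel, a recurrent kernel and a single bias vector), which is the one that reproduces the 1536 shown above.

```python
# Sizes taken from the question's model.
input_size = 15   # InputSize: features per time step
hidden_size = 16  # HiddenSize: GRU units

# Dense on top of the GRU: one (hidden_size x input_size) kernel plus one
# bias per output unit. The sequence length (64) does not appear anywhere.
dense_params = hidden_size * input_size + input_size

# GRU, ASSUMING the classic formulation: 3 gates, each with an input kernel,
# a recurrent kernel and a single bias vector.
gru_params = 3 * (input_size * hidden_size
                  + hidden_size * hidden_size
                  + hidden_size)

print(dense_params)  # 255
print(gru_params)    # 1536
```

Neither formula contains MaxLen, which is why wrapping the Dense layer in TimeDistributed cannot change the count.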

However, if I switch to a plain Dense layer:

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.GRU(HiddenSize, return_sequences=True)(inputs)
# Plain Dense applied directly to the 3-D GRU output
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)

I still have only 255 parameters:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536      
_________________________________________________________________
dense_1 (Dense)              (None, 64, 15)            255       
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0         
=================================================================

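I wonder whether this is because Dense() uses only the last dimension of the shape and effectively treats every other dimension as a batch-like dimension. But then I am no longer sure what the difference between Dense and TimeDistributed(Dense) is.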

Update: Looking at https://github.com/fchollet/keras/blob/master/keras/layers/core.py, it does seem that Dense uses only the last dimension to size itself:

def build(self, input_shape):
    assert len(input_shape) >= 2
    input_dim = input_shape[-1]

    self.kernel = self.add_weight(shape=(input_dim, self.units),

It also uses keras.dot (K.dot in the code, the backend dot product) to apply the weights:

def call(self, inputs):
    output = K.dot(inputs, self.kernel)

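The documentation of keras.dot implies that it works fine on n-dimensional tensors. I wonder whether its exact behavior means that Dense() is in effect called at every time step. If so, the question remains what TimeDistributed() achieves in this case.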
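The hypothesis in the update can be tried out with plain NumPy, whose np.dot also contracts the last axis of an n-dimensional input with a 2-D matrix. This is a minimal sketch under that analogy, not the Keras implementation: h is a random stand-in for the GRU output, and kernel and bias are hypothetical weights with the shapes from the question.

```python
import numpy as np

batch, max_len, hidden_size, units = 2, 64, 16, 15

rng = np.random.default_rng(0)
h = rng.standard_normal((batch, max_len, hidden_size))  # stand-in for GRU output
kernel = rng.standard_normal((hidden_size, units))      # hypothetical Dense kernel
bias = rng.standard_normal(units)                       # hypothetical Dense bias

# One call on the full 3-D tensor: np.dot contracts the last axis of `h`
# with the first axis of `kernel`.
all_at_once = np.dot(h, kernel) + bias

# The same weights applied explicitly, one time step at a time.
step_by_step = np.stack(
    [np.dot(h[:, t, :], kernel) + bias for t in range(max_len)], axis=1
)

print(all_at_once.shape)
print(np.allclose(all_at_once, step_by_step))
```

If the two results agree, a dot product over the last axis is already a per-time-step application of the same weights, which is consistent with the identical parameter counts.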

Answer 1

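TimeDistributedDense applies the same dense layer to every time step during GRU/LSTM cell unrolling. The error function is therefore computed between the predicted label sequence and the actual label sequence. (This is normally what sequence-to-sequence labeling problems require.)

With return_sequences=False, however, the Dense layer is applied only once, at the last cell. This is normally the case when RNNs are used for classification problems. If return_sequences=True, the Dense layer is applied to every time step, just like TimeDistributedDense.

So for your models the two are the same, but if you change your second model to "return_sequences=False", Dense will be applied only at the last cell. Try changing it and the model will throw an error, because Y would then have to be of size [Batch_size, InputSize]; it is no longer a sequence-to-sequence problem but a full-sequence-to-label problem.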

from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed, GRU
import numpy as np

InputSize = 15
MaxLen = 64
HiddenSize = 16

OutputSize = 8
n_samples = 1000

model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')


model2 = Sequential()
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model2.add(Dense(OutputSize))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')

model3 = Sequential()
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
model3.add(Dense(OutputSize))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')

X = np.random.random([n_samples,MaxLen,InputSize])
Y1 = np.random.random([n_samples,MaxLen,OutputSize])
Y2 = np.random.random([n_samples, OutputSize])

# `epochs` replaces the older `nb_epoch` argument
model1.fit(X, Y1, batch_size=128, epochs=1)
model2.fit(X, Y1, batch_size=128, epochs=1)
model3.fit(X, Y2, batch_size=128, epochs=1)

# summary() prints the table itself and returns None
model1.summary()
model2.summary()
model3.summary()

In the example above, model1 and model2 have the same architecture (sequence-to-sequence models), while model3 is a full-sequence-to-label model.
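The shape difference between the two cases can be illustrated without training anything. The sketch below uses NumPy only: gru_outputs is a hypothetical stand-in for the per-step outputs of a recurrent layer, and the sizes are reduced from the example above so it runs instantly.

```python
import numpy as np

n_samples, max_len, hidden_size, output_size = 4, 64, 16, 8

rng = np.random.default_rng(0)
# Hypothetical stand-in for the per-step outputs of a recurrent layer.
gru_outputs = rng.standard_normal((n_samples, max_len, hidden_size))
kernel = rng.standard_normal((hidden_size, output_size))
bias = np.zeros(output_size)

# return_sequences=True: every time step is passed on, so the dense
# projection yields one output vector per step.
seq_out = gru_outputs @ kernel + bias

# return_sequences=False: only the last time step is passed on.
last_out = gru_outputs[:, -1, :] @ kernel + bias

print(seq_out.shape)   # (4, 64, 8)
print(last_out.shape)  # (4, 8)
```

This is why model1 and model2 are trained against the 3-D target Y1, while model3 needs the 2-D target Y2.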

Answer 2

Here is a piece of code that verifies that TimeDistributed(Dense(X)) is identical to Dense(X):

import numpy as np 
from keras.layers import Dense, TimeDistributed
import tensorflow as tf

X = np.array([ [[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [10, 11, 12]
               ],
               [[3, 1, 7],
                [8, 2, 5],
                [11, 10, 4],
                [9, 6, 12]
               ]
              ]).astype(np.float32)
print(X.shape)

(2, 4, 3)

dense_weights = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
                          [0.2, 0.7, 0.9, 0.1, 0.2],
                          [0.1, 0.8, 0.6, 0.2, 0.4]])
bias = np.array([0.1, 0.3, 0.7, 0.8, 0.4])
print(dense_weights.shape)

(3, 5)

dense = Dense(input_dim=3, units=5, weights=[dense_weights, bias])
input_tensor = tf.Variable(X, name='inputX')
output_tensor1 = dense(input_tensor)
output_tensor2 = TimeDistributed(dense)(input_tensor)
print(output_tensor1.shape)
print(output_tensor2.shape)

(2, 4, 5)

(2, ?, 5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    output1 = sess.run(output_tensor1)
    output2 = sess.run(output_tensor2)

print(output1 - output2)

The difference is:

[[[0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]]

 [[0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]]]
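The same check can be reproduced without TensorFlow. The NumPy sketch below reuses the input and weights from the code above and models the wrapper, as an assumption about how to picture it rather than a copy of the Keras source, as "fold the time axis into the batch axis, apply the 2-D layer, unfold again".

```python
import numpy as np

X = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
              [[3, 1, 7], [8, 2, 5], [11, 10, 4], [9, 6, 12]]], dtype=float)
dense_weights = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
                          [0.2, 0.7, 0.9, 0.1, 0.2],
                          [0.1, 0.8, 0.6, 0.2, 0.4]])
bias = np.array([0.1, 0.3, 0.7, 0.8, 0.4])

# Dense-style: contract the last axis of the 3-D input directly.
direct = X @ dense_weights + bias

# TimeDistributed-style (conceptual model only): fold time into the batch
# axis, apply the 2-D layer, then unfold again.
batch, steps, features = X.shape
folded = X.reshape(batch * steps, features) @ dense_weights + bias
unfolded = folded.reshape(batch, steps, -1)

print(direct.shape)
print(np.allclose(direct, unfolded))
```

Both routes use one (3, 5) kernel and one bias vector, so they agree element for element, matching the all-zero difference printed above.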

TimeDistributed(Dense) vs Dense in Keras - Same number of parameters
