Python 3.6.8
tf.version: "2.1.0-rc1"
tf.keras.version: "2.2.4-tf"
I am trying to modify the official Transformer model for a speech recognition task.
URL: https://github.com/tensorflow/models/blob/master/official/transformer/v2/transformer_main.py
As described in the Speech-Transformer paper (http://150.162.46.34:8080/icassp2018/ICASSP18_USB/pdfs/0005884.pdf), I added two Conv2D layers before the encoder stack and modified the input data pipeline to handle 2D inputs.
When I train the model, the loss starts at around 8.XX and converges nicely to about 0.3X. However, when I load the model and resume training, the loss starts from 8.XX again.
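To be concrete about what I mean by "load the model and resume training", here is a minimal, self-contained sketch of that save/restore cycle using a toy model (the checkpoint directory, callback, and toy layers are placeholders only, not the actual transformer_main.py code):

    import numpy as np
    import tensorflow as tf

    # Toy model and data stand in for the transformer; only the
    # save/restore mechanics matter here.
    def build_model():
        m = tf.keras.Sequential([
            tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
            tf.keras.layers.Dense(1)])
        m.compile(optimizer="adam", loss="mse")
        return m

    x = np.random.rand(32, 4).astype("float32")
    y = np.random.rand(32, 1).astype("float32")

    ckpt_dir = "ckpt_demo"  # placeholder directory
    cb = tf.keras.callbacks.ModelCheckpoint(filepath=ckpt_dir + "/cp.ckpt",
                                            save_weights_only=True)

    model = build_model()
    model.fit(x, y, epochs=2, callbacks=[cb])    # first run: weights saved each epoch

    model = build_model()                        # "restart": rebuild from scratch
    model.load_weights(tf.train.latest_checkpoint(ckpt_dir))  # restore before resuming
    model.fit(x, y, epochs=2, callbacks=[cb])    # loss should pick up near its previous value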
You can check the additional layers I added before the encoder stack below.
import tensorflow as tf


class Embed(tf.keras.layers.Layer):
    def __init__(self, params, cmvn_mean, cmvn_std):
        super(Embed, self).__init__()
        self.params = params
        self.linear_projection = None
        self.cmvn_mean = None
        self.cmvn_std = None
        self.mean = cmvn_mean
        self.std = cmvn_std
        self.cmvn_layer = None
        self.conv1 = None
        self.conv2 = None
        self.bn1 = None
        self.bn2 = None
        self.reshape_to_conv = None
        self.flatten = None

    def build(self, input_shape):
        params = self.params
        # I added some layers for embedding
        self.linear_projection = \
            tf.keras.layers.TimeDistributed(
                tf.keras.layers.Dense(
                    params["hidden_size"], activation='linear'))
        self.cmvn_mean = tf.keras.backend.constant(self.mean)
        self.cmvn_std = tf.keras.backend.constant(self.std)
        # _normalize is a module-level function (defined elsewhere) that
        # applies (x - mean) / std to the input features.
        self.cmvn_layer = tf.keras.layers.Lambda(_normalize)
        # TODO: stride and filter number are hard coded.
        stride = 2
        filter_number = 64
        self.conv1 = tf.keras.layers.Conv2D(filters=filter_number, kernel_size=(3, 3),
                                            activation='relu', padding='same',
                                            strides=(stride, stride))
        self.conv2 = tf.keras.layers.Conv2D(filters=filter_number, kernel_size=(3, 3),
                                            activation='relu', padding='same',
                                            strides=(stride, stride))
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()
        self.reshape_to_conv = tf.keras.layers.Reshape((-1,
                                                        params["feature_dimension"],
                                                        1))
        self.flatten = \
            tf.keras.layers.Reshape((-1, int(params["feature_dimension"] / (stride * 2))
                                     * filter_number))
        super(Embed, self).build(input_shape)

    def get_config(self):
        # Note: only params is serialized here; cmvn_mean / cmvn_std are not
        # part of the config.
        return {
            "params": self.params,
        }

    def call(self, inputs):
        """
        from the Speech-Transformer paper,
        We firstly stack two 3×3 CNN layers with stride 2 for both time and
        frequency dimensions to prevent the GPU memory overflow and produce the
        approximate hidden representation length with the character length.
        """
        norm_input = self.cmvn_layer([inputs, self.cmvn_mean, self.cmvn_std])
        conv_input = self.reshape_to_conv(norm_input)
        conv_input = self.conv1(conv_input)
        conv_input = self.bn1(conv_input)
        conv_input = self.conv2(conv_input)
        conv_input = self.bn2(conv_input)
        # TODO: Additional Module (opt) in section 4.2 ==> I'll build big model
        """
        Then, we can optionally stack M additional modules which are applied to
        extracting more expressive representations for our Speech Transformer,
        which will be detailed in section 4.2. ==> skipped
        """
        # Linear
        """
        Next, we perform a linear transformation on the flattened feature map
        outputs to obtain the vectors of dimension d_model, which is called input
        encoding here.
        """
        flattened_input = self.flatten(conv_input)
        return self.linear_projection(flattened_input)
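For reference, a quick shape check of this embedding (hypothetical hidden_size / feature_dimension values, and a stand-in _normalize since the real one lives elsewhere in my code; run in the same module as the Embed class):

    import numpy as np
    import tensorflow as tf

    # Stand-in for the module-level CMVN function referenced in Embed.build.
    def _normalize(args):
        x, mean, std = args
        return (x - mean) / std

    params = {"hidden_size": 256, "feature_dimension": 80}  # hypothetical values
    feat_dim = params["feature_dimension"]
    embed = Embed(params,
                  cmvn_mean=np.zeros(feat_dim, dtype=np.float32),
                  cmvn_std=np.ones(feat_dim, dtype=np.float32))

    x = tf.random.normal((2, 100, feat_dim))  # (batch, frames, feature_dimension)
    y = embed(x)
    print(y.shape)  # (2, 25, 256): time axis reduced 4x by the two stride-2 convs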