I'm fairly new to convolutional LSTM networks, but I'm currently working on a problem that involves predicting sequences of future frames, which is why I decided to look into ConvLSTM networks.
To understand how the model works and how it can be extended, I ran some initial tests on the Moving MNIST dataset: http://www.cs.toronto.edu/~nitish/unsupervised_video/mnist_test_seq.npy
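As a quick sanity check (assuming the file sits in the working directory), the raw array can be inspected like this; the file should contain 10,000 sequences of 20 frames of 64x64 grayscale pixels, stored frames-first:
import numpy as np
# Inspect the raw Moving MNIST array before any preprocessing.
# Expected layout: (20, 10000, 64, 64) -> (frames, sequences, height, width)
raw = np.load('mnist_test_seq.npy')
print(raw.shape)  # should print (20, 10000, 64, 64)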
However, after training and inference, I expected the predictions to be much more coherent, especially when comparing them to results from others who have applied a similar approach to the Moving MNIST dataset. It looks as if the original trajectory of the digits is being "preserved" in the output.
Is this a common limitation, or is my network architecture poorly designed for the task at hand?
I have read and applied the approach from the following paper: https://arxiv.org/abs/1506.04214
They also have a GitHub page, from which I mainly used the Keras example for the ConvLSTM cells: https://github.com/wqxu/ConvLSTM
I have reduced the sample size to 100 so that you can reproduce the results, but I also trained the model for 100 epochs (roughly one hour on a K40 GPU) just to rule out that the problem was simply a lack of training rather than the model itself.
My code is as follows (it assumes you have downloaded the Moving MNIST dataset from the link above into the directory stored in the 'path' variable):
from keras.models import Sequential
from keras.layers.convolutional import Conv3D
from keras.layers.convolutional_recurrent import ConvLSTM2D
from keras.layers.normalization import BatchNormalization
import numpy as np
import matplotlib.pyplot as plt
path = "./"
data = np.load(path + 'mnist_test_seq.npy')
# Define image dimensions and frames to be used for LSTM memory
sequence_length = 15
image_height = data.shape[2]
image_width = data.shape[3]
# swap frames and observations so [obs, frames, height, width, channels]
data = data.swapaxes(0, 1)
# only select first 100 observations to reduce memory- and compute requirements
sub = data[:100, :, :, :]
# add channel dimension (grayscale)
sub = np.expand_dims(sub, 4)
# normalize to 0, 1
# sub = sub / 255
sub[sub < 128] = 0
sub[sub >= 128] = 1
# Define network
seq = Sequential()
seq.add(ConvLSTM2D(filters=64, kernel_size=(1, 1),
                   input_shape=(None, image_height, image_width, 1),  # will need to change channels to 3 for real images
                   padding='same', return_sequences=True,
                   activation='relu'))
seq.add(BatchNormalization())
seq.add(ConvLSTM2D(filters=64, kernel_size=(2, 2),
                   padding='same', return_sequences=True,
                   activation='relu'))
seq.add(BatchNormalization())
seq.add(ConvLSTM2D(filters=64, kernel_size=(1, 1),
                   padding='same', return_sequences=True,
                   activation='relu'))
seq.add(BatchNormalization())
seq.add(ConvLSTM2D(filters=64, kernel_size=(2, 2),
                   padding='same', return_sequences=True,
                   activation='relu'))
seq.add(BatchNormalization())
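# The final 1x1x1 convolution collapses the 64 ConvLSTM feature maps into a
# single sigmoid output channel, so the prediction keeps the same
# (frames, height, width) shape as the binary targets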
seq.add(Conv3D(filters=1, kernel_size=(1, 1, 1),
               activation='sigmoid',
               padding='same', data_format='channels_last'))
seq.compile(loss='binary_crossentropy', optimizer='adam')
# Add helper function for shifting input and output, so previous frame (X_t-1) is used as input to predict next frame (y_t)
def shift_data(data, n_frames=15):
    X = data[:, 0:n_frames, :, :, :]
    y = data[:, 1:(n_frames + 1), :, :, :]
    return X, y
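# With sub of shape (100, 20, 64, 64, 1) and n_frames=15, X covers frames
# 0-14 and y covers frames 1-15, i.e. each target frame is the corresponding
# input frame shifted one step into the future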
# Run script
# prepare X, y
X, y = shift_data(sub, sequence_length)
# fit the model
seq.fit(X, y, batch_size=16, epochs=100, validation_split=0.05)
# select a single observation and generate predictions for it
test_set = np.expand_dims(X[5, :, :, :, :], 0)
prediction = seq.predict(test_set)
# compare to ground truth and visualize
for i in range(0, 13):
    # create plot
    fig = plt.figure(figsize=(10, 5))
    # truth
    ax = fig.add_subplot(122)
    ax.text(1, -3, ('ground truth at time: ' + str(i)), fontsize=20, color='b')
    toplot_true = test_set[0, i, ::, ::, 0]
    plt.imshow(toplot_true)
    # predictions
    ax = fig.add_subplot(121)
    ax.text(1, -3, ('predicted frame at time: ' + str(i)), fontsize=20, color='b')
    toplot_pred = prediction[0, i + 1, ::, ::, 0]
    plt.imshow(toplot_pred)
    plt.savefig(path + '/%i_image.png' % (i + 1))
The results I get look like this:
The first image looks fine: Frame 1
However, Frame 6 and Frame 13 clearly show the entire trajectory from the previous steps.
If you visualize all the images at once, it also becomes clear that the digits' trajectories are never "removed" from the images.
I am not sure whether this is a known limitation of the model or whether the model simply has not converged. What worries me is that, given the relative simplicity of the dataset, these results are not very satisfying, and a more complex task would be completely infeasible for this model. Any feedback would be greatly appreciated!