Why does an LSTM model produce different final weights on every run, while the initial weights are always the same?

Asked: 2019-04-09 12:59:50

Tags: tensorflow keras python-3.6

As you can see in the code snippet, we ran the same code multiple times with 2000 epochs each, and without any dropout layers, so as to avoid random unit dropping and random weight selection.

Yet we still got different results after each run. We checked the initial weights, and they were identical in every run (i.e. at the start of each run).
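
For reference, this is roughly how we dumped and compared the initial weights between runs (a minimal, hypothetical helper, not part of the snippet below; model is the compiled Sequential model, and the .npy file names are made up):

import numpy as np

# Flatten all layer weights right after compile() so two runs can be diffed
weights = np.concatenate([w.ravel() for w in model.get_weights()])
np.save('initial_weights_run2.npy', weights)

# Compare against the dump left behind by the previous run
previous = np.load('initial_weights_run1.npy')
print('initial weights identical:', np.allclose(previous, weights))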

Every run produced different results. For example, the output below shows a run in which val_loss never improved, right up to the last epoch:

Epoch 00000: val_loss did not improve
3s - loss: 0.0251 - val_loss: 0.0276
1994
--
Epoch 00000: val_loss did not improve
3s - loss: 0.0251 - val_loss: 0.0276
1995
--
Epoch 00000: val_loss did not improve
3s - loss: 0.0251 - val_loss: 0.0276
1996
--
Epoch 00000: val_loss did not improve
3s - loss: 0.0251 - val_loss: 0.0276
1997
--
Epoch 00000: val_loss did not improve
3s - loss: 0.0251 - val_loss: 0.0276
1998
--
Epoch 00000: val_loss did not improve
3s - loss: 0.0251 - val_loss: 0.0276
1999
--
Epoch 00000: val_loss did not improve
3s - loss: 0.0251 - val_loss: 0.0276
[2019-04-07 18:28:17,495 - - DEBUG -my_project_model.py:317 -             fit_lstm() ] Time taken: 126.07314221905544 min

For the same dataset and the same code snippet, the output is different. A sample log is shown below; in this run, val_loss drops much lower within 285 epochs than it did in the previous run. We are confused about what is actually happening behind the scenes.

3s - loss: 0.0044 - val_loss: 0.0011
271
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0043 - val_loss: 0.0011
272
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0043 - val_loss: 0.0011
273
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0043 - val_loss: 9.5030e-04
274
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0042 - val_loss: 9.7404e-04
275
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0042 - val_loss: 0.0010
276
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0044 - val_loss: 9.6836e-04
277
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0042 - val_loss: 0.0011
278
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0042 - val_loss: 0.0010
279
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0042 - val_loss: 0.0010
280
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0042 - val_loss: 0.0011
281
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0042 - val_loss: 8.9629e-04
282
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0041 - val_loss: 9.8693e-04
283
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0041 - val_loss: 9.4584e-04
284
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0041 - val_loss: 0.0011
285
--
Epoch 1/1
Epoch 00000: val_loss did not improve
3s - loss: 0.0041 - val_loss: 9.8990e-04

We understand this is stochastic in nature, but in that case the outputs should at least occasionally match; they never did. We suspected dropout might be introducing extra randomness, so we removed it from the snippet. The results above come from a live run, and the snippet used for it is given below.
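
As an aside: if the dropout layers are re-enabled later, their randomness can also be pinned, since Keras' Dropout layer accepts a seed argument. A minimal sketch (the rate of 0.2 is made up, not from our code):

from keras.layers import Dropout

# A fixed seed makes the dropout mask generation repeatable across runs
drop = Dropout(0.2, seed=1)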

Information about the libraries:

[id@ip~]$ source activate projectcondaenv

(projectcondaenv) [id@ip~]$ conda list | grep -i keras
dist-keras                0.2.1                     <pip>
keras                     2.0.5                    py36_0  

(projectcondaenv) [id@ip~]$ conda list | grep -i tensor
tensorflow                1.3.0                         0  
tensorflow-base           1.3.0            py36h5293eaa_1  
tensorflow-tensorboard    0.1.5                    py36_0 

The variables are read from a config file; they are as follows:

nb_epoch=2000
batch_size=1
neurons=15
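
For completeness, the values are read from the config roughly like this (a hypothetical sketch; the actual reader code, file name, and section name are not shown here):

from configparser import ConfigParser

cfg = ConfigParser()
cfg.read('model.ini')                        # assumed file name
nb_epoch = cfg.getint('lstm', 'nb_epoch')    # assumed section name
batch_size = cfg.getint('lstm', 'batch_size')
neurons = cfg.getint('lstm', 'neurons')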

The code is shown below:

import logging

import tensorflow as tf
from numpy.random import seed
from keras.layers import Dense, LSTM, TimeDistributed, Dropout
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
#from keras.callbacks import TensorBoard
from keras.models import Sequential
from keras.models import load_model
#from keras.constraints import NonNeg
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from pathlib import Path
from datetime import timedelta
#from time import time

from exceptions.model_file_not_found_exception import ModelFileNotFoundException
from exceptions.data_not_found_exception import DataNotFoundException


logger = logging.getLogger(__name__)

# Seed Tensorflow's and numpy's RNGs up front
tf.set_random_seed(1234)
seed(1)


class MyProject(object):
    def fit_lstm(self, train, batch_size, nb_epoch, neurons, test=None, load_model=False):
        import timeit
        try:
            start = timeit.default_timer()
            X, y = train[:, 0:-1], train[:, -1]
            X = X.reshape(X.shape[0], 1, X.shape[1])

            if test is not None and test.any():  # guard against the test=None default
                X_test, y_test = test[:, 0:-1], test[:, -1]
                X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])

            model = Sequential()
            model.add(LSTM(neurons, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True,
                           return_sequences=True))
            #model.add(Dropout(self.dropout_1)) #commented
            model.add(LSTM(neurons, stateful=True))
            #model.add(Dropout(self.dropout_2)) #commented
            model.add(Dense(1))
            model.compile(loss=self.loss, optimizer=self.optimizer)

            if load_model:
                pass

            # callbacks
            c = [
                ModelCheckpoint(self.checkpoint_dir+self.model_filename, save_best_only=True,
                                                monitor='val_loss', mode='min', verbose=1, period=1),
                EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=1),
                ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=self.min_lr)
            ]

            # Stateful LSTM: train one epoch at a time without shuffling,
            # resetting the hidden state between epochs
            for i in range(nb_epoch):
                print(i)
                model.fit(X, y, epochs=1, batch_size=batch_size, verbose=2, shuffle=False, validation_data=(X_test, y_test),
                          callbacks=c)
                model.reset_states()
            time_taken = timeit.default_timer() - start
            logger.debug('Time taken: ' + str(time_taken/60) + ' min')
            # get_latest_model() presumably reloads the checkpoint written by ModelCheckpoint
            model = self.get_latest_model()
            return model, round(time_taken/60, 2)
        except Exception as err:
            logger.error('Fit LSTM Method failed with Errors .. '+str(err))
            logger.exception('=== Failed to fit the LSTM Model  === ')
            raise err

Can anyone highlight what the issue may be?

Why is there so much randomness in the output?

Is it failing to find the global minimum and getting stuck at a local minimum? Please shed some light to help us move forward.

I have gone through a few articles and Keras issues (listed below), but none of them answers the question.

References:

https://stats.stackexchange.com/questions/255105/why-is-the-validation-accuracy-fluctuating

A few related Keras issues:

https://github.com/keras-team/keras/issues/1597

https://github.com/keras-team/keras/issues/2711

https://github.com/keras-team/keras/issues/11371

1 Answer:

Answer 0 (score: 1)

Recently I tried to get reproducible results with Tensorflow 2.0 and its high-level Keras API, and let me say up front that this is no easy task.

I believe you are not on the right track here (though I cannot verify it directly); the problem is the inherent sources of randomness in Tensorflow.

First, you should try to set everything up to be as deterministic as possible. To do so, follow the Keras FAQ section on reproducibility. Essentially, you have to set the following:

import numpy as np
import tensorflow as tf
import random as rn

SEED=0

# Numpy fixed random seed
np.random.seed(SEED)

# Python's random generator
rn.seed(SEED)

# Tensorflow has to use one thread (multiple threads might give you different results)
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)

from keras import backend as K

# Set Tensorflow random seed
tf.set_random_seed(1234)

# Create default graph without parallelism
K.set_session(tf.Session(graph=tf.get_default_graph(), config=session_conf))
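
Since I mentioned Tensorflow 2.0 above: on TF 2.x the session and graph calls are gone, and a rough equivalent of this setup (my sketch, not part of the FAQ snippet) would be:

import tensorflow as tf

tf.random.set_seed(SEED)

# One thread per op pool, mirroring the ConfigProto settings above
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)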

Additionally, you have to set the environment variable PYTHONHASHSEED=0 before running your Python script from the CLI, like this:

$ PYTHONHASHSEED=0 python my_script.py

If you still cannot get reproducible results, CUDA may be to blame. Just to be sure, disable it as well (for testing purposes only):

$ CUDA_VISIBLE_DEVICES="" PYTHONHASHSEED=0 python my_script.py

You can introduce each change step by step, so you can rule out sources of nondeterminism one at a time. I would proceed like this (a quick verification sketch follows the list):

  • Set all the seeds and PYTHONHASHSEED, and check whether the results still differ
  • Remove the parallelism between operations with session_conf
  • Finally, disable CUDA (if the previous steps did not help)
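
As a sanity check after each step, you can dump the final weights and diff two CLI runs directly (a minimal sketch; model is your fitted model, and RUN_TAG is a hypothetical environment variable used only to name the dumps):

import os
import numpy as np

# After training, flatten the final weights so two runs can be compared
weights = np.concatenate([w.ravel() for w in model.get_weights()])
run_tag = os.environ.get('RUN_TAG', 'run1')
np.save('final_weights_%s.npy' % run_tag, weights)

# On the second run (e.g. RUN_TAG=run2), compare against the first dump
if run_tag != 'run1' and os.path.exists('final_weights_run1.npy'):
    print('identical to run1:', np.allclose(np.load('final_weights_run1.npy'), weights))

For example:

$ RUN_TAG=run2 PYTHONHASHSEED=0 python my_script.py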

Beyond that (and assuming you run the model with the same initial weights and the validation split is always the same), it could also be a design flaw inherent to these frameworks (see this issue).