Stateful Keras LSTM in reinforcement learning

Time: 2018-03-28 16:58:00

Tags: tensorflow keras lstm reinforcement-learning

I am using Keras for a simple DQN RL algorithm, but with an LSTM in the network. The idea is that a stateful LSTM will remember the relevant information from all previous states and therefore predict the rewards of the different actions better. This is more of a Keras question than an RL one: I don't think I am handling the stateful LSTM correctly.
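For context, the behaviour I am relying on is that a stateful LSTM keeps its hidden state across successive calls until reset_states() is called. A minimal standalone check of that (toy shapes, not my actual model):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM

# toy stateful LSTM: batch size 1, one timestep per call, 3 input features
m = Sequential([LSTM(4, stateful=True, batch_input_shape=(1, 1, 3))])

x = np.ones((1, 1, 3))
print(m.predict(x))  # first step
print(m.predict(x))  # differs: the hidden state carried over from the first call
m.reset_states()
print(m.predict(x))  # matches the first output again after the reset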

MODEL CODE - the functional API is used:

# imports assumed for this snippet
import keras
from keras.layers import Input, TimeDistributed, Conv2D, Flatten, Dense, LSTM
from keras.optimizers import RMSprop

# batch_size, look_back, resolution, available_actions_count and
# learning_rate are defined elsewhere in my code
state_input = Input(batch_shape=(batch_size, look_back, 1, resolution[0], resolution[1]))

conv1 = TimeDistributed(Conv2D(8, 6, strides=3, activation='relu', data_format="channels_first"))(
    state_input)  # filters, kernel_size, strides
conv2 = TimeDistributed(Conv2D(8, 3, strides=2, activation='relu', data_format="channels_first"))(
    conv1)  # filters, kernel_size, strides
flatten = TimeDistributed(Flatten())(conv2)

fc1 = TimeDistributed(Dense(128, activation='relu'))(flatten)
fc2 = TimeDistributed(Dense(64, activation='relu'))(fc1)
lstm_layer = LSTM(4, stateful=True)(fc2)
fc3 = Dense(128, activation='relu')(lstm_layer)
fc4 = Dense(available_actions_count)(fc3)

model = keras.models.Model(inputs=state_input, outputs=fc4)
optimizer = RMSprop(lr=learning_rate)  # learning_rate = 0.001
model.compile(loss="mse", optimizer=optimizer)
model.summary()

Here is the model summary:

Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (1, 1, 1, 30, 45)         0         
_________________________________________________________________
time_distributed_1 (TimeDist (1, 1, 8, 9, 14)          296       
_________________________________________________________________
time_distributed_2 (TimeDist (1, 1, 8, 4, 6)           584       
_________________________________________________________________
time_distributed_3 (TimeDist (1, 1, 192)               0         
_________________________________________________________________
time_distributed_4 (TimeDist (1, 1, 128)               24704     
_________________________________________________________________
time_distributed_5 (TimeDist (1, 1, 64)                8256      
_________________________________________________________________
lstm_1 (LSTM)                (1, 4)                    1104      
_________________________________________________________________
dense_3 (Dense)              (1, 128)                  640       
_________________________________________________________________
dense_4 (Dense)              (1, 8)                    1032      
=================================================================
Total params: 36,616
Trainable params: 36,616
Non-trainable params: 0
=================================================================
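So with batch_size = 1 and look_back = 1, every predict/fit call consumes exactly one frame shaped (1, 1, 1, 30, 45). A quick illustration (the zero frame is just a placeholder):

import numpy as np

frame = np.zeros((1, 1, 1, 30, 45), dtype=np.float32)  # (batch, time, channels, height, width)
q_values = model.predict(frame, batch_size=1)           # Q-values, shape (1, available_actions_count)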

I feed in one frame at a time to fit the model. Whenever I need to predict actions, I make sure to save the model state and restore it, as shown below.

Code to fit/train the model:

# save the state (LSTM memory) so it can be restored before fitting
prev_state = get_model_states(model)
target_q = model.predict(s1, batch_size=batch_size)
# predict() advances the state of the stateful LSTM layer
q_next = model.predict(s2, batch_size=batch_size)
max_q_next = np.max(q_next, axis=1)
target_q[np.arange(target_q.shape[0]), a] = r + discount_factor * (1 - isterminal) * max_q_next
# now restore the states so the model is fitted from the state before the s1 prediction
set_model_states(model, prev_state)
model.fit(s1, target_q, batch_size=batch_size, verbose=0)
# after fitting, both the state and the weights have been updated,
# so the LSTM has already moved forward in the sequence
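For completeness, get_model_states and set_model_states are my own helpers, not Keras built-ins. A minimal sketch of how such helpers can be written, assuming they simply read and write the state variables of every stateful layer through the Keras backend:

import keras.backend as K

def get_model_states(model):
    # read the current value of every state variable of each stateful layer
    return [K.batch_get_value(layer.states)
            for layer in model.layers if getattr(layer, 'stateful', False)]

def set_model_states(model, states):
    # write previously saved values back into the stateful layers' state variables
    stateful_layers = [layer for layer in model.layers if getattr(layer, 'stateful', False)]
    for layer, saved in zip(stateful_layers, states):
        for variable, value in zip(layer.states, saved):
            K.set_value(variable, value)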

The model does not seem to learn at all; the variance between epochs remains huge. As one would expect, I reset the model states after every episode, so statefulness does not carry over between episodes. Each episode is fed in frame by frame, which is exactly why I need the stateful LSTM.
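The per-episode reset is just a reset_states() call on the model; roughly, the episode loop looks like this (num_episodes is only illustrative):

for episode in range(num_episodes):
    # ... run the episode frame by frame, predicting and fitting as above ...
    model.reset_states()  # clear the LSTM hidden/cell state so nothing leaks into the next episode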

I have tried different discount factors and learning rates. In theory, this should be a better model than vanilla DQN (a CNN over 4 stacked frames). What am I doing wrong? Any help would be appreciated.

0 Answers
