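Simple DQN too slow to train

Time: 2021-01-03 19:58:02

Tags: python tensorflow keras deep-learning openai-gym

I have been trying to solve the OpenAI Lunar Lander game with the DQN from this paper.


The problem is that training for 50 episodes takes 12 hours, so something must be wrong.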

import os
import random
import gym
import numpy as np
from collections import deque
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import Model

ENV_NAME = "LunarLander-v2"




class MyModel(Model):
    def __init__(self, input_size, output_size):
        super(MyModel, self).__init__()
        self.d1 = Dense(128, input_shape=(input_size,), activation="relu")
        self.d2 = Dense(128, activation="relu")
        self.d3 = Dense(output_size, activation="linear")

    def call(self, x):
        x = self.d1(x)
        x = self.d2(x)
        return self.d3(x)

class DQNSolver():

    def __init__(self, observation_space, action_space):
        self.exploration_rate = EXPLORATION_MAX

        self.action_space = action_space
        self.memory = deque(maxlen=MEMORY_SIZE)

        self.model = MyModel(observation_space,action_space)
        self.model.compile(loss="mse", optimizer=Adam(learning_rate=LEARNING_RATE))

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() < self.exploration_rate:
            return random.randrange(self.action_space)
        q_values = self.model.predict(state)
        return np.argmax(q_values[0])

    def experience_replay(self):
        if len(self.memory) < BATCH_SIZE:
            # not enough stored transitions to sample a full batch yet
            return
        batch = random.sample(self.memory, BATCH_SIZE)
        state_batch, q_values_batch = [], []
        for state, action, reward, state_next, terminal in batch:
            # q-value prediction for a given state
            q_values_cs = self.model.predict(state)
            # target q-value
            max_q_value_ns = np.amax(self.model.predict(state_next)[0])
            # correction on the Q value for the action used
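            if terminal:
                q_values_cs[0][action] = reward
            else:
                q_values_cs[0][action] = reward + DISCOUNT_FACTOR * max_q_value_ns
            state_batch.append(state[0])
            q_values_batch.append(q_values_cs[0])
        # train the Q network
        self.model.fit(np.array(state_batch), np.array(q_values_batch),
                        batch_size = BATCH_SIZE,
                        epochs = 1, verbose = 0)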
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)

def lunar_lander():
    env = gym.make(ENV_NAME)
    observation_space = env.observation_space.shape[0]
    action_space = env.action_space.n
    dqn_solver = DQNSolver(observation_space, action_space)
    episode = 0
    # keep scores across episodes so the running mean below is meaningful
    scores = []
    while True:
        episode += 1
        state = env.reset()
        state = np.reshape(state, [1, observation_space])
        score = 0
        while True:
            action = dqn_solver.act(state)
            state_next, reward, terminal, _ = env.step(action)
            state_next = np.reshape(state_next, [1, observation_space])
            dqn_solver.remember(state, action, reward, state_next, terminal)
            dqn_solver.experience_replay()
            state = state_next
            score += reward
            if terminal:
                print("Episode: " + str(episode) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(score))
                scores.append(score)
                break
        if np.mean(scores[-min(100, len(scores)):]) >= 195:
            print("Problem is solved in {} episodes.".format(episode))
            break
if __name__ == "__main__":
    lunar_lander()


Here are the GPU stats:

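+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   53C    P2    46W / 198W |   7718MiB /  8111MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+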

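As you can see, TensorFlow is not computing on the GPU (0% utilisation) even though it has reserved almost all of the memory. I assume this is because the input to the neural network is too small, so the CPU is used instead.

To make sure the GPU is installed correctly, I ran an example from their documentation, and it did use the GPU. You can also run it on Google Colab here.

Is there a way to make use of the GPU in this case?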
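One thing that may be worth checking here: `experience_replay` above calls `self.model.predict` twice per sampled transition, so each replay step issues 2 * BATCH_SIZE single-row forward passes. The target update it applies (`reward` for terminal transitions, `reward + DISCOUNT_FACTOR * max_q_value_ns` otherwise) can be computed for the whole batch at once. The sketch below shows only that arithmetic in NumPy. The function name `batched_q_targets` and the example numbers are hypothetical; `q_current` and `q_next` stand in for what a single `predict` call on the stacked states and stacked next states would return. Whether this removes the slowdown is not verified in this thread.

```python
import numpy as np


def batched_q_targets(q_current, q_next, actions, rewards, terminals, discount):
    """Return a copy of q_current with each taken action's entry set to its target.

    q_current, q_next: arrays of shape (batch, n_actions)
    actions, rewards, terminals: arrays of shape (batch,)
    """
    q_current = np.asarray(q_current, dtype=np.float64)
    q_next = np.asarray(q_next, dtype=np.float64)
    actions = np.asarray(actions, dtype=np.int64)
    rewards = np.asarray(rewards, dtype=np.float64)
    terminals = np.asarray(terminals, dtype=bool)

    # Discounted best next-state value, zeroed where the episode ended.
    bootstrap = discount * q_next.max(axis=1) * (~terminals)

    targets = q_current.copy()
    # Overwrite only the entry of the action that was actually taken in each row.
    targets[np.arange(len(actions)), actions] = rewards + bootstrap
    return targets


# Hypothetical two-transition batch; the second transition is terminal.
example = batched_q_targets(
    q_current=[[1.0, 2.0], [3.0, 4.0]],
    q_next=[[0.5, 1.5], [2.0, 0.0]],
    actions=[0, 1],
    rewards=[1.0, -1.0],
    terminals=[False, True],
    discount=0.5,
)
print(example)
```

In the solver this would mean stacking the sampled states with something like `np.vstack`, calling `predict` once on each stack, and passing the resulting targets to a single `fit` call.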


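It turned out that the agent was learning to fly rather than to land, so I added a maximum of 150 steps to limit the episode length, but it is still slow.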
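For reference, a step cap like the one described can be sketched as follows. This is an illustration, not the code actually used: `NeverEndingEnv` is a hypothetical stand-in that mimics the 4-tuple `step` API from the question, and `run_episode` and `max_steps` are names introduced here.

```python
class NeverEndingEnv:
    """Hypothetical environment that never reports a terminal state."""

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        # observation, reward, terminal flag, info dict
        return [float(self.t)], 1.0, False, {}


def run_episode(env, policy, max_steps):
    """Run one episode, stopping at a terminal state or after max_steps steps."""
    state = env.reset()
    score, steps = 0.0, 0
    for steps in range(1, max_steps + 1):
        state, reward, terminal, _ = env.step(policy(state))
        score += reward
        if terminal:
            break
    return score, steps


# Always choose action 0; the cap ends the episode after 150 steps.
score, steps = run_episode(NeverEndingEnv(), lambda state: 0, max_steps=150)
print(score, steps)
```

gym also provides a `TimeLimit` wrapper that enforces a maximum number of steps per episode, which may be a simpler route than a hand-written counter.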

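Running this on Colab with the GPU enabled in the notebook settings, I was able to monitor GPU usage with wandb.ai. It is trying to use the GPU, but utilisation is 0%. So I was able to reproduce the problem, as it runs very slowly for me too.

According to this post and also this one, my best guess is that the step function in the env is holding the process up. I tried increasing the batch size tenfold to see whether that would affect speed or GPU usage, but it made no difference. For a network of this size and structure, the CPU may simply be more efficient, which would explain why it is the only thing being used. That still does not explain why the paper you linked got results within a few hours, far less time than we are seeing, unless the laptop they used has a significantly faster CPU.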
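The guess about `env.step` could be checked by timing each part of the training loop separately. The helper below is a minimal sketch using only the standard library. `SectionTimer` is a hypothetical name, and the `time.sleep` calls are placeholders for the real `env.step(action)` and `dqn_solver.experience_replay()` calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulates wall-clock time and call counts per named section."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        # Sections sorted by total time, slowest first.
        return sorted(self.totals.items(), key=lambda item: item[1], reverse=True)


timer = SectionTimer()
for _ in range(3):
    with timer.section("env.step"):
        time.sleep(0.001)  # placeholder for env.step(action)
    with timer.section("experience_replay"):
        time.sleep(0.005)  # placeholder for dqn_solver.experience_replay()

for name, seconds in timer.report():
    print(f"{name}: {seconds:.4f}s over {timer.counts[name]} calls")
```

Wrapping the real calls this way for a few hundred steps should show whether the environment or the network calls dominate the runtime.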