Question

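Preface

I am trying to solve this Windy-Grid-World environment. I implemented both the Q-learning and Q(λ) algorithms, and the results are almost the same (I am looking at the number of steps per episode).

Problem:

From what I have read, I believe a higher lambda parameter should propagate updates further back through the states leading up to the reward; therefore, the number of steps should drop much more sharply than with regular Q-learning. This image shows what I am talking about.

Is this normal for this environment, or have I implemented it incorrectly?

Code: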

import matplotlib.pyplot as plt
import numpy as np
from lib.envs.windy_gridworld import WindyGridworldEnv
from collections import defaultdict

env = WindyGridworldEnv()


def epsilon_greedy_policy(Q, state, nA, epsilon):
    '''
    Create a policy in which epsilon dictates how likely it will 
    take a random action.

    :param Q: links state -> action value (dictionary)
    :param state: state character is in (int)
    :param nA: number of actions (int)
    :param epsilon: chance it will take a random move (float)
    :return: probability of each action to be taken (list)
    '''
    probs = np.ones(nA) * epsilon / nA
    best_action = np.argmax(Q[state])
    probs[best_action] += 1.0 - epsilon

    return probs

def Q_learning_lambda(episodes, learning_rate, discount, epsilon, _lambda):
    '''
    Learns to solve the environment using Q(λ)

    :param episodes: Number of episodes to run (int)
    :param learning_rate: How fast it will converge to a point (float [0, 1])
    :param discount: How much future events lose their value (float [0, 1])
    :param epsilon: chance a random move is selected (float [0, 1])
    :param _lambda: How much credit to give states leading up to reward (float [0, 1])

    :return: x,y points to graph
    '''

    # Link state to action values
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # Eligibility trace
    e = defaultdict(lambda: np.zeros(env.action_space.n))

    # Points to plot
    # number of episodes
    x = np.arange(episodes)
    # number of steps
    y = np.zeros(episodes)

    for episode in range(episodes):
        state = env.reset()

        # Select action
        probs = epsilon_greedy_policy(Q, state, env.action_space.n, epsilon)
        action = np.random.choice(len(probs), p=probs)

        for step in range(10000):

            # Take action
            next_state, reward, done, _ = env.step(action)

            # Select next action
            probs = epsilon_greedy_policy(Q, next_state, env.action_space.n, epsilon)
            next_action = np.random.choice(len(probs), p=probs)

            # Get update value
            best_next_action = np.argmax(Q[next_state])
            td_target = reward + discount * Q[next_state][best_next_action]
            td_error = td_target - Q[state][action]

            e[state][action] += 1

            # Update all states
            for s in Q:
                for a in range(len(Q[s])):

                    # Update Q value based on eligibility trace
                    Q[s][a] += learning_rate * td_error * e[s][a]

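                    # Decay eligibility trace if the greedy action is taken
                    # (compare by value with ==; `is` checks object identity,
                    # which is not reliable for NumPy integer indices)
                    if next_action == best_next_action: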
                        e[s][a] = discount * _lambda * e[s][a]
                    # Reset eligibility trace if random action taken
                    else:
                        e[s][a] = 0

            if done:
                y[episode] = step
                e.clear()
                break

            # Update action and state
            action = next_action
            state = next_state

    return x, y

If you want to see the whole thing, you can look at my Jupyter Notebook here.

Answer 1

There is nothing wrong with your implementation.

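What you implemented is Watkins's version of Q(λ). In his version, the eligibility traces are zeroed whenever a non-greedy action is taken, so backups only extend over runs of greedy actions. As noted in eligibility traces (p. 25), the drawback of Watkins's Q(λ) is that early in learning the eligibility traces are "cut" (zeroed out) frequently, so the traces provide little advantage. That may be why your Q-learning and Q(λ) show very similar performance.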
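To make the trace cutting concrete, here is a minimal sketch of one tabular Watkins's Q(λ) update. It assumes `Q` and `e` are NumPy arrays of shape (n_states, n_actions) rather than the defaultdicts used in the question, and the function name `watkins_q_lambda_step` and its parameters are hypothetical, chosen only for illustration. Terminal-state handling is left out.

```python
import numpy as np


def watkins_q_lambda_step(Q, e, s, a, r, s2, a2, alpha, gamma, lam):
    """One Watkins's Q(lambda) update; Q and e are modified in place.

    (s, a) is the pair just taken, r the reward, s2 the next state and
    a2 the action the behaviour policy has picked for s2.
    """
    # One-step Q-learning error: bootstrap from the greedy value of s2
    td_error = r + gamma * Q[s2].max() - Q[s, a]

    # Decide whether a2 is greedy before Q changes (ties count as greedy)
    next_is_greedy = Q[s2, a2] == Q[s2].max()

    # Accumulating trace for the pair that was just visited
    e[s, a] += 1.0

    # Every pair is updated in proportion to its current trace
    Q += alpha * td_error * e

    if next_is_greedy:
        e *= gamma * lam  # decay: credit keeps flowing backwards
    else:
        e.fill(0.0)       # cut: an exploratory action erases all traces

    return td_error
```

With an ε-greedy policy, every exploratory action lands in the `else` branch, so while ε is large relative to how far the goal is, the traces rarely survive long enough to differ much from one-step Q-learning.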

You could try other eligibility-trace variants, such as Peng's Q(λ) or naive Q(λ), to check whether performance improves.
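As a rough illustration of the naive variant, the sketch below keeps the Q-learning target but never zeroes the traces when an exploratory action is taken. It assumes `Q` and `e` are NumPy arrays of shape (n_states, n_actions); the function name `naive_q_lambda_step` and its parameters are hypothetical, and terminal-state handling is omitted. Peng's Q(λ) needs a more involved update and is not shown.

```python
import numpy as np


def naive_q_lambda_step(Q, e, s, a, r, s2, alpha, gamma, lam):
    """One naive Q(lambda) update; Q and e are modified in place."""
    # Same one-step Q-learning error as in plain Q-learning
    td_error = r + gamma * Q[s2].max() - Q[s, a]

    # Accumulating trace for the pair that was just visited
    e[s, a] += 1.0

    # Every pair is updated in proportion to its current trace
    Q += alpha * td_error * e

    # Traces always decay and are never cut, whatever the next action is
    e *= gamma * lam

    return td_error
```

Because the traces survive exploratory steps, credit can reach earlier state-action pairs even when ε is large, at the cost of mixing in some off-policy experience that Watkins's version deliberately discards.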

Reinforcement learning: Q and Q(λ) speed difference in the Windy Grid World environment
