How to take the best action instead of a random action

Asked: 2019-04-05 06:31:25

Tags: python python-3.x reinforcement-learning

My agent keeps taking random actions, so the algorithm is not training properly. How can I make sure it takes the best action, which is stored in the line "next_action, ArgMax = custom_argmax(Q_value)"? The function custom_argmax computes the maximum Q-value found for each state-action pair.

import random
import numpy as np

# env, weights, LEGAL_MOVES, feature_space, weights_multiplied_by_features,
# associated_Q_value and custom_argmax are defined elsewhere in my code.

max_episodes = 10
max_steps_per_episode = 1000

discount_rate = 0.99
exploration_rate = 0.5
max_exploration_rate = 1
min_exploration_rate = 0.1
learning_rate = 0.01
explore_decay_rate = 0.2
errors = []


def play_single_game(max_steps_per_episode, render):
    global errors

    state = env.reset()
    # print('We are resetting: ' )

    action = env.action_space.sample()

    for step in range(max_steps_per_episode - 1):

        # if episode == max_episodes - 1:
        if render:
            env.render()

        # print("This is the Ac:",  a)
        '''
        if step%2 == 0:
            a = 1
        else:
            a = 1
        '''
        new_state, reward, done, info = env.step(action)  # take the current action, observe the next state and reward
        # print(info)
        next_state = new_state
        # print(reward)
        old_weights = weights.theta.copy()

        if done:
            # terminal transition: the target is just the reward (no bootstrapped next-state value)
            weights.theta += learning_rate * (reward - weights_multiplied_by_features(state, action)) * feature_space(state, action)
            # print("we are done")
            break
        else:
            # not finished
            Q_value = associated_Q_value(next_state)  # Q-values of the next state, one per action

            exploration_rate_threshold = random.uniform(0, 1)

            next_action, ArgMax = custom_argmax(Q_value)  # greedy action and its (maximum) Q-value

            if exploration_rate_threshold < exploration_rate:  # take random

                r = random.randint(0, len(LEGAL_MOVES) - 1)

                next_action = r

            # we will update Q(s,a) AS we experience the episode
            weights.theta += learning_rate * (reward + discount_rate * ArgMax - weights_multiplied_by_features(state, action)) * feature_space(state, action)

            # next state becomes current state
            state = next_state
            action = next_action

            change_in_weights = np.abs(weights.theta - old_weights).sum()
            errors.append(change_in_weights)
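
For context, custom_argmax and associated_Q_value are not shown above; from the way they are used (next_action, ArgMax = custom_argmax(Q_value) feeding the bootstrapped target reward + discount_rate * ArgMax), custom_argmax presumably returns the greedy action index together with its maximum Q-value. A minimal sketch of such a helper, assuming Q_value is a 1-D array with one entry per legal action:

import numpy as np

def custom_argmax(q_values):
    # Return (greedy_action_index, max_q_value) for a 1-D array of Q-values.
    # Ties are broken at random so the agent does not always favour action 0.
    q_values = np.asarray(q_values, dtype=float)
    best = np.flatnonzero(q_values == q_values.max())
    action = int(np.random.choice(best))
    return action, q_values[action]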

1 Answer:

Answer 0 (score: 1):

You are doing epsilon-greedy exploration. You have apparently set exploration_rate = 0.5, so your agent will always take a random action 50% of the time. That is probably too high, but it does not mean your agent is not learning.
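
For illustration, the selection rule the question's loop implements is roughly the following (a minimal sketch; custom_argmax, Q_value and LEGAL_MOVES are the names from the question):

import random

def choose_action(Q_value, exploration_rate):
    # With probability exploration_rate, explore: pick a uniformly random action.
    if random.uniform(0, 1) < exploration_rate:
        return random.randint(0, len(LEGAL_MOVES) - 1)
    # Otherwise exploit: pick the action with the highest Q-value.
    next_action, _ = custom_argmax(Q_value)
    return next_action

With exploration_rate = 0.5 the first branch is taken about half the time, which is why the agent so often appears to act randomly.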

If you want to evaluate your agent properly, you have to run episodes with exploration disabled. You cannot simply remove the random actions altogether, because then the agent might never try other actions; that is the exploration/exploitation tradeoff. However, you can decay the exploration rate while the agent is learning, e.g. exploration_rate *= 0.999 inside your loop, or something similar (see the sketch below).
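
A minimal sketch of both ideas, reusing the names from the question (play_single_game, exploration_rate, min_exploration_rate); the 0.999 decay factor is only an example value:

# Training: decay epsilon after every episode so the agent explores a lot
# at first and relies on its learned Q-values more and more over time.
for episode in range(max_episodes):
    play_single_game(max_steps_per_episode, render=False)
    exploration_rate = max(min_exploration_rate, exploration_rate * 0.999)

# Evaluation: run a separate episode with exploration switched off, so you
# measure what the greedy policy has actually learned.
exploration_rate = 0.0
play_single_game(max_steps_per_episode, render=True)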