My agent keeps taking random actions, so the algorithm is not training properly. How can I make sure it takes the best action, which is stored in the line "next_action, ArgMax = custom_argmax(Q_value)"? The function custom_argmax returns the maximum Q-value found over the (state, action) pairs for a given state.
import random
import numpy as np

max_episodes = 10
max_steps_per_episode = 1000
discount_rate = 0.99
exploration_rate = 0.5
max_exploration_rate = 1
min_exploration_rate = 0.1
learning_rate = 0.01
explore_decay_rate = 0.2
errors = []
def play_single_game(max_steps_per_episode, render):
    global errors
    state = env.reset()
    action = env.action_space.sample()  # start the episode with a random action
    for step in range(max_steps_per_episode - 1):
        if render:
            env.render()
        new_state, reward, done, info = env.step(action)  # take the current action, observe the transition
        next_state = new_state
        old_weights = weights.theta.copy()
        if done:
            # Terminal update: no next state to bootstrap from.
            weights.theta += learning_rate * (reward - weights_multiplied_by_features(state, action)) * feature_space(state, action)
            break
        else:
            # Not finished: pick the greedy action for the next state.
            Q_value = associated_Q_value(next_state)
            exploration_rate_threshold = random.uniform(0, 1)
            next_action, ArgMax = custom_argmax(Q_value)  # best action and its Q-value
            if exploration_rate_threshold < exploration_rate:  # explore: take a random action instead
                r = random.randint(0, len(LEGAL_MOVES) - 1)
                next_action = r
            # We update Q(s, a) as we experience the episode.
            weights.theta += learning_rate * (reward + discount_rate * ArgMax - weights_multiplied_by_features(state, action)) * feature_space(state, action)
            # The next state/action become the current state/action.
            state = next_state
            action = next_action
            change_in_weights = np.abs(weights.theta - old_weights).sum()
            errors.append(change_in_weights)
Answer (score: 1)
You are doing epsilon-greedy exploration. You have apparently set exploration_rate = 0.5, so your agent will always take a random action 50% of the time. That is probably too high, but it does not mean your agent isn't learning.
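For reference, a minimal sketch of that epsilon-greedy choice, reusing custom_argmax, associated_Q_value, and LEGAL_MOVES from the question (assumed here to behave as in the posted snippet):

import random

def epsilon_greedy_action(state, exploration_rate):
    # With probability exploration_rate, ignore the Q-values and explore.
    if random.uniform(0, 1) < exploration_rate:
        return random.randint(0, len(LEGAL_MOVES) - 1)
    # Otherwise exploit: pick the action with the highest estimated Q-value.
    q_values = associated_Q_value(state)            # helper assumed from the question
    best_action, _best_q = custom_argmax(q_values)  # helper assumed from the question
    return best_action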
If you want to evaluate your agent properly, you have to run episodes with exploration disabled. You can't simply disable random actions during training, because then the agent may never try other actions; this is the exploration/exploitation tradeoff. What you can do, however, is decay the exploration rate while the agent is learning, e.g. exploration_rate *= 0.999 inside your loop, or something similar.
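A sketch of what that could look like around your snippet (assumptions: play_single_game reads the module-level exploration_rate, as in the posted code, and the episode loop here stands in for whatever outer loop you already have):

exploration_rate = max_exploration_rate

for episode in range(max_episodes):
    # Train one episode with the current amount of exploration.
    play_single_game(max_steps_per_episode, render=False)
    # Decay exploration multiplicatively, but never below the minimum.
    exploration_rate = max(min_exploration_rate, exploration_rate * 0.999)

# Evaluation: with exploration_rate = 0 the agent always takes custom_argmax's action.
exploration_rate = 0.0
play_single_game(max_steps_per_episode, render=True)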