I am using OpenAI's Gym environment together with Christian Kauten's Super Mario Bros code to teach an agent how to play the game. In the second code block shown below, the action passed to env.step gives the following error: Expected type 'int', got 'ndarray[int]' instead.

When running the code I also get: "Python[19457:566482] ApplePersistenceIgnoreState: Existing state will not be touched. New state will be written to (null)".

How can I fix this?
The Q-learning algorithm is as follows:
action_space_size = len(SIMPLE_MOVEMENT)
state_space_size = 10000
q_table = np.zeros((state_space_size, action_space_size))

max_episodes = 1
max_steps_per_episode = 10
learning_rate = 0.1
discount_rate = 0.99
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
explore_decay_rate = 0.01

reward_all_episodes = []
curr_available_action = 0
curr_available_state = 0

for episode in range(max_episodes):
    state = env.reset()
    # dict_state[state] = curr_available_state
    curr_available_state += 1
    done = False
    rewards_curr_episode = 0
    for step in range(max_steps_per_episode):
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state, :])
            print(action)
        else:
            action = env.action_space.sample()
        new_state, reward, done, info = env.step(action)
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (
            reward + discount_rate * np.max(q_table[new_state, :]))
        state = new_state
        rewards_curr_episode += reward
        if done:
            break
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(
        -explore_decay_rate * episode)
    reward_all_episodes.append(rewards_curr_episode)
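To double-check the update rule itself, here is a standalone sketch of the same Q-learning update on a toy table (just NumPy, no emulator; all numbers are made up for the example):

```python
import numpy as np

learning_rate = 0.1
discount_rate = 0.99

# Toy Q-table: 3 states x 2 actions, with pretend values for the next state.
q_table = np.zeros((3, 2))
q_table[1, :] = [0.5, 2.0]

state, action, reward, new_state = 0, 1, 1.0, 1

# Same update as above: blend the old value with the bootstrapped target.
q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (
    reward + discount_rate * np.max(q_table[new_state, :]))

print(q_table[state, action])  # 0.1 * (1.0 + 0.99 * 2.0) ≈ 0.298
```

So the arithmetic of the update seems fine on its own; the problem only appears once the real environment's states and actions come in.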
The code that plays Mario is as follows:
for episode in range(1):
    state = env.reset()
    done = False
    print("***Episode", episode + 1, "***\n\n\n\n\n")
    time.sleep(1)
    rewards_curr_episode = 0
    for step in range(max_steps_per_episode):
        # clear_output(wait = True)
        env.render()
        time.sleep(0.3)
        action = np.argmax(q_table[state, :])
        new_state, reward, done, info = env.step(action)
        rewards_curr_episode += reward
        if done:
            # clear_output(wait=True)
            env.render()
            print("Total Reward is: " + str(rewards_curr_episode))
            time.sleep(3)
            break
        state = new_state

env.close()
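As a minimal reproduction of the type warning (just NumPy, no emulator needed): np.argmax returns a NumPy integer, which some type checkers annotate as ndarray rather than int, and that seems to be what env.step is complaining about:

```python
import numpy as np

q_table = np.zeros((10000, 7))

# np.argmax gives back a NumPy integer scalar, not a plain Python int.
action = np.argmax(q_table[0, :])
print(type(action))    # a NumPy integer type, not <class 'int'>

# Casting to a built-in int before env.step(action) is one possible fix.
action = int(action)
print(type(action))    # <class 'int'>
```

Is casting like this the right approach here, or is there something else wrong with how I index the Q-table?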
The actions are defined as follows:
SIMPLE_MOVEMENT = [
    ['NOOP'],
    ['right'],
    ['right', 'A'],
    ['right', 'B'],
    ['right', 'A', 'B'],
    ['A'],
    ['left'],
]