(Cross-posted at: https://ai.stackexchange.com/questions/15693/state-reward-per-step-in-a-multiagnet-environment)
In a single-agent environment, the agent takes an action, then observes the next state and receives a reward:
    for ep in range(num_episodes):
        action = dqn.select_action(state)
        next_state, reward = env.step(action)
Put simply, the logic that advances the simulation (env) forward is embedded inside the env.step() function.
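To make that concrete, here is a minimal sketch of such an environment. The class, its dynamics, and its reward are my own invented illustration, not part of the setup above; the point is only that step() consumes an action, advances the internal simulation state, and returns the resulting observation and reward:

    class ToyEnv:
        """Illustrative only: a 1-D walk where step() hides all the dynamics."""

        def __init__(self, size=10):
            self.size = size
            self.pos = 0

        def reset(self):
            self.pos = 0
            return self.pos

        def step(self, action):
            # All simulation-advancing logic lives in here: apply the action,
            # update the state, and compute the reward for this transition.
            self.pos = max(0, min(self.size - 1, self.pos + action))
            reward = 1.0 if self.pos == self.size - 1 else 0.0
            return self.pos, reward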
Now, in a multi-agent scenario, agent 1 ($a_1$) must make a decision at time $t_{1a}$ that will conclude at time $t_{2a}$, while agent 2 ($a_2$) must make a decision at time $t_{1b}$ that concludes at time $t_{2b}$, and these intervals need not coincide.
If both of their actions were to start and end at the same time, this could easily be implemented as:
    for ep in range(num_episodes):
        action1, action2 = dqn.select_action([state1, state2])
        next_state_1, reward_1, next_state_2, reward_2 = env.step([action1, action2])
since env can execute both actions in parallel and wait until they complete before returning the next states and rewards. But in the scenario I described earlier, it is unclear (at least to me) how to implement this. Here we need to track time explicitly and check, at every time step, whether an agent needs to make a decision. Concretely:
    for ep in range(num_episodes):
        for t in range(total_time):
            action1 = dqn.select_action(state1)
            env.step(action1)  # this step might take 5 timesteps to complete,
            # so step() won't return the reward until 5 timesteps later.
            # In the meantime, agent 2 arrives and has to make a decision;
            # its reward and next state won't be observed until 10 timesteps later.
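One way to frame this bookkeeping is a minimal sketch of my own (every name here, such as AsyncMultiAgentEnv, needs_decision, and submit, is hypothetical rather than any standard API): the environment keeps one pending action per agent together with its completion time, and step(t) returns transitions only for the agents whose actions finish at tick t:

    class AsyncMultiAgentEnv:
        def __init__(self, num_agents):
            self.num_agents = num_agents
            self.pending = {}  # agent_id -> (action, completion_time)

        def needs_decision(self, agent_id):
            # An agent only acts when it has no action in flight.
            return agent_id not in self.pending

        def submit(self, agent_id, action, t, duration):
            # Record the action; it will complete `duration` ticks from now.
            self.pending[agent_id] = (action, t + duration)

        def step(self, t):
            # Advance the simulation one tick and return a transition for
            # every agent whose action completes at this tick.
            finished = []
            for agent_id, (action, t_done) in list(self.pending.items()):
                if t_done <= t:
                    del self.pending[agent_id]
                    next_state = self._observe(agent_id)  # placeholder hook
                    reward = self._reward(agent_id)       # placeholder hook
                    finished.append((agent_id, next_state, reward))
            return finished

        def _observe(self, agent_id):
            return 0  # placeholder state

        def _reward(self, agent_id):
            return 0.0  # placeholder reward

The outer loop would then ask each idle agent for a decision and collect whatever transitions completed this tick, e.g. (store_transition is again hypothetical):

    env = AsyncMultiAgentEnv(num_agents=2)
    for ep in range(num_episodes):
        for t in range(total_time):
            for agent_id in range(env.num_agents):
                if env.needs_decision(agent_id):
                    state = env._observe(agent_id)
                    action = dqn.select_action(state)
                    duration = 5 if agent_id == 0 else 10  # e.g. 5t vs 10t actions
                    env.submit(agent_id, action, t, duration)
            for agent_id, next_state, reward in env.step(t):
                dqn.store_transition(agent_id, next_state, reward)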
To summarize: how does one implement a multi-agent environment with asynchronous actions and rewards per agent?