State, reward per step in a multiagent environment

Date: 2019-10-02 14:59:42

Tags: reinforcement-learning

(Cross-posted: https://ai.stackexchange.com/questions/15693/state-reward-per-step-in-a-multiagnet-environment)

In a single-agent environment, the agent takes an action, then observes the next state and a reward:

for ep in range(num_episodes):
    action = dqn.select_action(state)
    next_state, reward = env.step(action)

Simply put, the logic that moves the simulation (env) forward is embedded inside the env.step() function.
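
For context, here is a minimal sketch of what step() typically hides; the ToyEnv class and its dynamics below are made up purely for illustration:

class ToyEnv:
    """A made-up single-agent environment; only the shape of step() matters."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # everything that advances the simulation lives in here:
        # apply the action, update the internal state, compute the reward
        self.state += action
        reward = -abs(self.state)  # toy reward signal
        return self.state, reward

env = ToyEnv()
next_state, reward = env.step(action=1)

The important point is that env.step() both advances the world and immediately returns the consequence, which is exactly what breaks down once actions have different durations.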

Now, in a multiagent scenario, agent 1 ($a_1$) has to make a decision at time $t_{1a}$, and that action concludes at time $t_{2a}$, while agent 2 ($a_2$) makes its decision at time $t_{1b} < t_{2a}$ and its action concludes at a different time $t_{2b}$.

If both of their actions started and ended at the same time, this could easily be implemented as:

for ep in range(num_episodes):
    action1, action2 = dqn.select_action([state1, state2])
    next_state_1, reward_1, next_state_2, reward_2 = env.step([action1, action2])

because the env can execute both actions in parallel, wait until both are done, and only then return the next states and rewards. But in the scenario I described earlier, it is not clear (at least to me) how this should be implemented. There, we need to track time explicitly and check at every time point whether an agent needs to make a decision. Concretely:

for ep in range(num_episodes):
    for t in range(total_time):
        action1 = dqn.select_action(state1)
        env.step(action1)  # this step might take 5t to complete;
        # as such, step() won't return the reward till 5t later.
        # In the meantime, agent 2 comes along and has to make a decision;
        # its reward and next state won't be observed till 10t later.
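
One concrete way to realize the explicit time-tracking loop above is an event queue keyed by the tick at which each pending action resolves. Everything below (ToyAsyncEnv, its methods idle_agents/observe/begin/outcome/advance, and the durations) is hypothetical and only meant to show the pattern, not a real API:

import heapq
import itertools
import random

class ToyAsyncEnv:
    """Hypothetical environment where actions take a variable number of ticks."""
    def __init__(self, n_agents=2):
        self.t = 0
        self.busy_until = [0] * n_agents  # tick at which each agent frees up

    def idle_agents(self):
        return [i for i, b in enumerate(self.busy_until) if b <= self.t]

    def observe(self, agent_id):
        return (self.t, agent_id)  # dummy per-agent state

    def begin(self, agent_id, action):
        duration = 5 if agent_id == 0 else 10  # e.g. a_1 takes 5t, a_2 takes 10t
        self.busy_until[agent_id] = self.t + duration
        return duration

    def outcome(self, agent_id):
        return self.observe(agent_id), random.random()  # next_state, reward

    def advance(self):
        self.t += 1

def select_action(state):
    return random.choice([0, 1])  # stand-in for dqn.select_action

env = ToyAsyncEnv()
seq = itertools.count()  # tie-breaker so the heap never compares states
pending = []             # entries: (finish_tick, seq, agent_id, state, action)

for _ in range(100):
    # 1) resolve every action that completes by the current tick;
    #    only now do we observe that agent's reward and next state
    while pending and pending[0][0] <= env.t:
        _, _, agent_id, state, action = heapq.heappop(pending)
        next_state, reward = env.outcome(agent_id)
        # here you would store (state, action, reward, next_state) and train

    # 2) each idle agent makes a decision now; its outcome arrives later
    for agent_id in env.idle_agents():
        state = env.observe(agent_id)
        action = select_action(state)
        finish = env.t + env.begin(agent_id, action)
        heapq.heappush(pending, (finish, next(seq), agent_id, state, action))

    env.advance()

With this layout, the transition stored for each agent spans the full duration of its action, so the two agents can learn at completely independent cadences.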

To sum up: how does one implement a multiagent environment with asynchronous actions/rewards per agent?

0 Answers:

No answers yet.