Question

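I am currently working on a problem in which I need to take actions sequentially for different players in a Gym environment. The players are homogeneous, so I want to train a single model that can be used to decide the actions of the different players one after another. I chose this approach because I do not want to train an individual model for each of the homogeneous players, and because using a multi-discrete action space fails when I want to scale to hundreds of players.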

I would like to do this in the following way:

For each player n:

Select state of the player n and determine action
Perform action and update state for player n
Determine reward based on the new state of player n
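
The three steps above could be sketched roughly as follows. This is only an illustration under assumptions, not a working Gym environment: `PlayerState`, `shared_policy`, `step_player`, `reward_for`, and the integer position/goal state are all hypothetical placeholders, and `shared_policy` stands in for the single trained model.

```python
from dataclasses import dataclass


@dataclass
class PlayerState:
    """Hypothetical per-player state: a position and that player's own goal."""
    position: int
    goal: int


def shared_policy(state: PlayerState) -> int:
    """Placeholder for the one shared model: maps ONE player's state to an action."""
    if state.position < state.goal:
        return 1
    if state.position > state.goal:
        return -1
    return 0


def step_player(state: PlayerState, action: int) -> PlayerState:
    """Apply the action to this player only and return the updated state."""
    return PlayerState(state.position + action, state.goal)


def reward_for(state: PlayerState) -> int:
    """Reward depends only on the acting player's new state (assumed: negative distance to goal)."""
    return -abs(state.goal - state.position)


def run_round(players: list) -> list:
    """Visit the players one after another, using the same policy for each."""
    rewards = []
    for n, state in enumerate(players):
        action = shared_policy(state)              # select state of player n, determine action
        players[n] = step_player(state, action)    # perform action, update state for player n
        rewards.append(reward_for(players[n]))     # reward based on the new state of player n
    return rewards


players = [PlayerState(0, 3), PlayerState(5, 2), PlayerState(4, 4)]
rewards = run_round(players)
```

Note that in this sketch the state seen at the next decision belongs to player n+1, not to the player who just acted, which is exactly the point the question is about.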

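Usually in RL, the "updated state" is used to determine the next action. In my problem, however, I want to select each action based on the state of a single player, and that state differs from the states of the other players (their goals are different).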

My question is: