Python TensorFlow DQN next steps

Time: 2019-03-23 14:05:34

Tags: python tensorflow neural-network reinforcement-learning q-learning

I can't figure out the next step for my Deep Q-Network. I'm trying to optimize a bus route. I have a distance matrix and data on stop popularity.

The `distance` matrix is a 2D array that details the distance between every pair of stops. With 4 stops it would look like this:

distance = np.array([[0, stop1-stop2, stop1-stop3, stop1-stop4],
                    [stop2-stop1, 0, stop2-stop3, stop2-stop4],
                    [stop3-stop1, stop3-stop2, 0, stop3-stop4],
                    [stop4-stop1, stop4-stop2, stop4-stop3, 0]])

The rewards matrix is simple:

(1/distance) * (percent of total riders who get on and off at specific stop)

This is meant to ensure that stops that are a short distance away and have many riders get the highest reward values.
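A sketch of how that rewards matrix could be computed, assuming a small made-up 4-stop example; the `popularity` array (each stop's share of total riders) and all numbers here are illustrative, not from the question:

```python
import numpy as np

# Hypothetical pairwise distances between 4 stops (zeros on the diagonal).
distance = np.array([[0., 2., 5., 3.],
                     [2., 0., 4., 6.],
                     [5., 4., 0., 1.],
                     [3., 6., 1., 0.]])

# Hypothetical fraction of total riders who board/alight at each stop.
popularity = np.array([0.4, 0.3, 0.2, 0.1])

# (1/distance) * popularity, keeping 0 reward for staying at the same stop.
with np.errstate(divide='ignore'):
    inv_distance = np.where(distance > 0, 1.0 / distance, 0.0)

# Broadcasting multiplies each column j by the popularity of destination stop j.
rewards = inv_distance * popularity
```

Each row of `rewards` is then the observation for a bus currently at that stop: near, popular destinations score highest.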

I have a class for each stop. These track how many people are waiting at each stop, and the stops are periodically updated with more people. When the bus "arrives" at a stop, its number of waiting riders goes to 0, so its reward becomes 0 until more people show up.
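A minimal sketch of what such a stop class might look like; the names `BusStop`, `waiters`, `update`, and `arrive` are my assumptions, not taken from the question:

```python
class BusStop:
    """Tracks how many riders are waiting at a single stop."""

    def __init__(self, arrival_rate):
        self.arrival_rate = arrival_rate  # riders added per periodic update
        self.waiters = 0

    def update(self):
        # Called periodically: more riders show up at the stop.
        self.waiters += self.arrival_rate

    def arrive(self):
        # The bus arrives: everyone waiting boards, so the stop's
        # contribution to the reward drops to 0 until the next update.
        boarded = self.waiters
        self.waiters = 0
        return boarded
```

The reward row for the agent's current stop would then be scaled by each stop's current `waiters`, which is what makes a just-visited stop worthless until it refills.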

I set up the model with the following code:

    import tensorflow as tf

    # Current game states. Rows of the rewards matrix corresponding to the agent's current stop. Inputs to the neural network.
    observations = tf.placeholder('float32', shape=[None, num_stops])

    # Actions. A number from 0 to number of stops - 1, denoting which stop the agent traveled to from its current location.
    actions = tf.placeholder('int32', shape=[None])

    # These are the rewards received by the agent for its decisions. +1 if the agent 'wins' the game (gets the system score to 0; this will only happen if bus stops are not updated periodically).
    rewards = tf.placeholder('float32', shape=[None])  # +1, -1 with discounts


# Model


    # First layer of the neural network: takes the observations tensor as input and has 200 hidden units. This number is arbitrary; I'm not sure how to tune it for peak performance.
    Y = tf.layers.dense(observations, 200, activation=tf.nn.relu)

From here I'm not sure what to do. I want to run the neural network in batches, rather than updating the weights after each bus action (moving from one stop to another). Instead, I want to wait until a full "game" is complete, e.g. the bus performs a predetermined number of actions before the game ends. A reward would be given if the game is won, e.g. the bus reaches every stop within the predetermined number of moves. I'm thinking of using +1 to keep things simple. Earlier actions would be discounted at a discount rate.
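The discounting described above could be sketched like this, applied once per finished game; the function name and the choice of `gamma = 0.99` are my assumptions:

```python
import numpy as np

def discount_rewards(episode_rewards, gamma=0.99):
    """Propagate the end-of-game reward backward through the episode,
    so earlier actions receive a gamma-discounted share of the final +1."""
    discounted = np.zeros(len(episode_rewards), dtype='float32')
    running = 0.0
    for t in reversed(range(len(episode_rewards))):
        running = episode_rewards[t] + gamma * running
        discounted[t] = running
    return discounted

# A won 4-step game: only the final action is directly rewarded with +1;
# earlier steps receive gamma, gamma^2, ... of that reward.
print(discount_rewards([0.0, 0.0, 0.0, 1.0]))
```

The resulting array would be fed into the `rewards` placeholder for the whole batch of (observation, action) pairs from that game.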

My reasoning is that I want the agent to see the long-term effects of its individual actions. I saw this in a paper about an agent learning to play Pong, and I'm trying to implement a similar agent to play my system. Thanks in advance for your help.

0 Answers:

There are no answers.