Time: 2016-09-26 00:44:52

Tags: neural-network artificial-intelligence deep-learning encog q-learning

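I am trying to implement the deep Q-learning algorithm for a Pong game. I have already implemented Q-learning using a table as the Q-function. It works well and learns to beat the naive AI within 10 minutes. But I can't make it work using a neural network as the Q-function approximator.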

I want to know whether I am on the right track, so here is a summary of what I am doing:

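I store the current state, the action taken, and the reward as the current experience in the replay memory.
I use a multilayer perceptron as the Q-function, with 1 hidden layer of 512 hidden units. For the input -> hidden layer I use a sigmoid activation function. For the hidden -> output layer I use a linear activation function.
A state is represented by the positions of both players and the ball, plus the velocity of the ball. Positions are remapped to a much smaller state space.
I use an epsilon-greedy approach to explore the state space, with epsilon gradually decreasing to 0.
When learning, a random batch of 32 subsequent experiences is selected. Then I compute the target Q-values for all the current states and actions Q(s, a).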

forall Experience e in batch
    if e == endOfEpisode
        target = e.getReward
    else
        target = e.getReward + discountFactor * qMaxPostState
end
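The pseudocode above could be sketched in Java roughly as follows. This is an illustration under assumptions, not the asker's code: the `Experience` fields, the `QFunction` interface, and the `QTargets` class are hypothetical names, and the sketch assumes each experience also stores the post state and that the network outputs one Q-value per action. Under that last assumption, a common choice is to copy the network's current prediction and overwrite only the entry of the action that was taken, so the other outputs contribute no error.

```java
/** Hypothetical container for one transition; field names are assumptions. */
final class Experience {
    final double[] state;
    final int action;
    final double reward;
    final double[] postState;
    final boolean endOfEpisode;

    Experience(double[] state, int action, double reward,
               double[] postState, boolean endOfEpisode) {
        this.state = state;
        this.action = action;
        this.reward = reward;
        this.postState = postState;
        this.endOfEpisode = endOfEpisode;
    }
}

/** Hypothetical stand-in for the network: one Q-value per action for a state. */
interface QFunction {
    double[] predict(double[] state);
}

final class QTargets {
    /** Q-learning target for the action taken in experience e. */
    static double target(Experience e, QFunction q, double discountFactor) {
        if (e.endOfEpisode) {
            return e.reward; // no future reward after a terminal state
        }
        double qMaxPostState = Double.NEGATIVE_INFINITY;
        for (double v : q.predict(e.postState)) {
            qMaxPostState = Math.max(qMaxPostState, v);
        }
        return e.reward + discountFactor * qMaxPostState;
    }

    /** Training vector: current prediction with only the taken action replaced. */
    static double[] targetVector(Experience e, QFunction q, double discountFactor) {
        double[] out = q.predict(e.state).clone();
        out[e.action] = target(e, q, discountFactor);
        return out;
    }
}
```

Each of the 32 experiences in a batch would then yield one input (`e.state`) and one target vector for the supervised training step.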

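Now I have a set of 32 target Q-values, and I train the neural network on those values using batch gradient descent. I only do 1 training step. How many should I do?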

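I am programming in Java and using Encog for the multilayer perceptron implementation. The problem is that training is very slow and performance is very weak. I think I am missing something, but I can't figure out what. I would expect at least a somewhat decent result, since the table approach has no problems.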
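For reference, the replay memory described in the question might look like the sketch below. The class name `ReplayMemory`, the ring-buffer eviction, and uniform sampling with replacement are all assumptions; the question does not say how its memory is bounded or sampled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Fixed-capacity replay memory that overwrites the oldest entry once full. */
final class ReplayMemory<T> {
    private final List<T> items = new ArrayList<>();
    private final int capacity;
    private final Random rnd;
    private int next = 0; // index of the slot that will be written next

    ReplayMemory(int capacity, long seed) {
        this.capacity = capacity;
        this.rnd = new Random(seed);
    }

    void add(T item) {
        if (items.size() < capacity) {
            items.add(item);
        } else {
            items.set(next, item); // overwrite the oldest stored item
        }
        next = (next + 1) % capacity;
    }

    int size() {
        return items.size();
    }

    /** Uniformly samples batchSize items with replacement; memory must be non-empty. */
    List<T> sample(int batchSize) {
        List<T> batch = new ArrayList<>(batchSize);
        for (int i = 0; i < batchSize; i++) {
            batch.add(items.get(rnd.nextInt(items.size())));
        }
        return batch;
    }
}
```

Sampling uniformly at random, rather than taking consecutive experiences, is the usual way to reduce correlation between the samples in a batch.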

Answer 0 (score: 3)

Try using ReLU (or better, Leaky ReLU) units in the hidden layer and a linear activation for the output.
Try changing the optimizer. Sometimes SGD with proper learning-rate decay helps. Sometimes Adam works fine.
Reduce the number of hidden units. 512 might simply be too many.
Adjust the learning rate. The more units you have, the more impact the learning rate has, since the output is the weighted sum of all neurons in the preceding layer.
Try using the local position of the ball, meaning ballY - paddleY. This can help drastically, as it reduces the data to "above or below the paddle", distinguished by the sign. Remember: if you use the local position, you won't need the player's paddle position, and the enemy's paddle position must be local too.
Instead of the velocity, you can give the network the previous state as an additional input. The network can calculate the difference between those 2 steps.
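A minimal sketch of the last two suggestions and of the Leaky ReLU activation, in plain Java. The class and parameter names are hypothetical, the negative slope of 0.01 is an assumed default, and expressing the enemy paddle relative to the player's paddle is only one possible reading of "local".

```java
final class PongFeatures {
    /** Leaky ReLU: identity for positive inputs, small slope (assumed 0.01) otherwise. */
    static double leakyRelu(double x) {
        return x > 0 ? x : 0.01 * x;
    }

    /**
     * Builds a state in coordinates local to the player's paddle.
     * The sign of the second entry says whether the ball is above or below the paddle.
     */
    static double[] localState(double ballX, double ballY,
                               double paddleY, double enemyPaddleY) {
        return new double[] {
            ballX,
            ballY - paddleY,        // local ball position
            enemyPaddleY - paddleY  // enemy paddle, also local (assumption)
        };
    }

    /** Concatenates the current and previous state, replacing an explicit velocity input. */
    static double[] withPrevious(double[] current, double[] previous) {
        double[] out = new double[current.length + previous.length];
        System.arraycopy(current, 0, out, 0, current.length);
        System.arraycopy(previous, 0, out, current.length, previous.length);
        return out;
    }
}
```

With this layout the network input has 6 values per step (3 current, 3 previous) and no absolute paddle position.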

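Answer 1 (score: 2)

I am using a multilayer perceptron as the Q-function, with 1 hidden layer of 512 hidden units.

That might be too big. It depends on your input/output dimensions and on the problem. Have you tried fewer?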

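Sanity checks

Can the network learn the necessary function?

Collect ground-truth inputs and outputs. Fit the network in a supervised way. Does it give the desired output?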
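One way to run such a check without any RL in the loop is sketched below. `TinyMlp` is a hypothetical, deliberately small stand-in for the real network (one hidden layer with Leaky ReLU, a single linear output, plain per-sample gradient descent); it is not Encog code. In the asker's setting, the ground truth could plausibly come from the Q-table that already works.

```java
import java.util.Random;

/** Tiny one-hidden-layer MLP with a linear output, for supervised sanity checks. */
final class TinyMlp {
    final double[][] w1; // hidden x input weights
    final double[] b1;   // hidden biases
    final double[] w2;   // hidden -> output weights
    double b2;           // output bias

    TinyMlp(int inputs, int hidden, long seed) {
        Random rnd = new Random(seed);
        w1 = new double[hidden][inputs];
        b1 = new double[hidden];
        w2 = new double[hidden];
        for (int h = 0; h < hidden; h++) {
            for (int i = 0; i < inputs; i++) w1[h][i] = rnd.nextGaussian() * 0.5;
            w2[h] = rnd.nextGaussian() * 0.5;
        }
    }

    private double pre(int h, double[] x) {
        double z = b1[h];
        for (int i = 0; i < x.length; i++) z += w1[h][i] * x[i];
        return z;
    }

    private static double act(double z) { return z > 0 ? z : 0.01 * z; }

    private static double actGrad(double z) { return z > 0 ? 1.0 : 0.01; }

    double predict(double[] x) {
        double out = b2;
        for (int h = 0; h < w2.length; h++) out += w2[h] * act(pre(h, x));
        return out;
    }

    /** One gradient step on 0.5 * (prediction - y)^2 for a single example. */
    void trainStep(double[] x, double y, double lr) {
        double err = predict(x) - y;
        for (int h = 0; h < w2.length; h++) {
            double z = pre(h, x);
            double gradHidden = err * w2[h] * actGrad(z); // uses w2[h] before its update
            w2[h] -= lr * err * act(z);
            b1[h] -= lr * gradHidden;
            for (int i = 0; i < x.length; i++) w1[h][i] -= lr * gradHidden * x[i];
        }
        b2 -= lr * err;
    }

    void fit(double[][] xs, double[] ys, int epochs, double lr) {
        for (int e = 0; e < epochs; e++) {
            for (int n = 0; n < xs.length; n++) trainStep(xs[n], ys[n], lr);
        }
    }

    double meanSquaredError(double[][] xs, double[] ys) {
        double sum = 0;
        for (int n = 0; n < xs.length; n++) {
            double e = predict(xs[n]) - ys[n];
            sum += e * e;
        }
        return sum / xs.length;
    }
}
```

Comparing `meanSquaredError` before and after `fit` shows whether the architecture can represent the target at all. If the error stays high even with known-good targets, the problem lies in the network or its training setup rather than in the Q-learning loop.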

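A common mistake is getting the last activation function wrong. Most of the time you will want a linear activation function (as you have). Beyond that, you want the network to be as small as possible, because RL is pretty unstable: you can have 99 runs where it doesn't work and 1 where it does.

Do I explore enough?

Check how much you explore. Maybe you need more exploration, especially in the beginning?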
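To make the amount of exploration easy to inspect and tune, the epsilon schedule can be kept in one place. The following is a sketch under assumptions: `EpsilonGreedy` and its parameters are hypothetical names, and the linear decay with a configurable final value is just one common schedule (the question lets epsilon go all the way to 0).

```java
import java.util.Random;

/** Epsilon-greedy action selection with a linearly decaying epsilon. */
final class EpsilonGreedy {
    private final Random rnd;
    private final double start;
    private final double end;
    private final int decaySteps;
    private int step = 0;

    EpsilonGreedy(double start, double end, int decaySteps, long seed) {
        this.start = start;
        this.end = end;
        this.decaySteps = decaySteps;
        this.rnd = new Random(seed);
    }

    /** Current epsilon: interpolates from start to end over decaySteps, then stays at end. */
    double epsilon() {
        double frac = Math.min(1.0, (double) step / decaySteps);
        return start + frac * (end - start);
    }

    /** Random action with probability epsilon, otherwise the action with the highest Q-value. */
    int selectAction(double[] qValues) {
        double eps = epsilon();
        step++;
        if (rnd.nextDouble() < eps) {
            return rnd.nextInt(qValues.length);
        }
        int best = 0;
        for (int a = 1; a < qValues.length; a++) {
            if (qValues[a] > qValues[best]) best = a;
        }
        return best;
    }
}
```

Logging `epsilon()` alongside the episode reward makes it visible whether performance stalls only after exploration has already dropped close to its final value, which would suggest decaying more slowly.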