Question

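Consider the following problem.

Given a reward and a policy, use linear regression to find the optimal value function for a 5x5 grid world.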

Reward:

[[0, 0, 0, 0, 1],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0]]

Policy:

[['R', 'R', 'R', 'R', 'S'],
 ['R', 'R', 'R', 'R', 'U'],
 ['R', 'R', 'R', 'R', 'U'],
 ['R', 'R', 'R', 'R', 'U'],
 ['R', 'R', 'R', 'R', 'U']]

Code for function approximation

The input (i, j) is converted into a 9-dimensional feature vector to allow a better approximation.

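import numpy as np

def features(x, y):
    # Hand-crafted features of the grid position (x, y): bias, linear,
    # absolute-difference, product and quadratic terms, as a 9x1 column vector.
    return np.array([
        1, x, y, abs(x-y), x*y, x**2, y**2,
        x**2+y**2, (x+y)**2
    ]).reshape(9, 1)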
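
As a quick sanity check on these features, one could stack them for all 25 cells and inspect the resulting matrix. The sketch below assumes x and y are the row and column indices 0..4; `design_matrix` is a hypothetical helper, not part of the original code. Since `x**2+y**2` is the sum of the `x**2` and `y**2` columns, and `(x+y)**2` equals `x**2 + y**2 + 2*x*y`, the nine columns can have rank at most 7, so the regression weights are not uniquely determined.

```python
import numpy as np

def features(x, y):
    # Same feature map as in the question.
    return np.array([
        1, x, y, abs(x-y), x*y, x**2, y**2,
        x**2+y**2, (x+y)**2
    ]).reshape(9, 1)

def design_matrix(grid_size=5):
    # Hypothetical helper: one row of 9 features per grid cell (row-major order).
    rows = [features(i, j).ravel()
            for i in range(grid_size)
            for j in range(grid_size)]
    return np.array(rows, dtype=float)

X = design_matrix()
rank = np.linalg.matrix_rank(X)
print(X.shape)  # (25, 9)
print(rank)     # numerical rank of the feature matrix
```

Redundant columns do not prevent a linear model from fitting, but they do mean that many different weight vectors produce the same predictions.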

I am using Keras to build the linear regression model.

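import keras  # assumed import; `from tensorflow import keras` exposes the same names

# A single Dense unit with no activation is a linear model over the 9 features.
model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_dim=9))
model.compile(optimizer='adam', loss='MSE')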

Training code

Training is based on the TD(0) formula:

def temporal_diff_fun(i, j):
    next_i, next_j = take_action(i, j, policy[i][j])
    feats = features(i, j).reshape(1, 9)
    next_feats = features(next_i, next_j).reshape(1, 9)
    g = reward[i, j] + gamma * model.predict(next_feats)
    return alpha * (g - model.predict(feats))

for _ in range(100):
    for i in range(grid_size):
        for j in range(grid_size):
            preds = temporal_diff_fun(i, j)
            feats = features(i, j).reshape(1, 9)
            model.fit(feats, preds, epochs=1, verbose=0)
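
For comparison, note that in the loop above the regression target passed to `model.fit` is the scaled TD error `alpha * (g - V(s))`. In the textbook semi-gradient TD(0) update for a linear value function v(s) = w · φ(s), the weights instead move along φ(s) by the step size times the TD error, which amounts to nudging v(s) toward the bootstrapped target `g = r + gamma * v(s')`. The following is a minimal NumPy sketch of that single update step; `td0_update` and the toy feature vectors are hypothetical, and this is not the Keras code from the question.

```python
import numpy as np

def td0_update(w, phi, phi_next, r, gamma, alpha):
    """One semi-gradient TD(0) step for a linear value function v(s) = w @ phi(s)."""
    # TD error: bootstrapped target minus current estimate.
    td_error = r + gamma * (w @ phi_next) - (w @ phi)
    # The gradient of v(s) with respect to w is phi(s).
    return w + alpha * td_error * phi

# Toy transition with hypothetical 2-dimensional one-hot features.
w = np.zeros(2)
phi = np.array([1.0, 0.0])
phi_next = np.array([0.0, 1.0])

w_new = td0_update(w, phi, phi_next, r=1.0, gamma=0.9, alpha=0.5)
print(w_new)  # [0.5 0. ]
```

With Keras, a roughly analogous choice would be to use `g` as the regression target and let the optimizer's learning rate play the role of `alpha`; whether that alone is enough for this particular setup to converge is not something this sketch establishes.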

The problem is that this training does not produce the optimal values.

P.S.

Expected output (without function approximation):

[[6.56019806, 7.28919806, 8.09919806, 8.99919806, 9.99919806],
 [5.90409806, 6.56019806, 7.28919806, 8.09919806, 8.99919806],
 [5.31360806, 5.90409806, 6.56019806, 7.28919806, 8.09919806],
 [4.78216706, 5.31360806, 5.90409806, 6.56019806, 7.28919806],
 [4.30387016, 4.78216706, 5.31360806, 5.90409806, 6.56019806]]
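
The table above is consistent with a discount factor of 0.9: the top-right cell is close to 1 / (1 - 0.9) = 10, and the values along the path to it are close to 10 * 0.9**k for a cell k steps away. The small residual (about 0.0008) suggests the iteration was stopped before full convergence. A minimal tabular sketch under those assumptions follows. Here `take_action` is re-implemented as a hypothetical helper because the original definition is not shown, 'S' is assumed to mean "stay", and `gamma = 0.9` is inferred from the table, not stated in the question.

```python
import numpy as np

GRID = 5
reward = np.zeros((GRID, GRID))
reward[0, 4] = 1
policy = [['R', 'R', 'R', 'R', 'S'],
          ['R', 'R', 'R', 'R', 'U'],
          ['R', 'R', 'R', 'R', 'U'],
          ['R', 'R', 'R', 'R', 'U'],
          ['R', 'R', 'R', 'R', 'U']]

def take_action(i, j, action):
    # Hypothetical transition function: 'R' = right, 'U' = up, 'S' = stay.
    if action == 'R':
        return i, min(j + 1, GRID - 1)
    if action == 'U':
        return max(i - 1, 0), j
    return i, j

def evaluate_policy(gamma=0.9, sweeps=1000):
    # Iterative policy evaluation: V(s) <- r(s) + gamma * V(s').
    V = np.zeros((GRID, GRID))
    for _ in range(sweeps):
        V_new = np.zeros_like(V)
        for i in range(GRID):
            for j in range(GRID):
                ni, nj = take_action(i, j, policy[i][j])
                V_new[i, j] = reward[i, j] + gamma * V[ni, nj]
        V = V_new
    return V

V = evaluate_policy()
print(np.round(V, 4))
```

Run to convergence, this gives 10 in the top-right cell and 10 * 0.9**8 (about 4.3047) in the bottom-left cell, which matches the table to within roughly 0.001.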

Reinforcement learning: TD(0) with function approximation

0 answers: