A reinforcement learning model with delayed rewards

Time: 2019-04-01 15:56:03

Tags: python python-3.x keras neural-network reinforcement-learning

I have simulated a game in Python. When the game is called, it plays a full round-robin between a random player, a decision-tree player, and an RL player. Whenever the learner needs to make a decision, the game calls the run_network() function, which returns an action. When a full game finishes, update_reward() runs.

My rewards are delayed; any number of states and actions can occur between rewards. Only positive rewards are appended to the pos_rewards list of states and actions, and entries are never removed.
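As a point of comparison, a common way to handle delayed rewards is to propagate a discounted return back to every state-action pair in the episode, instead of keeping only the positively rewarded pairs. A minimal sketch of that idea (the `gamma` value and the flat per-step reward list are assumptions for illustration, not from the code below):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.95):
    """Propagate a delayed reward back through one episode.

    rewards: per-step rewards, typically all zeros except a payoff at the end.
    Returns an array where each step receives its discounted future return.
    """
    returns = np.zeros(len(rewards), dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):  # walk the episode backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A 4-step episode whose only reward arrives on the final move:
print(discounted_returns([0, 0, 0, 1.0]))
```

Training on these returns gives every earlier move a (smaller) share of the final win, rather than treating only the rewarded step as informative.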

At the end of each game I print a record of who won. Although each game takes about a minute and the games keep getting longer, the RL player has not yet won a single game.

from keras.models import model_from_json
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
import Game

def update_reward(pos_rewards):
    vec = np.asarray(pos_rewards)
    x = vec[:, 3:-1]
    y = vec[:, -1]
    # One-hot encode the 7 possible action labels (0..6)
    y_cat = np.eye(7)[y.astype(int)]

    model.fit(x, y_cat, batch_size=5000, epochs=1, verbose=0)
    model.save_weights(model_file)
    with open(arch_file, "w") as json_file:
        json_file.write(model.to_json())

def run_network(state):
    x = np.array(state[3:])
    x = x.reshape(1, 176)
    p = model.predict(x)          # shape (1, 7): one row of action scores
    action = p.argmax(axis=1)[0]  # argmax over the 7 actions, not the batch axis
    return action

arch_file = 'D:\\model\\rl_arch.json'
model_file = 'D:\\model\\rl_model.h5'
start_new = 0
if start_new == 1:
    model = Sequential()
    model.add(Dense(units=500, activation='relu', input_dim=176))
    ...
    model.add(Dense(units=25, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(7, activation='softmax'))
else:
    with open(arch_file, 'r') as json_file:
        model = model_from_json(json_file.read())
    model.load_weights(model_file)
model.compile(loss='categorical_crossentropy', optimizer='adam')
Game.tournament(10000)

Questions:

When I save the weights that were created and then decide to re-run the whole process, are my weights completely overwritten, or does each run improve the model?
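The answer hinges on whether the saved weights are loaded before the next round of fitting: a freshly built model starts from random weights, whereas loading first means each run resumes from the saved state, and the save at the end overwrites the file with the further-trained weights. A toy stand-in for this pattern (using a numpy array in place of a real Keras model, purely for illustration):

```python
import os
import tempfile
import numpy as np

def run_once(weights_path):
    """One 'training run': load previous weights if present, train, save."""
    if os.path.exists(weights_path):   # analogous to the start_new == 0 branch
        w = np.load(weights_path)      # resume from the saved weights
    else:
        w = np.zeros(3)                # fresh initialization
    w = w + 1.0                        # stand-in for one round of training
    np.save(weights_path, w)           # overwrites the file, but with the
    return w                           # further-trained weights

path = os.path.join(tempfile.mkdtemp(), "w.npy")
run_once(path)       # first run starts fresh
w = run_once(path)   # second run resumes, so progress accumulates
print(w)             # each element has now been "trained" twice
```

So in the posted code, re-running with start_new = 0 should continue from the previous run's weights; the file is overwritten, but with an improved model, not a reset one.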

Are there any statistical techniques I could use to improve the model's performance without completely rewriting the code?

0 Answers:

There are no answers yet.