RL training: gradient function cannot evaluate the loss function

Time: 2020-07-05 16:23:34

Tags: matlab reinforcement-learning agent

I am trying to set up a custom reinforcement learning environment with multiple agents, each of which has its own policy network, and I am stuck at the training stage (trying to take an approach similar to this example).

My policy network takes an array of size 21 as input and outputs one element from [-1, 0, 1].
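
For reference, here is a minimal sketch of the observation and action specifications such a network would correspond to; the explicit rlNumericSpec/rlFiniteSetSpec calls and the names used here are only illustrative assumptions, since in my code the specs actually come from getObservationInfo/getActionInfo on my custom environment:

% Illustrative specs only: a 21-element observation vector and the
% discrete action set {-1, 0, 1}; not taken from the actual environment.
obsInfo = rlNumericSpec([21 1]);
obsInfo.Name = 'state';              % assumed name
actInfo = rlFiniteSetSpec([-1 0 1]);
actInfo.Name = 'action';             % assumed name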

I have the following code (I condensed the multi-file code into a single file; sorry for the mess):

clear
close all

%% Model parameters
T_init = 0;
T_final = 100;
dt = 1;

rng("shuffle")

baseEnv = baseEnvironment();
p1_pos = randi(baseEnv.L,1);
p2_pos = randi(baseEnv.L,1);
while p1_pos == p2_pos
    p2_pos = randi(baseEnv.L,1);
end

rng("shuffle")

baseEnv = baseEnvironment();
% validateEnvironment(baseEnv)
p1_pos = randi(baseEnv.L,1);
p2_pos = randi(baseEnv.L,1);
while p1_pos == p2_pos
    p2_pos = randi(baseEnv.L,1);
end

agent1 = IMAgent(baseEnv, p1_pos, 1, 'o');
agent2 = IMAgent(baseEnv, p2_pos, 2, 'x');
listOfAgents = [agent1; agent2];
multiAgentEnv = multiAgentEnvironment(listOfAgents);

%
actInfo = getActionInfo(baseEnv);
obsInfo = getObservationInfo(baseEnv);

%% Build agent 1
actorNetwork = [imageInputLayer([obsInfo.Dimension(1) 1 1],'Normalization','none','Name','state')
                fullyConnectedLayer(24,'Name','fc1')
                reluLayer('Name','relu1')
                fullyConnectedLayer(24,'Name','fc2')
                reluLayer('Name','relu2')
                fullyConnectedLayer(numel(actInfo.Elements),'Name','output')
                softmaxLayer('Name','actionProb')];
actorOpts = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlStochasticActorRepresentation(actorNetwork,...
    obsInfo,actInfo,'Observation','state',actorOpts);
actor = setLoss(actor, @actorLossFunction);
%obj.brain = rlPGAgent(actor,baseline,agentOpts);
agentOpts = rlPGAgentOptions('UseBaseline',false, 'DiscountFactor', 0.99);
agent1.brain = rlPGAgent(actor,agentOpts);
%% Build agent 2
actorNetwork = [imageInputLayer([obsInfo.Dimension(1) 1 1],'Normalization','none','Name','state')
                fullyConnectedLayer(24,'Name','fc1')
                reluLayer('Name','relu1')
                fullyConnectedLayer(24,'Name','fc2')
                reluLayer('Name','relu2')
                fullyConnectedLayer(numel(actInfo.Elements),'Name','output')
                softmaxLayer('Name','actionProb')];
actorOpts = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlStochasticActorRepresentation(actorNetwork,...
    obsInfo,actInfo,'Observation','state',actorOpts);
actor = setLoss(actor, @actorLossFunction);
%obj.brain = rlPGAgent(actor,baseline,agentOpts);
agentOpts = rlPGAgentOptions('UseBaseline',false, 'DiscountFactor', 0.99);
agent2.brain = rlPGAgent(actor,agentOpts);
%%

averageGrad = [];
averageSqGrad = [];
learnRate = 0.05;
gradDecay = 0.75;
sqGradDecay = 0.95;
numOfEpochs = 1;

numEpisodes = 5000;
maxStepsPerEpisode = 250;
discountFactor = 0.995;
aveWindowSize = 100;
trainingTerminationValue = 220;



loss_history = [];
for i = 1:numOfEpochs
    action_hist = [];
    reward_hist = [];
    observation_hist = [multiAgentEnv.baseEnv.state];
    for t = T_init:1:T_final
        actionList = multiAgentEnv.act();
        [observation, reward, multiAgentEnv.isDone, ~] = multiAgentEnv.step(actionList);

        if t == T_final
            multiAgentEnv.isDone = true;
        end
        
        action_hist = cat(3, action_hist, actionList);
        reward_hist = cat(3, reward_hist, reward);
        if multiAgentEnv.isDone == true
            break
        else
            observation_hist = cat(3, observation_hist, observation);
        end
    end
    if size(observation_hist,3) ~= size(action_hist,3)
        print("gi")
    end
    clear observation reward
    actor = getActor(agent1.brain);        
    batchSize = min(t,maxStepsPerEpisode);

    observations = observation_hist;
    actions = action_hist(1,:,:);
    rewards = reward_hist(1,:,:);
    
    observationBatch = permute(observations(:,:,1:batchSize), [2,1,3]);
    actionBatch = actions(:,:,1:batchSize);
    rewardBatch = rewards(:,1:batchSize);
    
    
    discountedReturn = zeros(1,int32(batchSize));
    for t = 1:batchSize
        G = 0;
        for k = t:batchSize
            G = G + discountFactor ^ (k-t) * rewardBatch(k);
        end
        discountedReturn(t) = G;
    end
    
    lossData.batchSize = batchSize;
    lossData.actInfo = actInfo;
    lossData.actionBatch = actionBatch;
    lossData.discountedReturn = discountedReturn;
    
    % 6. Compute the gradient of the loss with respect to the policy
    % parameters.
    actorGradient = gradient(actor,'loss-parameters', {observationBatch},lossData);
    
    
    p1_pos = randi(baseEnv.L,1);
    p2_pos = randi(baseEnv.L,1);
    while p1_pos == p2_pos
        p2_pos = randi(baseEnv.L,1);
    end
    multiAgentEnv.reset([p1_pos; p2_pos]);
end


function loss = actorLossFunction(policy, lossData)

    % Create the action indication matrix.
    batchSize = lossData.batchSize;
    Z = repmat(lossData.actInfo.Elements',1,batchSize);
    actionIndicationMatrix = lossData.actionBatch(:,:) == Z;

    % Resize the discounted return to the size of policy.
    G = actionIndicationMatrix .* lossData.discountedReturn;
    G = reshape(G,size(policy));

    % Round any policy values less than eps to eps.
    policy(policy < eps) = eps;

    % Compute the loss.
    loss = -sum(G .* log(policy),'all');
end

When I run the code, I get the following error:

Error using rl.representation.rlAbstractRepresentation/gradient (line 181)
Unable to compute gradient from representation.

Error in main1 (line 164)
actorGradient = gradient(actor,'loss-parameters', {observationBatch},lossData);

Caused by:
    Unable to evaluate the loss function. Check the loss function and ensure it runs successfully.
        Reference to non-existent field 'Advantage'.

I also tried running the example from the link; it works, but my code does not. I set a breakpoint inside the loss function, but it is never hit during the gradient computation, and from the error message I suspect this is where the problem lies. Yet when I run the example code from the MathWorks site, everything works fine.
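
To rule out the loss function itself, here is a minimal standalone check of actorLossFunction with made-up data (the dummy sizes, the plain struct standing in for actInfo, and the toy values are assumptions, not taken from my environment):

% Sanity check of actorLossFunction in isolation: 3 possible actions,
% a batch of 4 steps; all values are dummies.
dummyPolicy = rand(3,4);
dummyPolicy = dummyPolicy ./ sum(dummyPolicy,1);      % valid column-wise probabilities
dummyLossData.batchSize = 4;
dummyLossData.actInfo.Elements = [-1 0 1];            % plain struct standing in for actInfo
dummyLossData.actionBatch = reshape([-1 0 1 1],1,1,4);
dummyLossData.discountedReturn = [1 0.9 0.8 0.7];
testLoss = actorLossFunction(dummyPolicy, dummyLossData)

This only verifies that the function runs on plausible inputs; it does not reproduce the failing gradient call.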

0 Answers:

There are no answers yet.