I am trying to set up a custom reinforcement learning environment with multiple agents, each of which has its own policy network, and I am stuck at the training step (I am trying to take an approach similar to this example).
My policy network takes an array of size 21 as input and outputs one element from [-1, 0, 1].
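For reference, baseEnvironment itself is not included below; its observation and action specs should be equivalent to something along these lines (a sketch only; the real specs are built inside baseEnvironment):
obsInfo = rlNumericSpec([21 1]);     % 21-element state vector
actInfo = rlFiniteSetSpec([-1 0 1]); % a single action chosen from {-1, 0, 1}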
I have the following code (I condensed the multi-file code into a single file; sorry for the mess):
clear
close all
%% Model parameters
T_init = 0;
T_final = 100;
dt = 1;
rng("shuffle")
baseEnv = baseEnvironment();
% validateEnvironment(baseEnv)
p1_pos = randi(baseEnv.L,1);
p2_pos = randi(baseEnv.L,1);
while p1_pos == p2_pos
    p2_pos = randi(baseEnv.L,1);
end
agent1 = IMAgent(baseEnv, p1_pos, 1, 'o');
agent2 = IMAgent(baseEnv, p2_pos, 2, 'x');
listOfAgents = [agent1; agent2];
multiAgentEnv = multiAgentEnvironment(listOfAgents);
%
actInfo = getActionInfo(baseEnv);
obsInfo = getObservationInfo(baseEnv);
%% Build agent 1
actorNetwork = [imageInputLayer([obsInfo.Dimension(1) 1 1],'Normalization','none','Name','state')
fullyConnectedLayer(24,'Name','fc1')
reluLayer('Name','relu1')
fullyConnectedLayer(24,'Name','fc2')
reluLayer('Name','relu2')
fullyConnectedLayer(numel(actInfo.Elements),'Name','output')
softmaxLayer('Name','actionProb')];
actorOpts = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlStochasticActorRepresentation(actorNetwork,...
obsInfo,actInfo,'Observation','state',actorOpts);
actor = setLoss(actor, @actorLossFunction);
%obj.brain = rlPGAgent(actor,baseline,agentOpts);
agentOpts = rlPGAgentOptions('UseBaseline',false, 'DiscountFactor', 0.99);
agent1.brain = rlPGAgent(actor,agentOpts);
%% Build agent 2
actorNetwork = [imageInputLayer([obsInfo.Dimension(1) 1 1],'Normalization','none','Name','state')
fullyConnectedLayer(24,'Name','fc1')
reluLayer('Name','relu1')
fullyConnectedLayer(24,'Name','fc2')
reluLayer('Name','relu2')
fullyConnectedLayer(numel(actInfo.Elements),'Name','output')
softmaxLayer('Name','actionProb')];
actorOpts = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlStochasticActorRepresentation(actorNetwork,...
obsInfo,actInfo,'Observation','state',actorOpts);
actor = setLoss(actor, @actorLossFunction);
%obj.brain = rlPGAgent(actor,baseline,agentOpts);
agentOpts = rlPGAgentOptions('UseBaseline',false, 'DiscountFactor', 0.99);
agent2.brain = rlPGAgent(actor,agentOpts);
%%
averageGrad = [];
averageSqGrad = [];
learnRate = 0.05;
gradDecay = 0.75;
sqGradDecay = 0.95;
numOfEpochs = 1;
numEpisodes = 5000;
maxStepsPerEpisode = 250;
discountFactor = 0.995;
aveWindowSize = 100;
trainingTerminationValue = 220;
loss_history = [];
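% Custom training loop: roll out one episode per epoch, record the action,
% reward, and observation histories, then compute a custom policy gradient
% for agent 1 from the collected batch.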
for i = 1:numOfEpochs
    action_hist = [];
    reward_hist = [];
    observation_hist = [multiAgentEnv.baseEnv.state];
    for t = T_init:1:T_final
        actionList = multiAgentEnv.act();
        [observation, reward, multiAgentEnv.isDone, ~] = multiAgentEnv.step(actionList);
        if t == T_final
            multiAgentEnv.isDone = true;
        end
        action_hist = cat(3, action_hist, actionList);
        reward_hist = cat(3, reward_hist, reward);
        if multiAgentEnv.isDone == true
            break
        else
            observation_hist = cat(3, observation_hist, observation);
        end
    end
    if size(observation_hist,3) ~= size(action_hist,3)
        disp("observation/action history size mismatch")
    end
    clear observation reward
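    % Pull the actor back out of the PG agent so the custom gradient can be
    % computed on the representation.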
    actor = getActor(agent1.brain);
    batchSize = min(t,maxStepsPerEpisode);
    observations = observation_hist;
    actions = action_hist(1,:,:);
    rewards = reward_hist(1,:,:);
    observationBatch = permute(observations(:,:,1:batchSize), [2,1,3]);
    actionBatch = actions(:,:,1:batchSize);
    rewardBatch = rewards(:,1:batchSize);
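    % Monte Carlo (REINFORCE) return for each step: G_t = sum_k gamma^(k-t) * r_k.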
    discountedReturn = zeros(1,int32(batchSize));
    for t = 1:batchSize
        G = 0;
        for k = t:batchSize
            G = G + discountFactor ^ (k-t) * rewardBatch(k);
        end
        discountedReturn(t) = G;
    end
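    % Bundle everything the custom loss function needs.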
    lossData.batchSize = batchSize;
    lossData.actInfo = actInfo;
    lossData.actionBatch = actionBatch;
    lossData.discountedReturn = discountedReturn;
    % 6. Compute the gradient of the loss with respect to the policy
    % parameters.
    actorGradient = gradient(actor,'loss-parameters', {observationBatch},lossData);
    p1_pos = randi(baseEnv.L,1);
    p2_pos = randi(baseEnv.L,1);
    while p1_pos == p2_pos
        p2_pos = randi(baseEnv.L,1);
    end
    multiAgentEnv.reset([p1_pos; p2_pos]);
end
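% REINFORCE loss: negative log-probability of the actions actually taken,
% weighted by the discounted return.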
function loss = actorLossFunction(policy, lossData)
    % Create the action indication matrix.
    batchSize = lossData.batchSize;
    Z = repmat(lossData.actInfo.Elements',1,batchSize);
    actionIndicationMatrix = lossData.actionBatch(:,:) == Z;
    % Resize the discounted return to the size of policy.
    G = actionIndicationMatrix .* lossData.discountedReturn;
    G = reshape(G,size(policy));
    % Round any policy values less than eps to eps.
    policy(policy < eps) = eps;
    % Compute the loss.
    loss = -sum(G .* log(policy),'all');
end
When I run the code, I get the following error:
Error using rl.representation.rlAbstractRepresentation/gradient (line 181)
Unable to compute gradient from representation.
Error in main1 (line 164)
actorGradient = gradient(actor,'loss-parameters', {observationBatch},lossData);
Caused by:
    Unable to evaluate the loss function. Check the loss function and ensure it runs successfully.
    Reference to non-existent field 'Advantage'.
I also tried running the example from the link, and it works, but my code does not. I set a breakpoint inside the loss function, but it is never hit during the gradient computation, which, given the error message, I suspect is the problem; yet when I run the example code from the MathWorks site, everything works fine.
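For comparison, this is roughly the sequence of calls I believe the linked example uses versus what my code does (simplified from memory, so it may not match the example exactly); the example computes the gradient on the same actor that setLoss was applied to, while I wrap the actor in an rlPGAgent and pull it back out with getActor first:
% Linked example, as I understand it (simplified, possibly inexact):
actor = rlStochasticActorRepresentation(actorNetwork, obsInfo, actInfo, ...
    'Observation','state', actorOpts);
actor = setLoss(actor, @actorLossFunction);
% ... collect an episode, build observationBatch and lossData ...
actorGradient = gradient(actor, 'loss-parameters', {observationBatch}, lossData);

% My code: the actor is wrapped in an rlPGAgent and extracted again before
% the gradient call.
agent1.brain = rlPGAgent(actor, agentOpts);
actor = getActor(agent1.brain);
actorGradient = gradient(actor, 'loss-parameters', {observationBatch}, lossData);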