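I have been trying to implement the algorithm described here, and then test it on the "large action task" described in the same paper.
Overview of the algorithm:
In brief, the algorithm uses an RBM of the form shown below to solve reinforcement learning problems by changing its weights so that the negative free energy of a network configuration matches the reward signal given for that state-action pair.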
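As a reference point, here is the free energy and the resulting update written out, assuming the standard case of an RBM with binary logistic hidden units at temperature 1. The symbols are illustrative labels rather than the paper's notation: b, c, w, ε and r correspond to visbiases, hidbiases, vishid, epsilonw and reward in the code further down. That code uses softmax hidden units (softmax, calcF_softmax2), so its exact expression may differ from this sketch.

```latex
\begin{aligned}
F(\mathbf{v}) &= -\sum_i b_i v_i - \sum_j \log\!\Big(1 + \exp\big(c_j + \sum_i v_i w_{ij}\big)\Big), \qquad \mathbf{v} = [\mathbf{s}, \mathbf{a}] \\
Q(\mathbf{s}, \mathbf{a}) &= -F(\mathbf{v}) \\
\frac{\partial Q}{\partial w_{ij}} &= v_i \, \sigma\!\Big(c_j + \sum_k v_k w_{kj}\Big) = v_i \langle h_j \rangle \\
\Delta w_{ij} &= \epsilon \, \big(r - Q(\mathbf{s}, \mathbf{a})\big) \, v_i \langle h_j \rangle
\end{aligned}
```

The last line is one gradient step on the squared error between the reward and Q, which has the same shape as the posprods*reward_error term in the weight update.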
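To select an action, the algorithm performs Gibbs sampling while holding the state variables fixed. Given enough time, this produces the action with the lowest free energy, and thus the highest reward for the given state.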
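The sampling loop described above can be sketched as follows. This is a minimal illustration, not the paper's procedure and not my choose_factored_action: it assumes binary logistic hidden units and a temperature strictly greater than zero, and every name in it (gibbs_action_sketch, W_s, W_a, b_a, c) is hypothetical.

```matlab
% Sketch only: binary logistic hidden units are an assumption, and all
% names here (gibbs_action_sketch, W_s, W_a, b_a, c) are hypothetical.
function action = gibbs_action_sketch(state, W_s, W_a, b_a, c, nsteps, temp)
    % state: 1 x numdims binary row vector (clamped throughout)
    % W_s:   numdims x numhid state-to-hidden weights
    % W_a:   numactiondims x numhid action-to-hidden weights
    % b_a:   1 x numactiondims action biases
    % c:     1 x numhid hidden biases
    % temp:  sampling temperature, must be > 0 in this sketch
    sigmoid = @(x) 1 ./ (1 + exp(-x));
    action = double(rand(size(b_a)) > 0.5);   % random initial action
    state_input = state * W_s + c;            % constant, since the state is clamped
    for k = 1:nsteps
        % Sample hidden units given the clamped state and the current action.
        hid = double(sigmoid((state_input + action * W_a) / temp) > rand(size(c)));
        % Resample only the action units; the state units are never resampled.
        action = double(sigmoid((hid * W_a' + b_a) / temp) > rand(size(b_a)));
    end
end
```

With nsteps = 100 this would correspond to the 100 Gibbs iterations per action selection quoted from the paper below. Lowering temp sharpens the conditional distributions, so the chain concentrates on low free energy actions.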
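Overview of the large action task:
Overview of the authors' guidelines for implementation:
A restricted Boltzmann machine with 13 hidden variables was trained on an instantiation of the large action task with a 12-bit state space and a 40-bit action space. Thirteen key states were randomly selected. The network was run for 12,000 iterations, with the learning rate going from 0.1 to 0.01 and the temperature going from 1.0 to 0.1 exponentially over the course of training. Each iteration was initialized with a random state. Each action selection consisted of 100 iterations of Gibbs sampling.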
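The quoted text does not spell out the exact form of the schedule. One plausible reading, shown here as an assumption, is a geometric interpolation that hits the start value on the first iteration and the end value on the last. The helper anneal and the *_start / *_end names are hypothetical; epsilonw and temp are the names used in the code below.

```matlab
% Assumed schedule: geometric (exponential) interpolation between a start
% and an end value. The helper name "anneal" is hypothetical.
numiters   = 12000;                  % from the quoted guidelines
eps_start  = 0.1;  eps_end  = 0.01;  % learning rate range
temp_start = 1.0;  temp_end = 0.1;   % temperature range

% anneal(...) returns v_start at t = 1 and v_end at t = numiters.
anneal = @(v_start, v_end, t) ...
    v_start .* (v_end ./ v_start) .^ ((t - 1) ./ (numiters - 1));

t = [1, 6000, numiters];             % a few sample iterations
epsilonw = anneal(eps_start, eps_end, t);
temp     = anneal(temp_start, temp_end, t);
```

Inside the training loop, the same anneal call with t = batch would give the per-iteration values of epsilonw and temp.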
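Important omitted details:
My implementation:
I initially assumed that the authors used no mechanisms other than those described in the guidelines, so I tried training the network without bias units. This led to near-chance performance, and was my first clue that some of the mechanisms used must have been deemed "obvious" by the authors and therefore omitted.
I experimented with the various omitted mechanisms mentioned above, and got my best results by using:
But even with all of these modifications, my performance on the task was generally an average reward of around 28 after 12,000 iterations.
Code for each iteration: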
%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
data = [batchdata(:,:,(batch)) rand(1,numactiondims)>.5];
poshidprobs = softmax(data*vishid + hidbiases);
%%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
hidstates = softmax_sample(poshidprobs);
%%%%%%%%% START ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if test
    [negaction poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,0);
else
    [negaction poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,temp);
end
data(numdims+1:end) = negaction > rand(numcases,numactiondims);
if mod(batch,100) == 1
    disp(poshidprobs);
    disp(min(~xor(repmat(correct_action(:,(batch)),1,size(key_actions,2)), key_actions(:,:))));
end
posprods = data' * poshidprobs;
poshidact = poshidprobs;
posvisact = data;
%%%%%%%%% END OF ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if batch>5
    momentum=.9;
else
    momentum=.5;
end
%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
F = calcF_softmax2(data,vishid,hidbiases,visbiases,temp);
Q = -F;
action = data(numdims+1:end);
reward = maxreward - sum(abs(correct_action(:,(batch))' - action));
if correct_action(:,(batch)) == correct_action(:,1)
    reward_dataA = [reward_dataA reward];
    Q_A = [Q_A Q];
else
    reward_dataB = [reward_dataB reward];
    Q_B = [Q_B Q];
end
reward_error = sum(reward - Q);
rewardsum = rewardsum + reward;
errsum = errsum + abs(reward_error);
error_data(ind) = reward_error;
reward_data(ind) = reward;
Q_data(ind) = Q;
vishidinc = momentum*vishidinc + ...
epsilonw*( (posprods*reward_error)/numcases - weightcost*vishid);
visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*((posvisact)*reward_error - weightcost*visbiases);
hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*((poshidact)*reward_error - weightcost*hidbiases);
vishid = vishid + vishidinc;
hidbiases = hidbiases + hidbiasinc;
visbiases = visbiases + visbiasinc;
%%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
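What I'm asking for:
So, if any of you can get this algorithm to work properly (the authors claim an average reward of ~40 after 12,000 iterations), I'd be extremely grateful.
If my code appears to be doing something obviously wrong, then pointing that out would also make a great answer.
I'm hoping that what the authors left out is indeed obvious to someone with more experience in energy-based learning than I have, in which case simply point out what needs to be included in a working implementation.
Answer 0 (score: 1)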