How to implement Q-learning in R?

Time: 2016-09-06 16:17:33

Tags: r q-learning

I am learning about Q-learning and have found a Wikipedia post and a website.

Based on the tutorial and the pseudocode, I have written this much in R:
#q-learning example
#http://mnemstudio.org/path-finding-q-learning-tutorial.htm
#https://en.wikipedia.org/wiki/Q-learning

set.seed(2016)
iter=100
dimension=5;
alpha=0.1 #learning rate
gamma=0.8 #exploration/ discount factor

# n x n matrix
Q=matrix( rep( 0, len=dimension*dimension), nrow = dimension)
Q

# R -1 is fire pit,0 safe path and 100 Goal state########
R=matrix( sample( -1:0, dimension*dimension,replace=T,prob=c(1,2)), nrow = dimension)
R[dimension,dimension]=100
R #reward matrix
################

for(i in 1:iter){
  row=sample(1:dimension,1)
  col=sample(1:dimension,1)
  I=Q[row,col] #randomly choosing initial state

  Q[row,col]=Q[row,col]+alpha*(R[row,col]+gamma*max(Qdash-Q[row,col])
  #equation from wikipedia

}

But I am having trouble with max(Qdash-Q[row,col]), which according to that website should be Max[Q(next state, all actions)]. How do I programmatically search over all actions of the next state?

The second question is about this pseudocode:

Do While the goal state hasn't been reached.

Select one among all possible actions for the current state.
Using this possible action, consider going to the next state.
Get maximum Q value for this next state based on all possible actions.
Compute: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Set the next state as the current state.
End Do

Is it this?

while(Q<100){
Q[row,col]=Q[row,col]+alpha*(R[row,col]+gamma*max(Qdash-Q[row,col])
}

1 answer:

Answer 0 (score: 4):

This post is by no means a complete implementation of Q-learning in R. It is an attempt to answer the OP with regard to the description of the algorithm in the website linked in the post and in Wikipedia.

The assumption here is that the reward matrix R is as described in the website. Namely, it encodes the reward values for possible actions as non-negative numbers, and the -1's in the matrix represent null values (i.e., there is no possible action transitioning to that state).
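
In particular, this encoding makes the OP's question of searching all actions of a state a one-liner: the valid actions from a state are just the columns of that state's row of R that are not -1. A minimal sketch, assuming the 6-state reward matrix from the linked tutorial (the same matrix used in the test at the end of this answer):

## reward matrix from the linked tutorial (also used in the test at the end of this answer)
R <- matrix(c(-1,-1,-1,-1,0,-1,-1,-1,-1,0,-1,0,-1,-1,-1,0,-1,-1,-1,0,0,-1,0,-1,0,-1,-1,0,-1,0,-1,100,-1,-1,100,100), nrow=6)
## valid actions from, say, state 5 are the indices of the non -1 entries of row 5
which(R[5,] > -1)
## [1] 1 4 6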

With this setup, an R implementation of the Q update is:

Q[cs,ns] <- Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]) - Q[cs,ns])

where

  1. cs is the current state at the current point in the path.
  2. ns is the new state based on a (randomly) chosen action at the current state. This action is chosen from the collection of possible actions at the current state (i.e., those for which R[cs,] > -1). Since the state transition itself is deterministic here, the action is the transition to the new state.
  3. For this action resulting in ns, we want to add its maximum (future) value over all possible actions that can be taken at ns. This is the so-called Max[Q(next state, all actions)] term in the linked website and the "estimate of optimal future value" in Wikipedia. To compute it, we maximize over the ns-th row of Q, but consider only the columns of Q for which the corresponding columns of R[ns,] are valid actions (i.e., for which R[ns,] > -1). Therefore, this term is:

    max(Q[ns, which(R[ns,] > -1)])

    An interpretation of this value is a one-step look-ahead value, or an estimate of the cost-to-go in dynamic programming.

  4. The equation in the linked website is the special case in which the learning rate alpha is 1. We can view the equation in Wikipedia as:

    Q[cs,ns] <- (1-alpha)*Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]))

    where alpha "interpolates" between the old value Q[cs,ns] and the learned value R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]). As noted in Wikipedia:

      In fully deterministic environments, a learning rate of alpha = 1 is optimal.

  5. Putting it all together into a function:

    q.learn <- function(R, N, alpha, gamma, tgt.state) {
      ## initialize Q to be a zero matrix of the same size as R
      Q <- matrix(rep(0, length(R)), nrow=nrow(R))
      ## loop over episodes
      for (i in 1:N) {
        ## for each episode, choose an initial state at random
        cs <- sample(1:nrow(R), 1)
        ## iterate until we reach tgt.state
        while (1) {
          ## choose the next state from the possible actions at the current state
          ## Note: if there is only one possible action, choose it;
          ## otherwise, choose one at random
          next.states <- which(R[cs,] > -1)
          if (length(next.states) == 1)
            ns <- next.states
          else
            ns <- sample(next.states, 1)
          ## this is the update
          Q[cs,ns] <- Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]) - Q[cs,ns])
          ## break out of the while loop if the target state is reached;
          ## otherwise, set the next state as the current state and repeat
          if (ns == tgt.state) break
          cs <- ns
        }
      }
      ## return the resulting Q normalized by its maximum value
      return(100*Q/max(Q))
    }

    where the input parameters are:

    1. R is the reward matrix as defined in the blog.
    2. N is the number of episodes to iterate over.
    3. alpha is the learning rate.
    4. gamma is the discount factor.
    5. tgt.state is the target state of the problem.
  6. Using the example in the linked website as a test:

    N <- 1000
    alpha <- 1
    gamma <- 0.8
    tgt.state <- 6
    R <- matrix(c(-1,-1,-1,-1,0,-1,-1,-1,-1,0,-1,0,-1,-1,-1,0,-1,-1,-1,0,0,-1,0,-1,0,-1,-1,0,-1,0,-1,100,-1,-1,100,100), nrow=6)
    print(R)
    ##     [,1] [,2] [,3] [,4] [,5] [,6]
    ##[1,]   -1   -1   -1   -1    0   -1
    ##[2,]   -1   -1   -1    0   -1  100
    ##[3,]   -1   -1   -1    0   -1   -1
    ##[4,]   -1    0    0   -1    0   -1
    ##[5,]    0   -1   -1    0   -1  100
    ##[6,]   -1    0   -1   -1    0  100

    Q <- q.learn(R, N, alpha, gamma, tgt.state)
    print(Q)
    ##     [,1] [,2] [,3] [,4] [,5]      [,6]
    ##[1,]    0    0  0.0    0   80   0.00000
    ##[2,]    0    0  0.0   64    0 100.00000
    ##[3,]    0    0  0.0   64    0   0.00000
    ##[4,]    0   80 51.2    0   80   0.00000
    ##[5,]   64    0  0.0   64    0 100.00000
    ##[6,]    0   80  0.0    0   80  99.99994
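
For illustration only (not part of the linked tutorial's code), a greedy walk over the learned Q matrix recovers a shortest path to the goal; greedy.path below is a hypothetical helper sketched under the assumption that Q and tgt.state are as defined above:

greedy.path <- function(Q, start.state, tgt.state) {
  ## hypothetical helper: repeatedly take the action with the largest learned Q value
  ## (assumes Q is trained well enough that greedy moves eventually reach tgt.state)
  path <- start.state
  cs <- start.state
  while (cs != tgt.state) {
    cs <- which.max(Q[cs,])  # ties are broken by taking the first maximum
    path <- c(path, cs)
  }
  path
}

greedy.path(Q, 3, tgt.state)
## with the Q printed above: 3 4 2 6 (one of the shortest routes to the goal state 6)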