Question

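I am implementing a softmax action selection policy for a reinforcement learning task (http://www.incompleteideas.net/book/ebook/node17.html).

I came up with this solution, but I think there is room for improvement.

1 - Here I evaluate the probabilities: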

    prob_t = [0]*3
    denominator = 0
    for a in range(nActions):
        denominator += exp(Q[state][a] / temperature) 

    for a in range(nActions):
        prob_t[a] = (exp(Q[state][a]/temperature))/denominator

2 - Here I compare a random number generated in the range [0, 1) with the action probabilities:

    rand_action = random.random()
    if rand_action < prob_t[0]:
        action = 0      
    elif rand_action >= prob_t[0] and rand_action < prob_t[1]+prob_t[0]:
        action = 1      
    else: #if rand_action >= prob_t[1]+prob_t[0]
        action = 2

Edit:

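Example: rand_action is 0.78, prob_t[0] is 0.25, prob_t[1] is 0.35, and prob_t[2] is 0.4. The probabilities sum to 1. Since 0.78 is greater than the sum of the probabilities of actions 0 and 1 (prob_t[0] + prob_t[1] = 0.6), action 2 is selected.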
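The example can be checked with a short sketch that generalizes the if/elif chain to any number of actions. The helper name `pick_action` is made up for illustration, and it takes the random draw as an argument only so that the example is reproducible; neither is part of the original code.

```python
from itertools import accumulate

def pick_action(prob_t, rand_action):
    """Return the first action whose cumulative probability exceeds rand_action.

    pick_action is a hypothetical helper, not part of the original code.
    """
    for action, threshold in enumerate(accumulate(prob_t)):
        if rand_action < threshold:
            return action
    # Guard against floating-point round-off in the cumulative sum.
    return len(prob_t) - 1

# Values from the example: the cumulative thresholds are 0.25, 0.6 and 1.0.
print(pick_action([0.25, 0.35, 0.4], 0.78))  # action 2
```

In real use, rand_action would come from random.random() as in step 2.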

Is there a more efficient way to do this?

Answer 1

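After evaluating the probability of each action, if you have a function that returns a weighted random choice, you can get the action you want with: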

action = weighted_choice(prob_t)

Although I'm not sure this is what you mean by "a better way".

weighted_choice can be something like this:

import random
def weighted_choice(weights):
    totals = []
    running_total = 0

    for w in weights:
        running_total += w
        totals.append(running_total)

    rnd = random.random() * running_total
    for i, total in enumerate(totals):
        if rnd < total:
            return i

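If you have many available actions, be sure to check out the binary search implementation in the article instead of the linear search above.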
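A binary search variant might look roughly like the sketch below. It is not the code from the article; the name `weighted_choice_bisect` is hypothetical, and it relies only on the standard library `bisect` module.

```python
import bisect
import random
from itertools import accumulate

def weighted_choice_bisect(weights):
    """Pick an index with probability proportional to its weight.

    Hypothetical variant of weighted_choice that uses binary search.
    """
    totals = list(accumulate(weights))   # running totals, non-decreasing
    rnd = random.random() * totals[-1]   # uniform in [0, total weight)
    # First index whose running total is strictly greater than rnd.
    index = bisect.bisect_right(totals, rnd)
    return min(index, len(weights) - 1)  # guard against round-off at the top end
```

Building the running totals is still linear in the number of actions, so the binary search pays off mainly when the totals are computed once and reused for many draws.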

Or, if you have access to numpy:

import numpy as np
def weighted_choice(weights):
    totals = np.cumsum(weights)
    norm = totals[-1]
    throw = np.random.rand()*norm
    return np.searchsorted(totals, throw)

Answer 2

With the numpy library it is easy to select an action based on its probability.

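import numpy as np
# prob_t: softmax probabilities of the actions (must sum to 1), not the raw Q-values
action = np.random.choice(len(prob_t), p=prob_t)  # index of the chosen action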

Answer 3

Following the suggestions to use numpy, I did some research and came up with this solution for the first part of the softmax implementation.

prob_t = [0,0,0]       #initialise
for a in range(nActions):
    prob_t[a] = np.exp(Q[state][a]/temperature)  #calculate numerators

#numpy matrix element-wise division for denominator (sum of numerators)
prob_t = np.true_divide(prob_t,sum(prob_t))

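This has one fewer for loop than my initial solution. The only drawback I can see is what appears to be reduced precision in the printed output.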

Using numpy:

[ 0.02645082  0.02645082  0.94709836]

Initial two-loop solution:

[0.02645082063629476, 0.02645082063629476, 0.9470983587274104]
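The difference between the two outputs is most likely only in how they are printed: NumPy shows 8 digits of precision by default when printing an array, while the values are still stored as 64-bit floats. A minimal sketch of a fully vectorized version follows, assuming made-up Q-values and temperature. The helper name `softmax_probs` is hypothetical, and the max-subtraction step is a common stability trick that is not in the code above.

```python
import numpy as np

def softmax_probs(q_values, temperature):
    """Softmax (Boltzmann) probabilities for one state's Q-values.

    softmax_probs is a hypothetical helper name. Subtracting the maximum
    before exponentiating avoids overflow and leaves the result
    mathematically unchanged; it is not in the original code.
    """
    scaled = np.asarray(q_values, dtype=float) / temperature
    exps = np.exp(scaled - scaled.max())
    return exps / exps.sum()

# Hypothetical Q-values and temperature, chosen only for illustration.
prob_t = softmax_probs([1.0, 1.0, 2.0], temperature=0.5)
print(prob_t)           # array printing is rounded to 8 digits by default
print(prob_t.tolist())  # the same values shown at full float64 precision
action = np.random.choice(len(prob_t), p=prob_t)  # sample an action index
```

This removes the remaining for loop as well, and the sampled index can be used directly as the action.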

Is there a better way than this to implement Softmax Action Selection for Reinforcement Learning?
