I am trying to implement, in Lasagne/Theano, an example of the inverting gradients method from DEEP REINFORCEMENT LEARNING IN PARAMETERIZED ACTION SPACE (Equation 11). Basically, what I want to do is make sure the output of the network lies within some specified bounds, in this case [-1, 1].
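As I understand it, Equation 11 rescales the gradient with respect to each bounded output p as

    ∇p ← ∇p · (p_max − p) / (p_max − p_min)   if ∇p suggests increasing p
    ∇p ← ∇p · (p − p_min) / (p_max − p_min)   otherwise

so with bounds [-1, 1] the two scale factors reduce to (1 − p)/2 and (p + 1)/2, which is what I try to compute below.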
I have been looking at the example given here that inverts the gradient, but at this point I'm stuck. I think the best place to perform this operation is in the gradient computation method, so I copied rmsprop and tried to edit the gradients before the updates are applied.
This is what I have so far:
from collections import OrderedDict

import numpy as np
import theano
import theano.tensor as T
from theano.ifelse import ifelse
import lasagne

def rmspropWithInvert(loss_or_grads, params, p, learning_rate=1.0, rho=0.9, epsilon=1e-6):
    clip = 2.0
    grads = lasagne.updates.get_or_compute_grads(loss_or_grads, params)
    # clip each gradient to [-clip, clip]
    grads = [theano.gradient.grad_clip(grad, -clip, clip) for grad in grads]

    # scale factor from Equation 11 for bounds [-1, 1]:
    # (1 - p)/2 if the gradient would increase p, (p - (-1))/2 otherwise
    a, p_ = T.scalars('a', 'p_')
    z_lazy = ifelse(T.gt(a, 0.0), (1.0 - p_) / 2.0, (p_ - (-1.0)) / 2.0)
    f_lazyifelse = theano.function([a, p_], z_lazy,
                                   mode=theano.Mode(linker='vm'))

    # compute the vector of scale factors to invert the gradients by --
    # this is where I am stuck: f_lazyifelse is a compiled function, so it
    # cannot be called on the symbolic grads, and a shared variable cannot
    # be written into by index like this
    ps = theano.shared(
        np.zeros((3, 1), dtype=theano.config.floatX),
        broadcastable=(False, True))
    for i in range(3):
        ps[i] = f_lazyifelse(grads[-1][i], p[i])

    # push the scale factors backwards through the computed gradients
    # (list.reverse() mutates in place and returns None, so I iterate
    # with reversed() and un-reverse at the end instead)
    grads2 = []
    for grad in reversed(grads):
        grads2.append(T.mul(ps, grad))
        ps = grad
    grads = grads2[::-1]

    print("Grad Update: " + str(grads[0]))

    updates = OrderedDict()
    # using a Theano constant to prevent upcasting of float32
    one = T.constant(1)
    for param, grad in zip(params, grads):
        value = param.get_value(borrow=True)
        accu = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                             broadcastable=param.broadcastable)
        accu_new = rho * accu + (one - rho) * grad ** 2
        updates[accu] = accu_new
        updates[param] = param - (learning_rate * grad /
                                  T.sqrt(accu_new + epsilon))
    return updates
Maybe someone more fluent in Theano/Lasagne will see the solution? Conceptually I think the computation is simple, but coding everything symbolically inside the update step has proven challenging for me. I am still getting used to Theano.
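To make the intent concrete, here is a minimal sketch of what I believe the fully symbolic version of the scaling step should look like, using an elementwise T.switch instead of a compiled ifelse function (the function name and the out_grad/p variable names are placeholders of mine, not from the paper):

    import theano.tensor as T

    def invert_gradient(out_grad, p, p_min=-1.0, p_max=1.0):
        # sketch of Equation 11: downscale the output gradient by the
        # remaining distance to whichever bound it is pushing toward
        width = p_max - p_min
        scale = T.switch(T.gt(out_grad, 0.0),
                         (p_max - p) / width,   # gradient pushes p up
                         (p - p_min) / width)   # gradient pushes p down
        return out_grad * scale

If something like this is right, I would expect to apply it once to the gradient at the output layer and let Theano backpropagate the scaled gradient, rather than multiplying every parameter gradient by ps as I do above. Is that the correct way to express it?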