To simplify the problem: when a dimension (or feature) has already been updated n times, the next time I see that feature I want the learning rate to be 1/n.
I came up with this code:
import numpy as np
import theano
import theano.tensor as T

def test_adagrad():
    # 20 features, 10-dimensional embedding; `times` counts past updates per feature
    embedding = theano.shared(value=np.random.randn(20, 10), borrow=True)
    times = theano.shared(value=np.ones((20, 1)))
    lr = T.dscalar()
    index_a = T.lvector()
    hist = times[index_a]
    # the cost only involves the rows selected by index_a
    cost = T.sum(theano.sparse_grad(embedding[index_a]))
    gradients = T.grad(cost, embedding)
    # scale the update of each selected row by 1 / (number of past updates)
    updates = [(embedding, embedding + lr * (1.0 / hist) * gradients)]
    ### The code that also increments `times` is omitted here (one possible sketch is given below) ###
    train = theano.function(inputs=[index_a, lr], outputs=cost, updates=updates)
    for i in range(10):
        print train([1, 2, 3], 0.05)
Theano does not raise any error, but the training result sometimes gives NaN. Does anyone know how to fix this?
Thanks for your help.
PS: I suspect the operations in the sparse space cause the problem, so I tried replacing * with theano.sparse.mul. That gave the same results I mentioned before.
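For what it's worth, a minimal sketch of one way the omitted counter update could look (this is an assumption about what was intended, not the original code): T.inc_subtensor bumps only the rows selected by index_a, so the counters and the embedding get updated in the same call:

# Sketch (assumption): update the per-feature counters together with the embedding.
updates = [
    (embedding, embedding + lr * (1.0 / hist) * gradients),
    (times, T.inc_subtensor(times[index_a], 1.0)),
]
train = theano.function(inputs=[index_a, lr], outputs=cost, updates=updates)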
Answer 0 (score: 8)
Perhaps you can use this example for implementation of adadelta and adapt it to derive your own AdaGrad version. Please update if it works :-)
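For reference, the standard AdaGrad rule that such a derivation should end up with (a textbook formula, not taken from the linked example) divides the base learning rate by the accumulated squared gradients of each parameter:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2} + \epsilon}\, g_t

where g_t is the current gradient of that parameter and \epsilon is a small constant for numerical stability (some implementations place \epsilon inside the square root instead).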
Answer 1 (score: 1)
I was looking for the same thing and ended up implementing it myself, in the style of the resource zuuz already pointed to, so this may help anyone else searching for it.
import numpy as np
import theano
import theano.tensor as T

def adagrad(lr, tparams, grads, inp, cost):
    # shared variables that store the current gradients
    gshared = [theano.shared(np.zeros_like(p.get_value(),
                                           dtype=theano.config.floatX),
                             name='%s_grad' % k)
               for k, p in tparams.iteritems()]
    grads_updates = zip(gshared, grads)
    # shared variables that store the running sum of all squared gradients
    hist_gshared = [theano.shared(np.zeros_like(p.get_value(),
                                                dtype=theano.config.floatX),
                                  name='%s_hist_grad' % k)
                    for k, p in tparams.iteritems()]
    rgrads_updates = [(rg, rg + T.sqr(g)) for rg, g in zip(hist_gshared, grads)]

    # first function: compute the cost and store the (squared) gradients
    f_grad_shared = theano.function(inp, cost,
                                    updates=grads_updates + rgrads_updates,
                                    on_unused_input='ignore')

    # second function: apply the actual update, scaling the base learning rate lr
    # by 1 / (sqrt of the accumulated squared gradients + epsilon)
    n = 1e-6
    updates = [(p, p - (lr / (T.sqrt(rg) + n)) * g)
               for p, g, rg in zip(tparams.values(), gshared, hist_gshared)]
    f_update = theano.function([lr], [], updates=updates, on_unused_input='ignore')
    return f_grad_shared, f_update
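A minimal usage sketch, assuming the function above has been defined and numpy/theano are imported as above; the toy parameter W, the toy cost, and the random batch are hypothetical placeholders, not something from the answer:

from collections import OrderedDict

# Hypothetical toy setup: one parameter matrix and a simple quadratic cost.
tparams = OrderedDict()
tparams['W'] = theano.shared(np.random.randn(5, 5).astype(theano.config.floatX),
                             name='W')
x = T.matrix('x')
cost = T.sum(T.dot(x, tparams['W']) ** 2)
grads = T.grad(cost, wrt=list(tparams.values()))
lr = T.scalar('lr')

f_grad_shared, f_update = adagrad(lr, tparams, grads, [x], cost)

for epoch in range(10):
    batch = np.random.randn(4, 5).astype(theano.config.floatX)
    f_grad_shared(batch)  # computes the cost and stores the (squared) gradients
    f_update(0.01)        # applies the AdaGrad-scaled parameter update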
Answer 2 (score: 1)
I find this implementation from Lasagne very concise and readable. You can use it pretty much as-is:
# In the full Lasagne function, `updates` is an OrderedDict and `learning_rate`
# and `epsilon` (a small stability constant) are arguments of the enclosing function.
for param, grad in zip(params, grads):
    value = param.get_value(borrow=True)
    # per-parameter accumulator of squared gradients
    accu = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                         broadcastable=param.broadcastable)
    accu_new = accu + grad ** 2
    updates[accu] = accu_new
    updates[param] = param - (learning_rate * grad /
                              T.sqrt(accu_new + epsilon))
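If Lasagne itself is available you don't even need to copy the loop; a rough sketch of calling the built-in helper directly (the tiny softmax network and the learning rate below are made-up placeholders, not part of the answer):

# Sketch: let Lasagne build the AdaGrad updates for a toy softmax classifier.
import theano
import theano.tensor as T
import lasagne

x = T.matrix('x')
y = T.ivector('y')
network = lasagne.layers.DenseLayer(
    lasagne.layers.InputLayer((None, 100), input_var=x),
    num_units=10, nonlinearity=lasagne.nonlinearities.softmax)

prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(prediction, y).mean()
params = lasagne.layers.get_all_params(network, trainable=True)

# lasagne.updates.adagrad implements the same accumulator loop shown above
updates = lasagne.updates.adagrad(loss, params, learning_rate=0.01)
train_fn = theano.function([x, y], loss, updates=updates)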