I am trying to understand how the loss_metric class in dlib computes its gradient. Here is the code (full version):
// It should be noted that the derivative of length(x-y) with respect
// to the x vector is the unit vector (x-y)/length(x-y). If you stare
// at the code below long enough you will see that it's just an
// application of this formula.
if (x_label == y_label)
{
    // Things with the same label should have distances < dist_thresh between
    // them. If not then we experience non-zero loss.
    if (d2 < dist_thresh-margin) // d2 - the distance between samples x and y
    {
        gm[r*temp.num_samples() + c] = 0;
    }
    else
    {
        // The whole objective function is multiplied by this to scale the loss
        // relative to the number of things in the mini-batch.
        // scale = 0.5/num_pos_samps;
        loss += scale*(d2 - (dist_thresh-margin));
        // r - x sample index, c - y sample index
        gm[r*temp.num_samples() + r] += scale/d2;
        gm[r*temp.num_samples() + c] = -scale/d2;
    }
}
else
{
    // Things with different labels should have distances > dist_thresh between
    // them. If not then we experience non-zero loss.
    if (d2 > dist_thresh+margin || d2 > neg_thresh)
    {
        gm[r*temp.num_samples() + c] = 0;
    }
    else
    {
        loss += scale*((dist_thresh+margin) - d2);
        // don't divide by zero (or a really small number)
        d2 = std::max(d2, 0.001f);
        gm[r*temp.num_samples() + r] -= scale/d2;
        gm[r*temp.num_samples() + c] = scale/d2;
    }
}
//...
// gemm - matrix multiplication
// grad - final gradient
// grad_mul - gm
// output_tensor - output tensor of the last layer
tt::gemm(0, grad, 1, grad_mul, false, output_tensor, false);
Let's look at the same-class loss (line 1030). I would expect the gradient of scale*(d2 - (dist_thresh-margin)) to equal the gradient of C1*(||X1 - X2|| - (C2 - C3)), where the CN are constants and the XN are output vectors, so the gradient with respect to X1 should be C1 and with respect to X2 it should be -C1. But that is not what the code does: a different computation happens at lines 1031, 1032, and 1056. The gradient for the different-class case is computed the same way (starting at line 1048).
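For reference, the hint in the first code comment, written out as a formula in my own notation (X1, X2 as above):

```latex
\frac{\partial}{\partial X_1}\,\lVert X_1 - X_2 \rVert \;=\; \frac{X_1 - X_2}{\lVert X_1 - X_2 \rVert}
```

That is, the derivative of the length is a unit vector in the direction X1 - X2, not a constant, which is presumably where my reasoning above goes wrong.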
Unfortunately, even the hint in the first comment didn't make it any clearer to me. I don't have enough experience to work this out myself, but I suspect someone a bit more experienced can show where I went wrong.
So, what gradient formula is being used here, and how is it derived?