I am trying to understand how the loss_metric class in dlib computes its gradient. Here is the code (full version):
// It should be noted that the derivative of length(x-y) with respect
// to the x vector is the unit vector (x-y)/length(x-y). If you stare
// at the code below long enough you will see that it's just an
// application of this formula.
if (x_label == y_label)
{
    // Things with the same label should have distances < dist_thresh between
    // them. If not then we experience non-zero loss.
    if (d2 < dist_thresh-margin) // d2 - the distance between samples x and y
    {
        gm[r*temp.num_samples() + c] = 0;
    }
    else
    {
        // The whole objective function is multiplied by this to scale the loss
        // relative to the number of things in the mini-batch.
        // scale = 0.5/num_pos_samps;
        loss += scale*(d2 - (dist_thresh-margin));
        // r - x sample index, c - y sample index
        gm[r*temp.num_samples() + r] += scale/d2;
        gm[r*temp.num_samples() + c] = -scale/d2;
    }
}
else
{
    // Things with different labels should have distances > dist_thresh between
    // them. If not then we experience non-zero loss.
    if (d2 > dist_thresh+margin || d2 > neg_thresh)
    {
        gm[r*temp.num_samples() + c] = 0;
    }
    else
    {
        loss += scale*((dist_thresh+margin) - d2);
        // don't divide by zero (or a really small number)
        d2 = std::max(d2, 0.001f);
        gm[r*temp.num_samples() + r] -= scale/d2;
        gm[r*temp.num_samples() + c] = scale/d2;
    }
}
//...
// gemm - matrix multiplication
// grad - final gradient
// grad_mul - gm
// output_tensor - output tensor of the last layer
tt::gemm(0, grad, 1, grad_mul, false, output_tensor, false);
Let's look at the same-class loss (line 1030). I would expect the gradient of scale*(d2 - (dist_thresh-margin)) to equal the gradient of C1*(||X1 - X2|| - (C2 - C3)), where the CN are constants and the XN are output vectors, so the gradient with respect to X1 should be C1 and with respect to X2 it should be -C1. But that is not what the code does: a different computation happens at lines 1031, 1032, and 1056. The gradient for the different-class case is computed the same way (starting at line 1048).
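For reference, the hint in the first code comment, written out as a formula in my own notation (X1, X2 as above):

```latex
\frac{\partial}{\partial X_1}\,\lVert X_1 - X_2 \rVert \;=\; \frac{X_1 - X_2}{\lVert X_1 - X_2 \rVert}
```

That is, the derivative of the length is a unit vector in the direction X1 - X2, not a constant, which is presumably where my reasoning above goes wrong.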
Unfortunately, even the hint in the first comment didn't make it any clearer to me. I don't have enough experience to work this out myself, but I suspect someone a bit more experienced can show where I went wrong.
So, what gradient formula is being used here, and how is it derived?