Question

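I am trying to implement a Caffe Softmax layer with a "temperature" parameter. I am implementing a network that uses the distillation technique outlined here.

Essentially, I would like my Softmax layer to use the softmax-with-temperature function, as follows: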

F(X) = exp(zi(X)/T) / sum(exp(zl(X)/T))
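For reference, the formula above can be sketched in plain Python, outside of Caffe. This is only an illustration: the function name `softmax_with_temperature` and the logit values are hypothetical, not part of any Caffe API.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum before exponentiating avoids overflow and
    # leaves the resulting probabilities unchanged.
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 4.0, 0.5, 1.0]  # hypothetical logits z_i(X)
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=40.0)
print(sharp)
print(soft)
```

With a larger T the scaled logits move closer together, so the output should be closer to uniform than at T = 1.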

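Using this, I want to be able to tweak the temperature T before training. I found a similar question, but that question attempts to implement Softmax with temperature on the deploy network. I am struggling to implement the additional Scale layer described as "option 4" in the first answer.

I am using the cifar10_full_train_test prototxt file found in Caffe's examples directory. I have tried making the following change:

Original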

...
...
...
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip1"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip1"
  bottom: "label"
  top: "loss"
}

Modified

...
...
...
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip1"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  type: "Scale"
  name: "temperature"
  top: "zi/T"
  bottom: "ip1"
  scale_param {
    filler: { type: 'constant' value: 0.025 } ### I wanted T = 40, so 1/40=.025
  }
  param { lr_mult: 0 decay_mult: 0 }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip1"
  bottom: "label"
  top: "loss"
}

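After a quick training run (5,000 iterations), I checked whether my classification probabilities appeared more even, but they actually appeared to be less evenly distributed.

Example:

High temperature T: F(X) = [0.2, 0.5, 0.1, 0.2]

Low temperature T: F(X) = [0.02, 0.95, 0.01, 0.02]

~My attempt: F(X) = [0, 1.0, 0, 0]

Does this implementation seem to be on the right track? Either way, what am I missing?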

Answer 1

You are not using the "cooled" predictions "zi/T" that your "Scale" layer produces.

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "zi/T"  # Use the "cooled" predictions instead of the originals.
  bottom: "label"
  top: "loss"
}
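To see why the choice of bottom blob matters, here is a minimal sketch in plain Python (not Caffe code) of the softmax cross-entropy loss evaluated on the raw and on the scaled scores. The helper name and the numbers are hypothetical; only the 0.025 factor comes from the question.

```python
import math

def softmax_cross_entropy(logits, label):
    """Return -log(softmax(logits)[label]), computed via log-sum-exp."""
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum - logits[label]

ip1 = [1.0, 4.0, 0.5, 1.0]   # hypothetical raw scores (the "ip1" blob)
label = 1                    # hypothetical ground-truth class
inv_T = 0.025                # 1/T for T = 40, as in the question

cooled = [inv_T * z for z in ip1]  # what the Scale layer would output ("zi/T")

loss_on_ip1 = softmax_cross_entropy(ip1, label)        # bottom: "ip1"
loss_on_cooled = softmax_cross_entropy(cooled, label)  # bottom: "zi/T"
print(loss_on_ip1, loss_on_cooled)
```

The first value does not depend on `inv_T` at all, which mirrors a loss layer wired to "ip1": the Scale layer's output is computed but never reaches the loss.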

Answer 2

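The accepted answer helped me understand my misconceptions about the Softmax temperature implementation.

As @Shai pointed out, in order to observe the "cooled" probability outputs as I expected, the Scale layer must only be added to the "deploy" prototxt file. There is no need to include the Scale layer in the train/val prototxt at all. In other words, the temperature must be applied to the Softmax layer, not to the SoftmaxWithLoss layer.

If you want to apply the "cooled" effect to your probability vector, simply make sure that your last two layers are as follows:

deploy.prototxt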

layer {
  type: "Scale"
  name: "temperature"
  top: "zi/T"
  bottom: "ip1"
  scale_param {
    filler: { type: 'constant' value: 1/T } ## Replace "1/T" with actual 1/T value
  }
  param { lr_mult: 0 decay_mult: 0 }
}
layer {
  name: "prob"
  type: "Softmax"
  bottom: "zi/T"
  top: "prob"
}
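One way to check that a deploy-time Scale layer has the intended effect is to measure how uniform the resulting "prob" vector is. The sketch below does this in plain Python using Shannon entropy. The logits are hypothetical stand-ins for the output of a network trained at T = 1, and none of the names are Caffe APIs.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    max_value = max(values)
    exps = [math.exp(v - max_value) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats: 0 for one-hot, log(n) for uniform."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

ip1 = [2.0, 12.0, 1.0, 2.0]  # hypothetical logits from a net trained at T = 1

entropies = {}
for T in (1.0, 5.0, 40.0):
    prob = softmax([z / T for z in ip1])  # Scale by 1/T, then Softmax
    entropies[T] = entropy(prob)
    print(T, [round(p, 3) for p in prob], round(entropies[T], 3))
```

Entropy should grow with T and approach log(4) for four classes, which gives a single number to compare instead of eyeballing the probability vectors.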

My confusion was due primarily to my misunderstanding of the difference between SoftmaxWithLoss and Softmax.

Caffe: Adding Softmax temperature using a Scale layer
