In this case, the activation function does not depend on the scores of classes in \(C\) other than \(C_1 = C_i\). So the gradient with respect to each score \(s_i\) in \(s\) depends only on the loss given by its binary problem.
- Caffe: Sigmoid Cross-Entropy Loss Layer
- PyTorch: BCEWithLogitsLoss
- TensorFlow: sigmoid_cross_entropy
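To illustrate the independence between binary problems stated before this list, here is a minimal sketch in plain Python. It is not the code of any of the layers listed; `sigmoid` and `bce_grad` are illustrative names. It relies on the standard result that the gradient of the sigmoid Cross-Entropy Loss with respect to a score is \(f(s_i) - t_i\), and checks that changing the score of one class leaves the gradients of the others untouched.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_grad(scores, targets):
    """Gradient of the summed sigmoid cross-entropy w.r.t. each raw score."""
    return [sigmoid(s) - t for s, t in zip(scores, targets)]

targets = [1.0, 0.0, 1.0]
grad_a = bce_grad([0.5, -1.2, 2.0], targets)
grad_b = bce_grad([0.5, 3.0, 2.0], targets)  # only the second score changed

# The gradients of the first and third class are compared across both runs.
print(grad_a[0] == grad_b[0], grad_a[2] == grad_b[2])
```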
Focal Loss was introduced by researchers from Facebook in this paper. They claim to improve one-stage object detectors using Focal Loss to train a detector they name RetinaNet. Focal Loss is a Cross-Entropy Loss that weighs the contribution of each sample to the loss based on its classification error. The idea is that, if a sample is already classified correctly by the CNN, its contribution to the loss decreases. With this strategy, they claim to solve the problem of class imbalance by making the loss implicitly focus on the problematic classes. Moreover, they also weight the contribution of each class to the loss for a more explicit class balancing. They use Sigmoid activations, so Focal Loss can also be considered a Binary Cross-Entropy Loss. We define it for each binary problem as:
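A reconstruction of that definition, assuming the notation used in this section (\(s_i\) is the sigmoid output and \(t_i\) the groundtruth label of each of the two classes of the binary problem):

\[
FL = -\sum_{i=1}^{C'=2} (1 - s_i)^{\gamma} \, t_i \log(s_i)
\]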
Where \((1 - s_i)^\gamma\), with the focusing parameter \(\gamma \geq 0\), is a modulating factor that reduces the influence of correctly classified samples on the loss. With \(\gamma = 0\), Focal Loss is equivalent to Binary Cross-Entropy Loss.
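The loss can also be written out per case. The following is a reconstruction, with \(s_1\) the sigmoid output for \(C_1\), obtained by using \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\) in the two-term sum over the binary problem:

\[
FL =
\begin{cases}
-(1 - s_1)^{\gamma} \log(s_1) & \text{if } t_1 = 1 \\
-s_1^{\gamma} \log(1 - s_1) & \text{if } t_1 = 0
\end{cases}
\]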
Where we have separated the formulation for when the class \(C_i = C_1\) is positive or negative (and therefore, the class \(C_2\) is positive). As before, we have \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\).
The gradient gets a bit more complex due to the inclusion of the modulating factor \((1 - s_i)^\gamma\) in the loss formulation, but it can be deduced using the Binary Cross-Entropy gradient expression.
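For a positive class (\(t_i = 1\)), differentiating \(-(1 - f(s_i))^{\gamma} \log(f(s_i))\) with respect to the score \(s_i\), and using \(f'(s_i) = f(s_i)(1 - f(s_i))\), gives the expression below. It is a reconstruction derived from the loss definition rather than a copy of the original formula:

\[
\frac{\partial FL}{\partial s_i} = (1 - f(s_i))^{\gamma} \left( \gamma \, f(s_i) \log(f(s_i)) + f(s_i) - 1 \right)
\]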
Where \(f()\) is the sigmoid function. To get the gradient expression for a negative \(C_i\) \((t_i = 0)\), we just need to replace \(f(s_i)\) with \((1 - f(s_i))\) in the expression above and flip its sign, since \(1 - f(s_i) = f(-s_i)\) is a function of \(-s_i\).
Notice that, if the focusing parameter \(\gamma = 0\), the modulating factor equals 1, the loss is equivalent to the CE Loss, and we end up with the same gradient expression.
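The gradient can be sanity-checked numerically. The sketch below is plain Python with illustrative names (`focal_loss`, `focal_grad`, `numeric_grad` are not from any library). It assumes the per-case form of the loss applied to a raw score passed through a sigmoid. For the negative class it swaps \(f(s_i)\) and \(1 - f(s_i)\) and also negates the result, which is what differentiating with respect to \(s_i\) yields.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss(s, t, gamma=2.0):
    """Focal loss of one binary problem; s is the raw score, t is 0 or 1."""
    p = sigmoid(s)
    if t == 1:
        return -((1.0 - p) ** gamma) * math.log(p)
    return -(p ** gamma) * math.log(1.0 - p)

def focal_grad(s, t, gamma=2.0):
    """Analytic derivative of focal_loss with respect to the raw score s."""
    p = sigmoid(s)
    if t == 1:
        return ((1.0 - p) ** gamma) * (gamma * p * math.log(p) + p - 1.0)
    # Negative class: swap p and (1 - p) in the line above and flip the sign.
    return -(p ** gamma) * (gamma * (1.0 - p) * math.log(1.0 - p) - p)

def numeric_grad(s, t, gamma, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (focal_loss(s + eps, t, gamma) - focal_loss(s - eps, t, gamma)) / (2 * eps)

for t in (0, 1):
    for gamma in (0.0, 2.0):
        print(t, gamma, focal_grad(0.3, t, gamma), numeric_grad(0.3, t, gamma))
```

With `gamma = 0.0` both branches reduce to \(f(s_i) - t_i\), the Binary Cross-Entropy gradient.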
Forward pass: Loss computation
Where logprobs[r] stores, for each element of the batch, the sum of the binary cross-entropy over all classes. The focusing_parameter is \(\gamma\), which by default is 2 and should be defined as a layer parameter in the net prototxt. The class_balances can be used to introduce different loss contributions per class, as they do in the Facebook paper.
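A minimal sketch of a forward pass matching this description, written in plain Python rather than as a Caffe layer. The function name `focal_forward`, the `(positive, negative)` weight layout of `class_balances` and the averaging over the batch are assumptions made for illustration; only `logprobs`, `focusing_parameter` and `class_balances` are names taken from the text.

```python
import math

def focal_forward(scores, labels, class_balances, focusing_parameter=2.0):
    """Return (mean loss over the batch, logprobs) for raw scores and 0/1 labels."""
    logprobs = [0.0] * len(scores)
    for r, (row_scores, row_labels) in enumerate(zip(scores, labels)):
        for c, (s, t) in enumerate(zip(row_scores, row_labels)):
            p = 1.0 / (1.0 + math.exp(-s))  # sigmoid activation
            w_pos, w_neg = class_balances[c]  # assumed per-class weight pair
            if t == 0:  # loss form for negative classes
                logprobs[r] += w_neg * -math.log(1.0 - p) * p ** focusing_parameter
            else:  # loss form for positive classes
                logprobs[r] += w_pos * -math.log(p) * (1.0 - p) ** focusing_parameter
    return sum(logprobs) / len(scores), logprobs

# Toy batch: 2 elements, 2 classes, no class re-weighting.
scores = [[2.0, -1.0], [0.5, 0.3]]
labels = [[1, 0], [0, 1]]
class_balances = [(1.0, 1.0), (1.0, 1.0)]
loss, logprobs = focal_forward(scores, labels, class_balances)
print(loss, logprobs)
```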
Backward pass: Gradients computation
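A minimal sketch of what this backward pass might compute, in plain Python rather than as a Caffe layer. The function name `focal_backward`, the `(positive, negative)` weight layout of `class_balances` and the division by the batch size are assumptions for illustration. The two branches implement the Focal Loss gradient with respect to the raw score for negative and positive labels.

```python
import math

def focal_backward(scores, labels, class_balances, focusing_parameter=2.0):
    """Gradient of the batch-averaged focal loss w.r.t. every raw score."""
    g = focusing_parameter
    num = len(scores)
    delta = [[0.0] * len(row) for row in scores]
    for r, (row_scores, row_labels) in enumerate(zip(scores, labels)):
        for c, (s, t) in enumerate(zip(row_scores, row_labels)):
            p = 1.0 / (1.0 + math.exp(-s))  # sigmoid activation
            w_pos, w_neg = class_balances[c]  # assumed per-class weight pair
            if t == 0:  # gradient for classes with negative labels
                grad = w_neg * -(p ** g) * (g * (1.0 - p) * math.log(1.0 - p) - p)
            else:  # gradient for classes with positive labels
                grad = w_pos * ((1.0 - p) ** g) * (g * p * math.log(p) + p - 1.0)
            delta[r][c] = grad / num
    return delta

# Toy batch: 2 elements, 2 classes, no class re-weighting.
scores = [[2.0, -1.0], [0.5, 0.3]]
labels = [[1, 0], [0, 1]]
class_balances = [(1.0, 1.0), (1.0, 1.0)]
delta = focal_backward(scores, labels, class_balances)
print(delta)
```

Positive labels receive a negative gradient (the score is pushed up) and negative labels a positive one (the score is pushed down).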
In the specific (and usual) case of Multi-Class classification the labels are one-hot, so only the positive class \(C_p\) keeps its term in the loss. There is only one element of the target vector \(t\) which is not zero, \(t_i = t_p\). So discarding the elements of the summation which are zero due to the target labels, we can write:
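Assuming, as is usual in this setting, that the class probabilities come from a Softmax over the scores \(s\), the reduced expression is:

\[
CE = -\sum_{i}^{C} t_i \log\left(\frac{e^{s_i}}{\sum_{j}^{C} e^{s_j}}\right) = -\log\left(\frac{e^{s_p}}{\sum_{j}^{C} e^{s_j}}\right)
\]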
This would be the pipeline for each one of the \(C\) classes. We set \(C\) independent binary classification problems \((C’ = 2)\). Then we sum up the loss over the different binary problems: we sum up the gradients of every binary problem to backpropagate, and the losses to monitor the global loss. \(s_1\) and \(t_1\) are the score and the groundtruth label for the class \(C_1\), which is also the class \(C_i\) in \(C\). \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\) are the score and the groundtruth label of the class \(C_2\), which is not a “class” in our original problem with \(C\) classes, but a class we create to set up the binary problem with \(C_1 = C_i\). We can understand it as a background class.
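The pipeline above can be sketched in plain Python. The function names `binary_problem_loss` and `multilabel_bce` are illustrative, and \(s_1\) is taken to be the sigmoid of the raw class score, which is an assumption consistent with \(s_2 = 1 - s_1\).

```python
import math

def binary_problem_loss(score, t1):
    """Cross-entropy of the binary problem built for one class C_1 = C_i."""
    s1 = 1.0 / (1.0 + math.exp(-score))  # sigmoid output for C_1
    s2, t2 = 1.0 - s1, 1.0 - t1          # background class C_2
    return -t1 * math.log(s1) - t2 * math.log(s2)

def multilabel_bce(scores, targets):
    """Sum the losses of the C independent binary problems."""
    per_class = [binary_problem_loss(s, t) for s, t in zip(scores, targets)]
    return sum(per_class), per_class

total, per_class = multilabel_bce([1.5, -0.7, 0.2], [1.0, 0.0, 1.0])
print(total, per_class)
```

Each entry of `per_class` depends only on the score and label of its own class, which is why the binary problems can be treated independently.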