Why don’t you use the InfogainLoss layer to compensate for the imbalance in your training set?
The Infogain loss is defined using a weight matrix H (in your case 2-by-2). The meaning of its entries is:
[cost of predicting 1 when gt is 0, cost of predicting 0 when gt is 0
cost of predicting 1 when gt is 1, cost of predicting 0 when gt is 1]
So you can set the entries of H to reflect the difference between an error in predicting 0 and an error in predicting 1. You can find how to define the matrix H for Caffe in this thread.
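As a rough sketch (not the exact code from the linked thread), here is one way to construct such an H in Python and serialize it for Caffe's "InfogainLoss" layer. The file name and the concrete weight values are placeholders; a common choice is a diagonal H with a larger weight on the rare class, tuned to your data:

```python
import numpy as np
import caffe

# Example H for an imbalanced binary problem: put more weight on the rare class
# (here class 1). The exact values are an assumption and should be tuned.
H = np.array([[1.0, 0.0],
              [0.0, 9.0]], dtype='f4')

# Caffe expects H as a blob of shape (1, 1, L, L); serialize it to a binaryproto.
blob = caffe.io.array_to_blobproto(H.reshape(1, 1, 2, 2))
with open('infogain_H.binaryproto', 'wb') as f:  # placeholder file name
    f.write(blob.SerializeToString())

# In the train prototxt, point the loss layer at this file, e.g.:
# layer {
#   name: "loss"
#   type: "InfogainLoss"
#   bottom: "prob"    # class probabilities, e.g. the output of a Softmax layer
#   bottom: "label"
#   top: "loss"
#   infogain_loss_param { source: "infogain_H.binaryproto" }
# }
```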
Regarding sample weights, you may find this post interesting: it shows how to modify the SoftmaxWithLoss layer to take sample weights into account.
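Just to illustrate what such a layer computes (this is a plain numpy sketch, not the linked post's actual layer code): each example's cross-entropy term is scaled by its own weight before averaging.

```python
import numpy as np

def weighted_softmax_loss(scores, labels, sample_weights):
    """scores: (N, C) raw logits, labels: (N,) int class ids, sample_weights: (N,)."""
    # numerically stable softmax
    shifted = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    # per-example cross-entropy, scaled by the per-sample weight
    per_sample = -np.log(probs[np.arange(len(labels)), labels])
    return np.sum(sample_weights * per_sample) / np.sum(sample_weights)
```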
More recently, a modification to the cross-entropy loss was proposed by Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollár in Focal Loss for Dense Object Detection (ICCV 2017).
The idea behind focal loss is to assign a different weight to each example based on how difficult that example is to predict (rather than based on class size etc.). From the brief time I got to experiment with this loss, it feels superior to "InfogainLoss" with class-size weights.
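For the binary case, the paper defines FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). A small numpy sketch (gamma=2 and alpha=0.25 are the defaults reported in the paper; the function name is just illustrative):

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-12):
    """probs: (N,) predicted probability of class 1, labels: (N,) in {0, 1}."""
    p_t = np.where(labels == 1, probs, 1.0 - probs)       # probability of the true class
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)   # class-balance factor
    # easy examples (p_t close to 1) are down-weighted by (1 - p_t)^gamma
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps))
```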