Why do we need to call zero_grad() in PyTorch?

In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., updating the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient … Read more
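
As a rough illustration of where this call sits in a typical training loop (the model, data, and hyperparameters below are placeholders, not taken from the original answer):

```python
import torch
import torch.nn as nn

# Minimal sketch of a PyTorch training step; model and data are illustrative.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for inputs, targets in [(torch.randn(4, 10), torch.randn(4, 1))]:
    optimizer.zero_grad()             # clear gradients accumulated from the previous mini-batch
    outputs = model(inputs)           # forward pass
    loss = loss_fn(outputs, targets)
    loss.backward()                   # backward pass: gradients are *added* to the .grad buffers
    optimizer.step()                  # update weights and biases with the fresh gradients
```

Without the `zero_grad()` call, each `backward()` would add its gradients on top of the previous mini-batch's, which is rarely what you want outside of deliberate gradient accumulation.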

How to interpret caffe log with debug_info?

At first glance you can see this log section divided in two: [Forward] and [Backward]. Recall that neural network training is done via forward-backward propagation: a training example (batch) is fed to the net, and a forward pass outputs the current prediction. Based on this prediction a loss is computed. The loss is then derived, … Read more
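
To make the two halves of the log concrete, here is a conceptual sketch of one forward-backward iteration. It is written in PyTorch purely for illustration (the names are not Caffe API calls); Caffe's debug_info log reports the per-blob statistics produced by exactly these two phases.

```python
import torch
import torch.nn as nn

# One forward-backward iteration; network and data are illustrative placeholders.
net = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
criterion = nn.CrossEntropyLoss()

batch = torch.randn(8, 10)           # a training batch is fed to the net...
labels = torch.randint(0, 2, (8,))

logits = net(batch)                  # [Forward]: layer activations flow bottom-up to a prediction
loss = criterion(logits, labels)     # a loss is computed from that prediction
loss.backward()                      # [Backward]: the loss gradient is propagated back through the layers
```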

Common causes of nans during training of neural networks

I came across this phenomenon several times. Here are my observations. Gradient blow-up. Reason: large gradients throw the learning process off-track. What you should expect: in the runtime log, look at the loss values per iteration. You’ll notice that the loss starts to grow significantly from iteration to iteration, and eventually the loss … Read more
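
A small monitoring sketch of that symptom (the model, data, and the deliberately oversized learning rate below are assumptions for demonstration, not part of the original answer): track the per-iteration loss and flag when it jumps sharply or overflows.

```python
import math
import torch
import torch.nn as nn

# Illustrative setup: a learning rate chosen far too high so the loss blows up.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=10.0)
loss_fn = nn.MSELoss()

prev = None
for iteration in range(20):
    x, y = torch.randn(16, 10), torch.randn(16, 1)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    value = loss.item()
    if math.isnan(value) or math.isinf(value):
        # the end state of the pattern described above: loss has diverged
        print(f"iter {iteration}: loss diverged ({value}) - likely gradient blow-up")
        break
    if prev is not None and value > 10 * prev:
        # loss growing significantly from iteration to iteration
        print(f"iter {iteration}: loss jumped from {prev:.3g} to {value:.3g}")
    prev = value
```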