How to know if underfitting or overfitting is occurring?

What is overfitting?

Overfitting (or underfitting) occurs when a model is too specific (or not specific enough) to the training data and doesn’t extrapolate well to the true domain. I’ll just say overfitting from now on to save my poor typing fingers. [*]

I think the wikipedia image is good:

[Image: Wikipedia's overfitting illustration]

Clearly, the green line, a decision boundary trying to separate the red class from the blue, is “overfit”, because although it will do well on the training data, it lacks the “regularized” form we like to see when generalizing [**].
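If you want to see that numerically rather than visually, here’s a minimal sketch of my own (scikit-learn’s make_moons toy data, nothing from the image itself): a 1-nearest-neighbour classifier draws exactly that kind of jagged, point-hugging boundary, and the gap between its training and testing accuracy gives it away, while a smoother k = 25 boundary holds up much better.

```python
# Toy sketch (my own example, not the wikipedia figure): a 1-NN boundary
# memorizes the training points, while a 25-NN boundary stays smooth.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 25):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:2d}  train accuracy={clf.score(X_train, y_train):.2f}"
          f"  test accuracy={clf.score(X_test, y_test):.2f}")

# Typically k=1 is perfect on the training split but noticeably worse on the
# test split -- the numerical signature of the "too wiggly" green boundary.
```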

These CMU slides on overfitting/cross validation also make the problem clear:

[Image: CMU slide on overfitting and cross validation]

And here’s some more intuition for good measure


When does overfitting occur, generally?

Overfitting is observed numerically when the testing error does not reflect the training error.

Obviously, the testing error will always (in expectation) be worse than the training error, but after a certain number of iterations, the loss in testing will start to increase even as the loss in training continues to decline.
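To make that concrete, here’s a hedged little sketch (synthetic data and a gradient-boosted regressor of my own choosing, not anything from your setup) that records both errors at every iteration and finds the turning point:

```python
# Track training vs. testing error per boosting iteration on noisy toy data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1,
                                  max_depth=3, random_state=0)
model.fit(X_train, y_train)

# staged_predict yields predictions after each boosting iteration, so both
# loss curves come out of a single fit.
train_err = [mean_squared_error(y_train, p) for p in model.staged_predict(X_train)]
test_err = [mean_squared_error(y_test, p) for p in model.staged_predict(X_test)]

best = int(np.argmin(test_err))
print(f"testing error is lowest at iteration {best + 1}; after that it creeps "
      f"up while training error keeps falling")
```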


How to tell when a model has overfit visually?

Overfitting can be observed by plotting the decision boundary (as in the wikipedia image above) when dimensionality allows, or by looking at testing loss in addition to training loss during the fit procedure.

You don’t give us enough points to make these graphs, but here’s an example (from someone asking a similar question) showing what those loss graphs would look like:
[Image: overfit loss curves]

While loss curves are sometimes prettier and more logarithmic, note the trend here: training error is still decreasing, but testing error is on the rise. That’s a big red flag for overfitting. SO discusses loss curves here.
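If you want to draw that kind of plot yourself, here’s a minimal matplotlib sketch (with made-up curves, purely to show the shape):

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up curves shaped like a typical overfit run (not real training output):
# training loss keeps falling, testing loss bottoms out and turns back up.
epochs = np.arange(1, 101)
train_loss = np.exp(-epochs / 25)
test_loss = np.exp(-epochs / 25) + 0.004 * epochs

plt.plot(epochs, train_loss, label="training loss")
plt.plot(epochs, test_loss, label="testing loss")
plt.axvline(epochs[np.argmin(test_loss)], linestyle="--", color="gray",
            label="testing loss minimum")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```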

A slightly cleaner, more real-life example is from this CMU lecture on overfitting ANNs:

[Image: second overfitting example, from the CMU lecture]

The top graph is overfitting, as before. The bottom graph is not.


When does this occur?

When a model has too many parameters, it is susceptible to overfitting (like fitting an n-degree polynomial to only n − 1 points). Likewise, a model with too few parameters can underfit.
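A quick way to feel this out is numpy’s polyfit on noisy data (my own toy setup, not your model): too low a degree underfits, too high a degree chases the noise.

```python
# Fit polynomials of increasing degree to noisy samples of sin(2*pi*x) and
# compare training vs. testing mean squared error.
import numpy as np

rng = np.random.RandomState(0)
x_train = np.linspace(0, 1, 15)
x_test = np.linspace(0, 1, 200)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.2, size=x_test.size)

for degree in (1, 3, 12):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Expect something like: degree 1 poor on both splits (underfit), degree 3
# decent on both, degree 12 near-zero on training but worse on testing (overfit).
```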

Certain regularization techniques, like dropout or batch normalization, or more traditionally L1 regularization, combat this. I believe this is beyond the scope of your question.
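Still, for completeness, here’s a tiny hedged sketch of the L1 idea using scikit-learn’s Lasso against the same sort of over-parameterized polynomial model as above; the alpha value is just a made-up penalty strength, not a recommendation.

```python
# Compare an unpenalized degree-15 polynomial fit with an L1-penalized one.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.2, size=30)
x_test = rng.uniform(0, 1, size=(200, 1))
y_test = np.sin(2 * np.pi * x_test[:, 0]) + rng.normal(scale=0.2, size=200)

for name, reg in [("no penalty", LinearRegression()),
                  ("L1 penalty", Lasso(alpha=1e-3, max_iter=50_000))]:
    model = make_pipeline(PolynomialFeatures(degree=15), reg).fit(x, y)
    print(name, mean_squared_error(y_test, model.predict(x_test)))

# The L1 penalty shrinks (and zeroes out) many of the 16 coefficients, which
# usually reins in the wild high-degree terms and lowers the test error.
```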

Further reading:

  1. A good statistics-SO question and answers
  2. Dense reading: bounds on overfitting with some models
  3. Lighter reading: general overview
  4. The related bias-variance tradeoff

Footnotes

[*] There’s no reason to keep writing “overfitting/underfitting”, since the reasoning is the same for both; the indicators are just flipped, obviously (a decision boundary that hasn’t latched onto the true border enough, as opposed to one wrapped too tightly around individual points). In general, overfitting is the more common one to guard against, since “more iterations/more parameters” is the current theme. If you have lots of data and not a lot of parameters, maybe you really are worried about underfitting, but I doubt it.

[**] One way to formalize the idea that the black line is preferable to the green one in the first image from wikipedia is to penalize the number of parameters required by your model during model selection.
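One concrete (and hedged) way to do that is an information criterion such as BIC, here in its Gaussian-noise form BIC = n·ln(RSS/n) + k·ln(n). The sketch below scores polynomial fits of increasing degree on made-up data and prefers the one with the lowest BIC.

```python
# Score polynomial fits by BIC: goodness of fit plus a penalty per parameter.
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=40)

for degree in range(1, 10):
    residuals = y - np.polyval(np.polyfit(x, y, degree), x)
    rss = np.sum(residuals ** 2)
    k = degree + 1                          # fitted polynomial coefficients
    bic = x.size * np.log(rss / x.size) + k * np.log(x.size)
    print(f"degree {degree}: BIC = {bic:7.1f}")

# Lower BIC is better: the k*ln(n) term is exactly the "pay for every extra
# parameter" idea, so the winning degree is the model-selection analogue of
# preferring the smoother black boundary over the greedier green one.
```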
