Order between using validation, training and test sets

The Wikipedia article is not wrong; in my experience, this is a frequent point of confusion among newcomers to ML.

There are two separate ways of approaching the problem:

  • Either you use an explicit validation set to do hyperparameter search & tuning
  • Or you use cross-validation

The standard practice is to always set aside a portion of your data as a test set. It is used for nothing other than assessing the performance of your final model, once, at the end. If you go back and forth and assess on it repeatedly, you are in effect using your test set as a validation set, which is bad practice.

After you have done that, you choose whether to cut another portion of the remaining data to use as a separate validation set, or to proceed with cross-validation (in which case no separate, fixed validation set is required).
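To make the two options concrete, here is a minimal sketch in plain Python that works on sample indices only. The helper names `hold_out` and `k_fold`, the 100-sample dataset, and the split fractions are all assumptions made for illustration; in practice you would most likely use your ML library's own splitting utilities.

```python
import random

def hold_out(indices, fraction, seed=0):
    """Shuffle a copy of `indices` and split off `fraction` of them.

    Returns (remaining, held_out). Hypothetical helper for illustration.
    """
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_held = int(round(len(shuffled) * fraction))
    return shuffled[n_held:], shuffled[:n_held]

def k_fold(indices, k):
    """Yield (train, validation) index lists, one pair per fold."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

all_indices = list(range(100))  # assumed dataset of 100 samples

# Always: put the test set aside and touch it only for the final assessment.
development, test = hold_out(all_indices, fraction=0.2)

# Option A: carve a fixed validation set out of the remaining data.
train, validation = hold_out(development, fraction=0.25, seed=1)

# Option B: cross-validation; no fixed validation set, each fold
# of the remaining data takes one turn as the validation data.
cv_splits = list(k_fold(development, k=5))
```

In both options the test indices never take part in tuning; the only difference is whether the validation data is one fixed subset (Option A) or rotates through the remaining data (Option B).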

So, essentially, both your first and third approaches are valid (and mutually exclusive, i.e. you should choose which one you will go with). The second one, as you describe it (CV only in the validation set?), is certainly not (as said, when you choose to go with CV you don’t assign a separate validation set). Apart from a brief mention of cross-validation, what the Wikipedia article actually describes is your first approach.

Which approach is “better” cannot, of course, be answered at that level of generality; both approaches are valid, and each is used depending on the circumstances. Very loosely speaking, I would say that in most “traditional” (i.e. non-deep-learning) ML settings, most people go with cross-validation; but there are cases where this is not practical (most deep learning settings, again loosely speaking), and there people go with a separate validation set instead.
