Multi-layer perceptron (MLP) architecture: criteria for choosing number of hidden layers and size of the hidden layer?

How many hidden layers?

A model with zero hidden layers will resolve linearly separable data. So unless you already know your data isn’t linearly separable, it doesn’t hurt to verify this; why use a more complex model than the task requires? If the data is linearly separable, a simpler technique will work, and a Perceptron will do the job as well.
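One way to run that check is to train a plain perceptron and see whether it reaches zero training errors. The sketch below is a minimal illustration, not a definitive test: the function name `perceptron_separates` and the `max_epochs` cutoff are assumptions made for this example, and a `False` result only means the perceptron did not converge within the cutoff, which suggests (but does not prove) that the data is not linearly separable.

```python
def perceptron_separates(X, y, max_epochs=1000):
    """Return True if a perceptron reaches zero training errors within max_epochs.

    X: list of feature lists; y: list of labels in {-1, +1}.
    (Hypothetical helper for illustration.)
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            activation = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * activation <= 0:  # misclassified, or exactly on the boundary
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                errors += 1
        if errors == 0:
            return True  # a separating hyperplane was found
    return False  # no convergence within the cutoff


# AND is linearly separable; XOR is the classic case that is not.
points = [[0, 0], [0, 1], [1, 0], [1, 1]]
and_labels = [-1, -1, -1, 1]
xor_labels = [-1, 1, 1, -1]
print(perceptron_separates(points, and_labels))
print(perceptron_separates(points, xor_labels))
```

If the check succeeds, a model with no hidden layer is enough for that data set.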

Assuming your data does require separation by a non-linear technique, always start with one hidden layer. Almost certainly that’s all you will need. If your data is separable using an MLP, then that MLP probably needs only a single hidden layer. There is theoretical justification for this (the universal approximation results for single-hidden-layer networks), but my reason is purely empirical: many difficult classification/regression problems are solved using single-hidden-layer MLPs, yet I don’t recall encountering any multiple-hidden-layer MLPs used to successfully model data, whether on ML bulletin boards, in ML textbooks, in academic papers, etc. They exist, certainly, but the circumstances that justify their use are empirically quite rare.

How many nodes in the hidden layer?

From the MLP academic literature, my own experience, etc., I have gathered and often rely upon several rules of thumb (RoT), which I have also found to be reliable guides (i.e., the guidance was accurate, and even when it wasn’t, it was usually clear what to do next):

RoT based on improving convergence:

When you begin the model building, err on the side of more nodes
in the hidden layer.

Why? First, a few extra nodes in the hidden layer aren’t likely to do any harm; your MLP will still converge. On the other hand, too few nodes in the hidden layer can prevent convergence. Think of it this way: additional nodes provide some excess capacity, that is, additional weights to store/release signal to the network during iteration (training, or model building). Second, if you begin with additional nodes in your hidden layer, it’s easy to prune them later (as iteration progresses). This is common, and there are diagnostic techniques to assist you (e.g., a Hinton diagram, which is just a visual depiction of the weight matrices, a ‘heat map’ of the weight values).
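To make the diagnostic idea concrete, here is a rough text-mode stand-in for a Hinton diagram. It is only a sketch: a real Hinton diagram draws squares whose area tracks weight magnitude, whereas this hypothetical `text_hinton` helper maps magnitude onto a short ramp of characters (an assumption made for this example). Rows whose cells are all faint hint at nodes that may be candidates for pruning.

```python
def text_hinton(weights, symbols=" .:oO@"):
    """Render a weight matrix as text; denser characters mean larger |w|.

    weights: list of rows (lists of floats). Each cell is a sign followed by
    a character chosen from `symbols` according to |w| relative to the
    largest magnitude in the matrix. (Hypothetical helper for illustration.)
    """
    largest = max(abs(w) for row in weights for w in row) or 1.0
    lines = []
    for row in weights:
        cells = []
        for w in row:
            level = int(abs(w) / largest * (len(symbols) - 1))
            sign = "-" if w < 0 else "+"
            cells.append(sign + symbols[level])
        lines.append(" ".join(cells))
    return "\n".join(lines)


# Made-up weights: one row per hidden node, one column per input.
example_weights = [[0.0, 1.0], [-0.5, 0.3]]
print(text_hinton(example_weights))
```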

RoTs based on size of input layer and size of output layer:

A rule of thumb is for the size of this [hidden] layer to be somewhere
between the input layer size … and the output layer size….

To calculate the number of hidden nodes we use a general rule of:
(Number of inputs + outputs) x 2/3
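As a worked example of the two quoted rules, the sketch below computes both for a given input and output size. The function name `hidden_size_candidates` and the choice to round the two-thirds rule to the nearest integer are assumptions made for this illustration; the quoted rules do not say how to round.

```python
def hidden_size_candidates(n_inputs, n_outputs):
    """Apply the two size-based rules of thumb quoted above.

    Returns the (low, high) range between output and input size, and the
    (inputs + outputs) * 2/3 value rounded to the nearest integer.
    """
    low, high = sorted((n_inputs, n_outputs))
    two_thirds = round((n_inputs + n_outputs) * 2 / 3)
    return {"between": (low, high), "two_thirds_rule": two_thirds}


# 10 inputs and 2 outputs: somewhere between 2 and 10, or (10 + 2) * 2/3 = 8.
print(hidden_size_candidates(10, 2))
```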

RoT based on principal components:

Typically, we specify as many hidden nodes as dimensions [principal
components] needed to capture 70-90% of the variance of the input data
set.
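The principal-components rule can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the function name `components_for_variance` is hypothetical, the input is assumed to be a samples-by-features array with non-zero total variance, and any target in the quoted 70-90% range can be passed in.

```python
import numpy as np


def components_for_variance(X, target=0.9):
    """Smallest number of principal components whose cumulative share of
    the variance reaches `target` (a fraction such as 0.7 or 0.9)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance along each principal axis.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(cumulative))  # guard against round-off near 1.0


# Toy data: almost all variance lies along the first feature.
toy = [[-10, 0, 0], [10, 0, 0], [0, -1, 0], [0, 1, 0]]
print(components_for_variance(toy, target=0.9))
```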

And yet the NN FAQ author calls these rules “nonsense” (literally) because they ignore the number of training instances, the noise in the targets (values of the response variables), and the complexity of the feature space.

In his view (and it always seemed to me that he knows what he’s talking about), you should choose the number of neurons in the hidden layer based on whether your MLP includes some form of regularization or early stopping.

The only valid technique for optimizing the number of neurons in the Hidden Layer:

During your model building, test obsessively; testing will reveal the signatures of “incorrect” network architecture. For instance, if you begin with an MLP having a hidden layer comprised of a small number of nodes (which you will gradually increase as needed, based on test results), your training and generalization error will both be high, caused by bias and underfitting.

Then increase the number of nodes in the hidden layer, one at a time, until the generalization error begins to increase, this time due to overfitting and high variance.
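The search described above can be sketched as a simple loop. This is an illustration rather than a full training script: `validation_error` is a hypothetical callable that would train an MLP with the given number of hidden nodes and return its error on held-out data, and `max_nodes` is an assumed safety cap. The demo uses a made-up U-shaped curve in place of real training.

```python
def choose_hidden_size(validation_error, start=1, max_nodes=64):
    """Add hidden nodes one at a time; stop when validation error first rises.

    validation_error: callable mapping a hidden-layer size to the error on
    held-out data (hypothetical; it would train and score an MLP).
    Returns (best_size, best_error).
    """
    best_size = start
    best_error = validation_error(start)
    for size in range(start + 1, max_nodes + 1):
        error = validation_error(size)
        if error > best_error:
            break  # generalization error has started to climb
        best_size, best_error = size, error
    return best_size, best_error


# Stand-in for real training: an invented curve with its minimum at 7 nodes.
def fake_curve(n_hidden):
    return (n_hidden - 7) ** 2 + 1


print(choose_hidden_size(fake_curve))
```

Real validation curves are noisy, so in practice it may be worth averaging several training runs per size before deciding that the error has truly turned upward.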


In practice, I do it this way:

input layer: the size of my data vector (the number of features in my model) + 1 for the bias node, not including the response variable, of course

output layer: solely determined by my model: regression (one node) versus classification (number of nodes equivalent to the number of classes, assuming softmax)

hidden layer: to start, one hidden layer with a number of nodes equal to the size of the input layer. The “ideal” size is more likely to be smaller (i.e., some number of nodes between the number in the input layer and the number in the output layer) rather than larger; again, this is just an empirical observation, and the bulk of this observation is my own experience. If the project justifies the additional time required, then I start with a single hidden layer comprised of a small number of nodes, then (as I explained just above) I add nodes to the hidden layer, one at a time, while calculating the generalization error, training error, bias, and variance. When generalization error has dipped and just before it begins to increase again, the number of nodes at that point is my choice. See figure below.

[Figure: image not available]
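The starting sizes listed above can be written down as a small helper. The function name `initial_layer_sizes` and its `task` strings are assumptions made for this sketch; the arithmetic follows the three items as stated (features + 1 bias node, one output for regression or one per class for softmax classification, and a hidden layer that starts at the input-layer size).

```python
def initial_layer_sizes(n_features, task, n_classes=None):
    """Starting (input, hidden, output) sizes per the practice described above.

    task: "regression" or "classification" (hypothetical labels).
    """
    input_size = n_features + 1  # +1 for the bias node
    if task == "regression":
        output_size = 1
    elif task == "classification":
        if n_classes is None:
            raise ValueError("n_classes is required for classification")
        output_size = n_classes  # one softmax output per class
    else:
        raise ValueError("task must be 'regression' or 'classification'")
    hidden_size = input_size  # starting point only; tune against validation error
    return input_size, hidden_size, output_size


# Four features, three classes.
print(initial_layer_sizes(4, "classification", n_classes=3))
```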
