How to calculate optimal batch size

From the recent Deep Learning book by Goodfellow et al., chapter 8:

Minibatch sizes are generally driven by the following factors:

  • Larger batches provide a more accurate estimate of the gradient, but
    with less than linear returns.
  • Multicore architectures are usually
    underutilized by extremely small batches. This motivates using some
    absolute minimum batch size, below which there is no reduction in the
    time to process a minibatch.
  • If all examples in the batch are to be
    processed in parallel (as is typically the case), then the amount of
    memory scales with the batch size. For many hardware setups this is
    the limiting factor in batch size.
  • Some kinds of hardware achieve
    better runtime with specific sizes of arrays. Especially when using
    GPUs, it is common for power of 2 batch sizes to offer better runtime.
    Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes
    being attempted for large models.
  • Small batches can offer a
    regularizing effect (Wilson and Martinez, 2003), perhaps due to the
    noise they add to the learning process. Generalization error is often
    best for a batch size of 1. Training with such a small batch size
    might require a small learning rate to maintain stability because of
    the high variance in the estimate of the gradient. The total runtime
    can be very high as a result of the need to make more steps, both
    because of the reduced learning rate and because it takes more steps
    to observe the entire training set.
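To make the first bullet concrete, here is a small NumPy sketch of mine (not from the book): it estimates the gradient of a toy linear-regression loss from minibatches of different sizes and shows that the spread of the estimate shrinks only roughly as 1/sqrt(m), i.e. with diminishing returns as the batch grows.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy linear-regression problem: loss = mean((Xw - y)^2), gradient taken w.r.t. w.
    n, d = 100_000, 10
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + rng.normal(scale=1.0, size=n)
    w = np.zeros(d)  # evaluate gradient noise at an arbitrary fixed point

    def minibatch_grad(idx):
        xb, yb = X[idx], y[idx]
        return 2.0 * xb.T @ (xb @ w - yb) / len(idx)

    for m in (1, 4, 16, 64, 256, 1024):
        grads = np.stack([minibatch_grad(rng.choice(n, size=m, replace=False))
                          for _ in range(500)])
        # Spread of the gradient estimate around its mean, averaged over dimensions;
        # it drops roughly by 1/sqrt(m), not linearly in m.
        noise = grads.std(axis=0).mean()
        print(f"batch size {m:5d}: gradient std ~ {noise:.3f}")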

In practice, this usually means "in powers of 2, and the larger the better, provided that the batch fits into your (GPU) memory".
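As a rough sketch of that rule of thumb (assuming PyTorch on a CUDA device; the model, input shape, and surrogate loss below are placeholders for your own), you can start from a large power of 2 and halve it until a forward/backward pass fits in memory:

    import torch

    def largest_fitting_batch_size(model, input_shape, device="cuda", start=1024):
        """Halve a power-of-2 batch size until one training step fits in memory."""
        model = model.to(device)
        batch_size = start
        while batch_size >= 1:
            try:
                x = torch.randn(batch_size, *input_shape, device=device)
                # Use the output sum as a surrogate loss so the backward pass
                # (which dominates memory use) is exercised too; in real code
                # you would run your actual loss on real-shaped targets.
                model(x).sum().backward()
                model.zero_grad(set_to_none=True)
                return batch_size
            except RuntimeError as err:      # CUDA OOM surfaces as a RuntimeError
                if "out of memory" not in str(err):
                    raise
                torch.cuda.empty_cache()
                batch_size //= 2             # try the next smaller power of 2
        raise RuntimeError("even a batch size of 1 does not fit in memory")

    # Hypothetical usage with a toy model:
    # print(largest_fitting_batch_size(torch.nn.Linear(4096, 10), (4096,)))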

You might also want to consult several good posts here on Stack Exchange:

Just keep in mind that the paper by Keskar et al., 'On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima', quoted by several of the posts above, has drawn objections from other respected researchers in the deep learning community.

Hope this helps…

UPDATE (Dec 2017):

There is a new paper by Yoshua Bengio and his team, Three Factors Influencing Minima in SGD (Nov 2017); it is worth reading, as it reports new theoretical and experimental results on the interplay between learning rate and batch size.
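One practical heuristic consistent with that result (not a prescription from the paper itself) is to keep the ratio of learning rate to batch size roughly constant when you change the batch size, i.e. scale the learning rate linearly; base_lr and base_batch_size below are assumed reference values tuned for your own setup:

    # Linear-scaling heuristic: keep learning_rate / batch_size roughly constant.
    # base_lr and base_batch_size are assumed reference values, not numbers
    # prescribed by the paper.
    def scaled_learning_rate(batch_size, base_lr=0.1, base_batch_size=256):
        return base_lr * batch_size / base_batch_size

    print(scaled_learning_rate(512))  # 0.2: double the batch, double the learning rate
    print(scaled_learning_rate(64))   # 0.025: a quarter of the batch, a quarter of the rate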

UPDATE (Mar 2021):

Also of interest here is another paper from 2018, Revisiting Small Batch Training for Deep Neural Networks (h/t to Nicolas Gervais), which runs contrary to the larger-the-better advice; quoting from the abstract:

The best performance has been consistently obtained for mini-batch sizes between m=2 and m=32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.