The general inefficiency of batch training for gradient descent learning
Introduction
Neural networks are often trained using algorithms that approximate gradient descent. Gradient descent learning (also called steepest descent) can be done using either a batch method or an on-line method. In batch training, weight changes are accumulated over an entire presentation of the training data (an epoch) before being applied, while on-line training updates weights after the presentation of each training example (instance). Another alternative is sometimes called mini-batch (Sarle, 2002), in which weight changes are accumulated over some number u of instances before actually updating the weights. Using an update frequency (or batch size) of u=1 results in on-line training, while u=N results in batch training, where N is the number of instances in the training set. In each case a learning rate, r, is used to adjust the size of weight changes.
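To make the relationship among the three regimes concrete, the sketch below (written in Python, which the paper itself does not use; the function name per_instance_gradient and all variable names are illustrative placeholders) performs one epoch of training with a generic update frequency u. Setting u=1 reproduces on-line training, u=N batch training, and intermediate values mini-batch training.

    import numpy as np

    def train_one_epoch(weights, training_set, r, u, per_instance_gradient):
        """One epoch of gradient descent with learning rate r and update frequency u.

        u == 1                    -> on-line training (update after every instance)
        u == len(training_set)    -> batch training (a single update per epoch)
        1 < u < len(training_set) -> mini-batch training

        per_instance_gradient(weights, x, target) stands in for the gradient of
        the error on a single training instance (e.g. one backpropagation pass);
        it is a placeholder, not something defined in the paper.
        """
        accumulated = np.zeros_like(weights)
        count = 0
        for x, target in training_set:
            accumulated = accumulated + per_instance_gradient(weights, x, target)
            count += 1
            if count == u:                     # apply the accumulated weight change
                weights = weights - r * accumulated
                accumulated = np.zeros_like(weights)
                count = 0
        if count > 0:                          # apply any leftover partial batch
            weights = weights - r * accumulated
        return weights

The gradient computation is identical in all three regimes; they differ only in how often the accumulated change is actually applied to the weights.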
Most neural network researchers would agree that the question of whether batch or on-line training is faster and/or ‘better’ has been settled. However, it is interesting to note that people have unknowingly ‘settled’ upon two opposing views. Some believe that since batch uses the ‘true’ gradient direction for its updates, it is more ‘correct,’ and some researchers claim or imply that batch training is faster. Others note that in practice on-line training appears to train more quickly and produce better results than batch training.
This paper explains why on-line training can be expected to be at least as accurate as batch training while also being faster, and demonstrates empirically that on-line training is often orders of magnitude faster than batch training, especially on large training sets.
To briefly overview the main points of this paper, batch training is able to calculate the direction of the true gradient, but it does not know how far it can safely go in that direction before the gradient changes direction (or begins to go back ‘uphill’). The true gradient can curve considerably on its way to a minimum, but batch training can only take one step for each epoch, and each step is in a straight line.
As the size of the training set grows, the accumulated weight changes for batch training become large. This leads batch training to use unreasonably large steps, which in turn leads to unstable learning and to the overshooting of curves and local minima in the error landscape. Therefore, a reduction in the size of the learning rate (or, equivalently, using the average weight change instead of the sum) is required for batch training to be stable as the size of the training set grows. However, this means that the larger the training set is, the more computation is required for a weight change of the same magnitude.
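In symbols (writing E_i for the error on instance i, a notation introduced here rather than taken from the paper), the batch step accumulated over an epoch and the averaged alternative are:

    % Batch training sums all N per-instance gradients into one step per epoch:
    \Delta w \;=\; -\,r \sum_{i=1}^{N} \frac{\partial E_i}{\partial w}
    % The magnitude of this sum tends to grow with N, so keeping the step size
    % stable requires shrinking r roughly in proportion to 1/N, or equivalently
    % averaging the per-instance changes instead of summing them:
    \Delta w \;=\; -\,\frac{r}{N} \sum_{i=1}^{N} \frac{\partial E_i}{\partial w}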
On the other hand, on-line training uses the ‘local’ gradient of each training instance to determine what direction to go. These local gradients can be noisy and contradict each other, but will on average move in the direction of the true gradient. Since on-line training updates weights after each instance, it is able to follow curves in the gradient during the course of an epoch. It can also handle any size of training set without having to reduce the learning rate.
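The statement that the local gradients average out to the true gradient can be checked directly: when the overall error is the mean (or, up to a constant factor, the sum) of the per-instance errors, the batch gradient is exactly the mean of the per-instance gradients. A minimal sketch on a toy linear model follows; the data, the model, and all names are illustrative assumptions, not taken from the paper's experiments.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data for illustration only: a noisy linear target with 3 inputs.
    N = 1000
    X = rng.normal(size=(N, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=N)

    w = np.zeros(3)  # current weights of a linear model

    def instance_gradient(w, x, t):
        # Gradient of the per-instance squared error 0.5 * (x.w - t)^2
        return (x @ w - t) * x

    # The 'local' gradients vary from instance to instance...
    local_grads = np.array([instance_gradient(w, X[i], y[i]) for i in range(N)])
    print("spread of local gradients:", local_grads.std(axis=0))

    # ...but their mean is exactly the gradient of the mean squared error,
    # i.e. the direction batch training would compute. (Defining the error as a
    # sum rather than a mean only rescales this by N; the direction is the same.)
    batch_grad = X.T @ (X @ w - y) / N
    print("mean of local gradients :", local_grads.mean(axis=0))
    print("batch ('true') gradient :", batch_grad)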
Section 2 surveys the neural network literature to examine what is currently being taught about on-line and batch training. Some of the references make false claims or use misleading examples, while others appear entirely accurate. None of the literature surveyed, however, mentions the ability of on-line training to learn faster as a result of being able to follow the gradient more closely during an epoch and thus use a larger learning rate.
Section 3 explains the above arguments in more detail, describes each training method in terms of optimization theory, and gives several intuitive interpretations of batch and on-line training. It discusses how on-line and batch are similar in order to focus in on what is really different between the methods and how this difference gives on-line training such an advantage.
Some readers may find that Section 3 contains a bit of redundancy, in terms of explaining the same thing in several different ways. However, it is the experience of the authors that some neural network researchers have a difficult time overcoming long-held misunderstandings in this area (while others understand quickly and wonder what all the confusion is about). Hopefully by presenting the arguments in several ways there will remain little room for misunderstanding.
Section 4 presents empirical results of training multilayer perceptrons using error backpropagation with a variety of learning rates on 26 classification tasks from the UCI Machine Learning Database Repository (Blake & Merz, 1998), with both on-line and batch training. It also compares on-line, batch and mini-batch training on a speech recognition task with a large training set of 20,000 instances. The results strongly support the claim that on-line training is faster than and at least as accurate as batch training, especially for large training sets. The speech recognition experiments also show that the larger the batch size u, the greater the advantage of on-line training.
Section 4 also explains why using clusters of machines training a single neural network in parallel—which requires batch or mini-batch training—is likely to be slower (as well as more expensive) than using on-line training on a single machine to do the training alone.
Section 5 summarizes the conclusions from these results, which basically are that there are strong disadvantages and no apparent real advantages to using batch training, even for parallel training, and that on-line training should therefore be used in practical gradient descent training situations.
Section snippets
Survey of neural network literature
This section surveys the neural network literature to examine what is currently being taught about on-line and batch training. As evidenced by the literature and by recent discussions with various researchers, many are still of the opinion that batch training is as fast as or faster than on-line training, and/or that it is more correct. Unfortunately, this belief may lead those who use gradient descent algorithms in practice to waste vast amounts of computational resources when training neural networks.
Gradient descent learning
There are many different algorithms that use a gradient descent approach to learning (Hassoun, 1995), such as the perceptron learning rule (Rosenblatt, 1962), competitive learning (Rumelhart & McClelland, 1986), reinforcement learning (Barto, 1992), and self-organizing maps (Kohonen, 1982).
Gradient descent learning attempts to find a point in some parameter space (e.g. neural network weight space) that minimizes an error (or ‘loss’) function. The weight space and error function define an error surface.
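In its generic form (using the paper's learning rate r; the symbols w for the weight vector and E for the error function are standard notation, stated here only for completeness), each gradient descent step moves the weights a small distance in the direction of steepest decrease of the error:

    % Generic gradient descent update with learning rate r:
    \Delta w_j \;=\; -\,r\,\frac{\partial E}{\partial w_j}
    \qquad\Longleftrightarrow\qquad
    w \;\leftarrow\; w - r\,\nabla E(w)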
Empirical results
From the discussions in preceding sections, we hypothesized that on-line training is usually faster than batch training, especially for large training sets, because its ability to follow curves in the error surface allows it to safely use a larger learning rate. However, how much faster on-line training is depends on both the size of the training set and the characteristics of the particular application. As discussed above, as the training set gets larger, we expect the speedup of on-line training over batch training to increase.
Conclusions
A variety of common learning algorithms use gradient descent learning, which can be done with either batch or on-line training. Some neural network practitioners have been operating under the false assumption that batch training is as fast as or faster than on-line training, and/or that batch training is more correct. In this paper we have explained that on-line training learns faster than batch training because it takes many steps per epoch, which allows it to follow curves in the gradient and to safely use a larger learning rate. On-line training should therefore be preferred in practical gradient descent training situations.
References
Atiya, A. F., & Parlos, A. G. (2000). New results on recurrent network training: unifying the algorithms and accelerating convergence. IEEE Transactions on Neural Networks.
Barto, A. G. (1992). Reinforcement learning and adaptive critic methods.
Becker, S., & Le Cun, Y. (1988). Improving the convergence of backpropagation learning with second order methods.
Bengio, Y. (1991). Artificial neural networks and their application to sequence recognition. PhD thesis, McGill University.
Bengio, Y. (1996). Neural networks for speech and sequence recognition.
Bishop, C. M. (1997). Neural networks for pattern recognition.
Blake, C. L., & Merz, C. J. (1998). UCI Repository of Machine Learning Databases.
Bottou, L., et al. (1990). Speaker independent isolated digit recognition: multi-layer perceptrons vs. dynamic time-warping. Neural Networks.
Demuth, H., & Beale, M. (1994). Neural network toolbox user's guide.
Møller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks.