The general inefficiency of batch training for gradient descent learning
Introduction
Neural networks are often trained using algorithms that approximate gradient descent. Gradient descent learning (also called steepest descent) can be done using either a batch method or an on-line method. In batch training, weight changes are accumulated over an entire presentation of the training data (an epoch) before being applied, while on-line training updates weights after the presentation of each training example (instance). Another alternative is sometimes called mini-batch (Sarle, 2002), in which weight changes are accumulated over some number u of instances before actually updating the weights. Using an update frequency (or batch size) of u=1 results in on-line training, while u=N results in batch training, where N is the number of instances in the training set. In each case a learning rate, r, is used to adjust the size of weight changes.
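To make the relationship among the three regimes concrete, the sketch below (written in Python, which the paper itself does not use; the function name per_instance_gradient and all variable names are illustrative placeholders) performs one epoch of training with a generic update frequency u. Setting u=1 reproduces on-line training, u=N batch training, and intermediate values mini-batch training.

    import numpy as np

    def train_one_epoch(weights, training_set, r, u, per_instance_gradient):
        """One epoch of gradient descent with learning rate r and update frequency u.

        u == 1                    -> on-line training (update after every instance)
        u == len(training_set)    -> batch training (a single update per epoch)
        1 < u < len(training_set) -> mini-batch training

        per_instance_gradient(weights, x, target) stands in for the gradient of
        the error on a single training instance (e.g. one backpropagation pass);
        it is a placeholder, not something defined in the paper.
        """
        accumulated = np.zeros_like(weights)
        count = 0
        for x, target in training_set:
            accumulated = accumulated + per_instance_gradient(weights, x, target)
            count += 1
            if count == u:                     # apply the accumulated weight change
                weights = weights - r * accumulated
                accumulated = np.zeros_like(weights)
                count = 0
        if count > 0:                          # apply any leftover partial batch
            weights = weights - r * accumulated
        return weights

The gradient computation is identical in all three regimes; they differ only in how often the accumulated change is actually applied to the weights.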
Most neural network researchers would agree that the question of whether batch or on-line training is faster and/or ‘better’ has been settled. However, it is interesting to note that people have unknowingly ‘settled’ upon two opposing views. Some believe that since batch uses the ‘true’ gradient direction for its updates, it is more ‘correct,’ and some researchers claim or imply that batch training is faster. Others note that in practice on-line training appears to train more quickly and produce better results than batch training.
This paper explains why on-line training can be expected to be at least as accurate as batch training while also being faster, and demonstrates empirically that on-line training is often orders of magnitude faster than batch training, especially on large training sets.
To briefly overview the main points of this paper, batch training is able to calculate the direction of the true gradient, but it does not know how far it can safely go in that direction before the gradient changes direction (or begins to go back ‘uphill’). The true gradient can curve considerably on its way to a minimum, but batch training can only take one step for each epoch, and each step is in a straight line.
As the size of the training set grows, the accumulated weight changes for batch training become large. This leads batch training to use unreasonably large steps, which in turn leads to unstable learning and to the overshooting of curves and local minima in the error landscape. Therefore, a reduction in the size of the learning rate (or, equivalently, using the average weight change instead of the sum) is required for batch training to be stable as the size of the training set grows. However, this means that the larger the training set is, the more computation is required for a weight change of the same magnitude.
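In symbols (writing E_i for the error on instance i, a notation introduced here rather than taken from the paper), the batch step accumulated over an epoch and the averaged alternative are:

    % Batch training sums all N per-instance gradients into one step per epoch:
    \Delta w \;=\; -\,r \sum_{i=1}^{N} \frac{\partial E_i}{\partial w}
    % The magnitude of this sum tends to grow with N, so keeping the step size
    % stable requires shrinking r roughly in proportion to 1/N, or equivalently
    % averaging the per-instance changes instead of summing them:
    \Delta w \;=\; -\,\frac{r}{N} \sum_{i=1}^{N} \frac{\partial E_i}{\partial w}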
On the other hand, on-line training uses the ‘local’ gradient of each training instance to determine what direction to go. These local gradients can be noisy and contradict each other, but will on average move in the direction of the true gradient. Since on-line training updates weights after each instance, it is able to follow curves in the gradient during the course of an epoch. It can also handle any size of training set without having to reduce the learning rate.
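The statement that the local gradients average out to the true gradient can be checked directly: when the overall error is the mean (or, up to a constant factor, the sum) of the per-instance errors, the batch gradient is exactly the mean of the per-instance gradients. A minimal sketch on a toy linear model follows; the data, the model, and all names are illustrative assumptions, not taken from the paper's experiments.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data for illustration only: a noisy linear target with 3 inputs.
    N = 1000
    X = rng.normal(size=(N, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=N)

    w = np.zeros(3)  # current weights of a linear model

    def instance_gradient(w, x, t):
        # Gradient of the per-instance squared error 0.5 * (x.w - t)^2
        return (x @ w - t) * x

    # The 'local' gradients vary from instance to instance...
    local_grads = np.array([instance_gradient(w, X[i], y[i]) for i in range(N)])
    print("spread of local gradients:", local_grads.std(axis=0))

    # ...but their mean is exactly the gradient of the mean squared error,
    # i.e. the direction batch training would compute. (Defining the error as a
    # sum rather than a mean only rescales this by N; the direction is the same.)
    batch_grad = X.T @ (X @ w - y) / N
    print("mean of local gradients :", local_grads.mean(axis=0))
    print("batch ('true') gradient :", batch_grad)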
Section 2 surveys the neural network literature to examine what is currently being taught about on-line and batch training. Some of the references make false claims or use misleading examples, while others appear entirely accurate. None of the literature surveyed, however, mentions the ability of on-line training to learn faster as a result of being able to follow the gradient more closely during an epoch and thus use a larger learning rate.
Section 3 explains the above arguments in more detail, describes each training method in terms of optimization theory, and gives several intuitive interpretations of batch and on-line training. It discusses how on-line and batch are similar in order to focus in on what is really different between the methods and how this difference gives on-line training such an advantage.
Some readers may find that Section 3 contains a bit of redundancy, in terms of explaining the same thing in several different ways. However, it is the experience of the authors that some neural network researchers have a difficult time overcoming long-held misunderstandings in this area (while others understand quickly and wonder what all the confusion is about). Hopefully by presenting the arguments in several ways there will remain little room for misunderstanding.
Section 4 presents empirical results of training multilayer perceptrons using error backpropagation with a variety of learning rates on 26 classification tasks from the UCI Machine Learning Database Repository (Blake & Merz, 1998), with both on-line and batch training. It also compares on-line, batch and mini-batch training on a speech recognition task with a large training set of 20,000 instances. The results strongly support the claim that on-line training is faster than and at least as accurate as batch training, especially for large training sets. The speech recognition experiments also show that the larger the batch size u, the greater the advantage of on-line training.
Section 4 also explains why using clusters of machines training a single neural network in parallel—which requires batch or mini-batch training—is likely to be slower (as well as more expensive) than using on-line training on a single machine to do the training alone.
Section 5 summarizes the conclusions from these results, which basically are that there are strong disadvantages and no apparent real advantages to using batch training, even for parallel training, and that on-line training should therefore be used in practical gradient descent training situations.
Section snippets
Survey of neural network literature
This section surveys the neural network literature to examine what is currently being taught about on-line and batch training. As evidenced by the literature and by recent discussions with various researchers, many are still of the opinion that batch training is as fast as or faster than on-line training, and/or that it is more correct. Unfortunately, this belief may lead those who use gradient descent algorithms in practice to waste vast amounts of computational resources when training neural networks.
Gradient descent learning
There are many different algorithms that use a gradient descent approach to learning (Hassoun, 1995), such as the perceptron learning rule (Rosenblatt, 1962), competitive learning (Rumelhart & McClelland, 1986), reinforcement learning (Barto, 1992), and self-organizing maps (Kohonen, 1982).
Gradient descent learning attempts to find a point in some parameter space (e.g. neural network weight space) that minimizes an error (or ‘loss’) function. The weight space and error function define an error surface.
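In its generic form (using the paper's learning rate r; the symbols w for the weight vector and E for the error function are standard notation, stated here only for completeness), each gradient descent step moves the weights a small distance in the direction of steepest decrease of the error:

    % Generic gradient descent update with learning rate r:
    \Delta w_j \;=\; -\,r\,\frac{\partial E}{\partial w_j}
    \qquad\Longleftrightarrow\qquad
    w \;\leftarrow\; w - r\,\nabla E(w)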
Empirical results
From the discussions in preceding sections, we hypothesized that on-line training is usually faster than batch training, especially for large training sets, because its ability to follow curves in the error surface allows it to safely use a larger learning rate. However, how much faster on-line training is depends on both the size of the training set and the characteristics of the particular application. As discussed above, as the training set gets larger, we expect the speedup of on-line training over batch training to increase.
Conclusions
A variety of common learning algorithms use gradient descent learning, which can be done with either batch or on-line training. Some neural network practitioners have been operating under the false assumption that batch training is as fast as or faster than on-line training, and/or that batch training is more correct. In this paper we have explained that on-line training learns faster than batch training because it takes many steps per epoch, which allows it to follow curves in the gradient and to safely use a larger learning rate. On-line training should therefore be preferred in practical gradient descent training situations.
References
Atiya, A. F., & Parlos, A. G. (2000). New results on recurrent network training: unifying the algorithms and accelerating convergence. IEEE Transactions on Neural Networks.
Barto, A. G. (1992). Reinforcement learning and adaptive critic methods.
Becker, S., & Le Cun, Y. (1988). Improving the convergence of backpropagation learning with second order methods.
Bengio, Y. (1991). Artificial neural networks and their application to sequence recognition. PhD thesis, McGill University.
Bengio, Y. (1996). Neural networks for speech and sequence recognition.
Bishop, C. M. (1997). Neural networks for pattern recognition.
Blake, C. L., & Merz, C. J. (1998). UCI Repository of Machine Learning Databases.
Bottou, L., et al. (1990). Speaker independent isolated digit recognition: multi-layer perceptrons vs. dynamic time-warping. Neural Networks.
Demuth, H., & Beale, M. (1994). Neural network toolbox user's guide.
Møller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks.