Select Page

The History of Reinforcement Learning

by | Data Science, Machine Learning, Research

Reinforcement learning (RL) is an exciting and rapidly developing area of machine learning that significantly impacts the future of technology and our everyday lives. RL is a field separate from supervised and unsupervised learning focusing on solving problems through a sequence or sequences of decisions optimized by maximizing the accrual of rewards received by taking correct decisions. RL originates from animal learning in experimental psychology and optimal control theory whilst also drawing from and contributing to neuroscience. This article will give a brief overview of the history of RL, from its origins to the modern day. However, this article will not cover the technicalities of RL in-depth, which we will cover in a separate article. Through this article, you will gain an insight into the motivation for development in RL, the different philosophies that drive the field and what fascinating discoveries have been made and lie ahead in the future.

infographic for history of reinforcement
Infographic for History of Reinforcement Learning

Origins in Animal Learning

The origin of RL is two-pronged, namely, animal learning and optimal control. Starting with animal learning, Edward Thorndike described the essence of trial-and-error learning with the “Law of Effect” in 1911, which, to paraphrase, states that an animal will pursue the repetition of actions if they reinforce satisfaction and will be deterred from actions that produce discomfort. Furthermore, the greater the level of pleasure or pain, the greater the pursuit or deterrence from the action. The Law of Effect describes the effect of reinforcing behaviour from positive stimuli and is widely regarded as the basic defining principle of future descriptions of behaviour. The Law of Effect combines selectional and associative learning, where selectional involves trying alternatives and selecting from them based on the outcomes, associative is where options are found by selection associated with particular situations. The term reinforcement was formally used in the context of animal learning in 1927 by Pavlov, who described reinforcement as the strengthening of a pattern of behaviour due to an animal receiving a stimulus – a reinforcer – in a time-dependent relationship with another stimulus or with a response.

Picture of Thorndike's Cat Box
Thorndike’s Cat Box. Source

Turing’s Unorganised Machines

In 1948, Alan Turing presented a visionary survey of the prospect of constructing machines capable of intelligent behaviour in a report called “Intelligent Machinery”. Turing may have been the first to suggest using randomly connected networks of neuron-like nodes to perform computation and proposed the construction of large, brain-like networks of such neurons capable of being trained as one would teach a child. Turing called his networks “unorganized machines”.

Turing described three types of unorganized machines. A-type and B-type unorganized machines consist of randomly connected two-state neurons. The P-type unorganized machines, which are not neuron-like, have “only two interfering inputs, one for pleasure or reward and the other for pain or punishment”. Turing studied P-type machines to try and discover training procedures analogous to children learning. He stated that by applying “appropriate inference, mimicking education”, a B-type machine can be trained to “do any required job, given sufficient time and provided the number of units is sufficient”.

Trial-and-error learning led to the production of many electro-mechanical machines. Thomas Ross, in 1933 built a machine that could find its way through a simple maze and remember the path through the configuration of switches. In 1952, Claude Shannon demonstrated a maze-running mouse named Theseus that used trial-and-error to navigate a maze. The maze itself remembered the successive directions using magnets and relays under its floor. In 1954, Marvin Minsky discussed computational reinforcement learning methods and described his construction of an analogue machine composed of components he named SNARCs (Stochastic Neural-Analog Reinforcement Calculators). SNARCs were intended to resemble modifiable synaptic connections in the brain. In 1961 Minsky addressed the Credit Assignment Problem, which describes how to distribute credit for success among the decisions that may have contributed.

Research in computational trial-and-error processes eventually generalized to pattern recognition before being absorbed into supervised learning, where error information is used to update neuron connection weights. Investigation into RL faded throughout the 1960s and 1970s. However, in 1963, although relatively unknown, John Andreae developed pioneering research, including the STELLA system, which learns through interaction with its environment, and machines with an “internal monologue” and then later machines that can learn from a teacher.

Photo of Minsky's SNARC
Photo of Minsky’s SNARC. Source

Origins in Optimal Control

Optimal Control research began in the 1950s as a formal framework to define optimization methods to derive control policies in continuous time control problems, as shown by Pontryagin and Neustadt in 1962. Richard Bellman developed dynamic programming as both a mathematical optimization and computer programming method to solve control problems. The process defines a functional equation using the dynamic system’s state and returns what is referred to as an optimal value function. The optimal function is commonly referred to as the Bellman equation. Bellman introduced the Markovian Decision Process (MDP), which we define as a discrete stochastic version of the optimal control problem. Ronald Howard, in 1960 devised the policy iteration method for MDPs. All of these are essential elements underpinning the theory and algorithms of modern reinforcement learning.

What is the difference between reinforcement learning and optimal control?

A common question that arises is what is the difference between optimal control and reinforcement learning. The modern understanding appreciates all of the work in optimal control as related work in reinforcement learning. Reinforcement learning problems are closely associated with optimal control problems, particularly stochastic ones such as those formulated as MDPs. Solution methods of optimal control such as dynamic programming are also considered reinforcement learning methods., they gradually reach the correct answer through successive approximations. Reinforcement learning can be thought of as generalizing or extending ideas from optimal control to non-traditional control problems.

Learning Automata

In the early 1960s, research in learning automata commenced and can be traced back to Michael Lvovitch Tsetlin in the Soviet Union. A learning automaton is an adaptive decision-making unit situated in a random environment that learns the optimal action through repeated interactions with its environment. The steps are chosen according to a specific probability distribution, based on the response from the environment. Learning automata are considered as policy iterators in RL. Tsetlin devised the Tsetlin Automaton, which is regarded as an even more fundamental and versatile learning mechanism than the artificial neuron. The Tsetlin Automaton is one of the pioneering solutions to the well-known multi-armed bandit problem and continues to be used for pattern classification and formed the core of more advanced learning automata designs, including decentralized control and equi-partitioning and faulty dichotomous search.

Hedonistic Neurons

In the late 1970s and early 1980s, Harry Klopf was dissatisfied with the focus on equilibrium-seeking processes for explaining natural intelligence and providing a basis for machine intelligence. These include homeostasis and error-correction learning commonly associated with supervised learning. He argued that systems that try to maximize a quantity are qualitatively different from equilibrium seeking systems. Furthermore, he argued that maximizing systems is integral to understanding crucial aspects of natural intelligence and building artificial intelligence. Klopf hypothesized that neurons are individually hedonistic in that they work to maximize a neuron-local analogue of pleasure while minimizing a neuron-local of pain.

Annotated diagram of a neuron
Annotated diagram of a neuron. Source

Klopf’s idea of hedonistic neurons was that neurons implement a neuron-local version of the law of effect. He hypothesized that the synaptic weights of neurons change with experience. When a neuron fires an action potential, all the synapses that contribute to the action potential change the efficacies. If the action potential is rewarded, all eligible synapses’ effectiveness increases (or decrease if punished). Therefore, synapses change to alter the neuron’s firing patterns to increase the neuron’s probability of being rewarded and reduce the likelihood of being penalized by its environment.

This hypothesis produces a significant distinction between supervised learning, which is essentially an equilibrium-seeking process and reinforcement learning, which is effectively an evaluation-driven system where the learner’s decisions evolve in response to its experiences. Both error correction and RL are optimization processes, but error correction is more restricted, and RL is generalized and motivated by maximizing rewards through action optimization.

Overlap of Neurobiology and Reinforcement Learning

Neurobiology has explored the different forms of learning, namely unsupervised, supervised, and reinforcement learning within the cortex-cerebellum-basal ganglia system. Disentangling these learning processes and assigning their implementation to distinct brain areas has been a fundamental challenge for research in neurosciences. Several data suggested by Houk and Wise in 1995 and Schultz in 1997 indicate that the neuromodulation dopamine provides basal ganglia target structures with phasic signals that convey a reward prediction error that can influence reinforcement learning processes. It is, therefore, possible to characterize the functionality of the basal ganglia as an abstract search through the space of possible actions guided by dopaminergic feedback.

Recent research has explored how the three brain areas form a highly integrated system combining the different learning mechanisms into a super-learning process allowing for learning of flexible motor behaviour. Super-learning refers to the other learning mechanisms acting in synergy as opposed to in isolation.

Temporal Difference

Temporal difference (TD) learning is inspired by mathematical differentiation and aims to build accurate reward predictions from delayed rewards. TD tries to predict the combination of immediate reward and its reward prediction at the next time step. When the next time step arrives, the latest prediction is compared against what it was expected to be with new information. If there is a difference, the algorithm calculates the error, which is the “temporal difference” to adjust the old prediction towards the latest forecast. The algorithm aims to bring the old and new predictions closer together at every time step, ensuring the entire chain of predictions incrementally becomes more accurate.

TD learning is most closely associated with Sutton, whose 1984 PhD dissertation addressed TD learning and whose 1988 paper, in which the term Temporal Difference was first used, has become the definitive reference.

The origins of temporal difference methods are motivated strongly by theories of animal learning, particularly the concept of secondary reinforcers. A secondary reinforcer is a stimulus paired with a primary reinforcer, for example, the presence of food. The secondary reinforcer adopts similar properties to the primary. The temporal difference method entwined with the trial-and-error method when Klopf in 1972 and 1975 explored reinforcement learning in large systems as decomposed into individual sub-components of the more extensive procedure, each with their excitatory inputs rewards and inhibitory inputs as punishments, and each could reinforce one another.

Incorporating animal learning theory with methods of learning driven by changes in temporally successive predictions, including the temporal credit assignment problem, led to an explosion in reinforcement learning research. In particular, the development of the ‘Actor-Critic Architecture’ as applied to the pole-balancing problem by Barto et al. in 1983. Actor-critic methods are TD methods with a separate memory structure to explicitly represent the policy independent of the value function. The policy structure, which is used to select actions, is known as the actor, and the estimated value function, which criticizes the actions made by the actor, is known as the critic. The critique takes the form of a TD error, which is the sole output of the critic and drives all learning in both actor and critic. In 1984 and 1986, the Actor-Critic architecture was extended to integrate with backpropagation neural network techniques.

TD Gammon

In 1992, Gerry Tesauro developed a programme that required little backgammon knowledge yet learned to play the game at the grandmaster level. The learning algorithm combined the TD-lambda algorithm and a non-linear function approximation using a multilayer neural network trained by backpropagating TD errors. Based on TD-Gammon’s success and further analysis, the best human players now play the unconventional opening positions learned by the algorithm.

Gerry Tesauro demonstrating TD-Gammon
Gerald Tesauro with TD-Gammon. Source


Chris Watkins introduced Q-learning in 1989 in his PhD thesis “Learning from Delayed Rewards”, which introduced a model of reinforcement learning as incrementally optimizing control of a Markovian Decision Process and proposed Q-learning as a way to learn optimal control directly without modelling the transition probabilities or expected rewards of the Markovian Decision Process. Watkins and Peter Dayan presented a convergence proof in 1992. A Q-value function shows us how good a specific action is, given a state for an agent following a policy. Q-learning is the process of iteratively updating Q-values for each state-action pair using the Bellman Equation until the Q-function eventually converges to Q*. Q-learning is a model-free reinforcement learning algorithm and can handle stochastic transitions and rewards without adaptations.

These innovations stimulated significant subsequent research in reinforcement learning.

Deep Reinforcement Learning and Deep Q-learning

Photo of Ke Jie playing AlphaGo
At the time, World Number 1 Go player Ke Jie during his second match against AlphaGo in Wuzhen, 2017. Source

Alongside the rising interest in neural networks beginning in the mid-1980s, interest grew in deep reinforcement learning, where a neural network represents policies or value functions. TD-Gammon was the first successful application of reinforcement learning with neural networks. Around 2012, a deep learning revolution was spurred by the fast implementations of convolutional neural networks on graphical processing units for computer vision, which led to an increased interest in using deep neural networks as function approximations across various domains. Applying neural networks is particularly useful for replacing value iteration algorithms that directly update q-value tables as the agent learns. Value iteration is suitable for tasks with a small state space, but if there are more complex environments, the number of computational resources and time needed to traverse the new state and modify Q-values will be extremely prohibitive or unfeasible. Instead of computing Q-values directly through value iterations, a function approximation can estimate the optimal Q-function. A neural network receives states from an environment as input and outputs estimated Q-values for each action an agent can choose in those states. The values are compared to Q* target values to calculate the loss. The neural network’s weights are updated using backpropagation and stochastic gradient descent to produce Q-values that minimize the error and converge on the agent’s optimal actions.

Google DeepMind and Video Games

Around 2013, DeepMind developed deep Q-learning, a combination of convolution neural network architecture and Q-learning. Deep Q-learning facilitates Experience Replay, which stores and replays states and allows the network to learn in small batches to avoid skewing training and speed up implementation. They tested the system on video games such as Space Invaders and Breakout. Without altering the code, the network learns how to play the game and, after several iterations, surpasses human performance. DeepMind published further research on their system surpassing human abilities in other games like Seaquest and Q*Bert.

DeepMind Algorithhm learning the game Breakout


In 2014, DeepMind published research on the computer program able to play Go. In October 2015, a computer Go program called AlphaGo beat the European Go champion, Fan Hui. This event was the first time artificial intelligence defeated a professional Go player. In March 2016, AlphaGo beat Lee Sedol, one of the highest-ranked players in the world, with a score of 4-1 in a five-game match. In the 2017 Future of Go Summit, AlphaGo won a three-game match with Ke Jie, who was the number one ranked player in the world for two years. Later that year, an improved version of AlphaGo, AlphaGo Zero, defeated AlphaGo 100 games to 0. This new version beat its predecessor after three days with less processing power than AlphaGo, which comparatively took months to learn how to play.

In Google’s work with AlphaZero in 2017, the system was able to play chess at a superhuman level within four hours of training, using 5,000 first-generation tensor processing units and 64 second-generation tensor processing units. 21

Modern Developments

The research community is still in the early stages of understanding thoroughly how practical deep reinforcement learning is to other domains. AlphaFold, developed by DeepMind, applies artificial intelligence to amino acid folding, one of the most important goals pursued by computational biology and is essential for medicinal applications such as drug design and biotechnology such as novel enzyme design. Deep reinforcement learning has shown extreme proficiency in solving problems within constrained environments. Potential real-life applications include robotics, processing of structured medical images, self-driving cars.

Diagram of amino acid folding
Amino acid folding, source

Developments have been made into making deep reinforcement learning more efficient. Google Brain proposed Adaptive Behavior Policy Sharing, an optimization strategy that allows for selective information sharing across a pool of agents. DeepMind published research in 2020, exploring the Never Give Up strategy, which uses k-nearest neighbours over the agent’s recent experience to train the directed exploratory policies to solve complex exploration games.

Concluding Remarks

Reinforcement learning has an extensive history with a fascinating cross-pollination of ideas, generating research that sent waves through behavioural science, cognitive neuroscience, machine learning, optimal control, and others. This field of study has evolved rapidly since its inception in the 1950s, where the theory and concepts were fleshed out, to the application of theory through neural networks leading to the conquering of electronic video games and the advanced board games Backgammon, Chess, and Go. The fantastic exploits in gaming have given researchers valuable insights into the applicability and limitations of deep reinforcement learning. Deep reinforcement learning can be computationally prohibitive to achieve the most acclaimed performance seen. New approaches are being explored, such as multi-environment training and leveraging language modelling to extract high-level extractions to learn more efficiently. The question of whether deep reinforcement learning is a step toward artificial general intelligence is still an open one, given that reinforcement learning works best in constrained environments. Generalization is the most significant hurdle to overcome. However, artificial general intelligence need not be the ultimate pursuit for this strand of research. We will see in the future that reinforcement learning will continue to augment modern society through robotics, medicine, business and industry. As computing resources become more available, the entry-level innovation in reinforcement learning will lower, meaning research will not be limited to the behemoth tech companies like Google. It appears that reinforcement learning has a long and bright future and will continue to be an area of exciting research in artificial intelligence.

Thank you for reading to the end of this article. I hope through reading this; you can see how complex and multi-faceted the origins and developments within RL are and how RL as a field of research receives contributions from neighbouring areas and reciprocates insight and innovation in kind. Would you please explore the links provided throughout to dig further into the rich history of RL? If you are excited to learn more about reinforcement learning, stay tuned for a more technical exploration of the fundamentals of reinforcement learning. Meanwhile, if you are interested in learning a more comprehensive history of machine learning and artificial intelligence, please click on another article on my site titled “The History of Machine Learning“. For more information on what it takes to be a research scientist and the differences between that role and the role of a data scientist or data engineer, please click through to my article titled “Key Differences Between Data Scientist, Research Scientist, and Machine Learning Engineer Roles“.

See you soon, and happy researching!