
A Gentle Introduction to Q-Learning
Introduction
Reinforcement learning is a relatively lesser-known area of artificial intelligence (AI) compared to highly popular subfields today, such as machine learning, deep learning, and natural language processing. However, it holds significant potential for solving complex decision-making problems in which “intelligent” software entities called agents must learn how to act through interaction with their environment.
Reinforcement learning enables agents to learn through experience, maximizing cumulative rewards over time by choosing and performing sequences of actions. One of the most widely used algorithms in reinforcement learning is Q-learning, which enables an agent to learn the value of actions in different states without requiring a complete model of the environment in which it operates.
This article provides a gentle introduction to Q-learning, its principles, and the basic characteristics of its algorithms, presented in a clear and illustrative tone.
Before going further, if you are new to reinforcement learning, we recommend having a look at this introductory article, which covers basic concepts used later, such as the value function and the policy.
Q-Learning Basics
Q-learning belongs to a family of reinforcement learning algorithms called temporal difference learning, or TD learning for short. In TD learning, an agent learns directly from experience by repeatedly sampling interactions to estimate a value function, but at the same time, it also bootstraps—that is, it updates its value estimates based on other learned estimates rather than waiting for final outcomes—so it does not require full knowledge of the environment or of future rewards.
For example, consider a delivery robot in a warehouse that must learn the most efficient path from the entrance to various storage bins while avoiding obstacles and minimizing travel time. Using TD learning, the robot samples possible actions—e.g., move forward, move left, etc.—by navigating through the warehouse: it chooses directions, observes where each move leads, and receives time or penalty feedback for every move. It also bootstraps, updating the value estimate of its current location based on the estimated value of the next location it moves to, rather than waiting until the delivery trajectory is complete to assess how good each decision was.
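To make the idea of sampling plus bootstrapping concrete, here is a minimal Python sketch of a TD-style value update in the warehouse scenario. The location names, reward, learning rate, and discount factor are illustrative assumptions, not values from any particular library or dataset.

```python
# Minimal TD(0)-style update: the estimate for the current location is nudged
# toward the observed reward plus the discounted estimate of the next location.
# All names and numbers are illustrative assumptions.

alpha = 0.1   # learning rate: how strongly each experience adjusts the estimate
gamma = 0.9   # discount factor: how much future value matters

V = {"aisle_3": 0.0, "aisle_4": 0.0}   # value estimates for two hypothetical locations

state, next_state = "aisle_3", "aisle_4"
reward = -1.0   # small time penalty for the move

# Bootstrapping: V[next_state] is itself only an estimate, yet it is used right away.
td_target = reward + gamma * V[next_state]
V[state] = V[state] + alpha * (td_target - V[state])

print(V[state])   # the estimate has moved slightly toward the TD target
```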
Q-learning is a reinforcement learning method that, without needing a model of its environment, helps an agent figure out the best choices to make to get the biggest reward, simply by trying options and learning from what happens next. The “Q” in its name stands for quality, as the goal is to learn how good each action is in each situation. Unlike methods that need to understand in advance how the “world” (e.g., the physical warehouse in the previous example) works, Q-learning learns directly from experience. Also, while some other algorithms learn only from the exact strategy they are currently following, Q-learning is more flexible: it can learn about the best possible strategy even while acting according to a different, more exploratory one.
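The heart of Q-learning is a single update rule applied after every move. Below is a minimal sketch of that rule as a Python function, assuming the Q-table is stored as a dictionary of dictionaries (all names and default numbers here are assumptions for illustration). The maximum over the next state's actions is what gives the method its flexibility: it evaluates the best known choice regardless of which action the agent actually takes next.

```python
def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One Q-learning step on a Q-table stored as a dict of dicts.

    Q[state][action] holds the current estimate of how good `action` is in `state`.
    """
    best_next = max(Q[next_state].values())      # value of the greedy choice in the next state
    td_target = reward + gamma * best_next       # sampled reward + bootstrapped estimate
    Q[state][action] += alpha * (td_target - Q[state][action])
```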
A Gentle Example: Warehouse Grid
The following example illustrates, in a gentle tone and without complex math, how Q-learning works. Note that for a full understanding of the math underlying Q-learning, for instance the Bellman equation, you may want to look at further readings like this one.
Back to the example scenario of the delivery robot operating in a small warehouse, suppose the facilities are represented by a 3×3 grid of physical positions, as follows:
[ A ] [ B ] [ C ]
[ D ] [ E ] [ F ]
[ G ] [ H ] [ Goal ]
Say the robot starts at location A, and it wants to reach the “Goal” position at the bottom-right corner. Each move costs time, thereby incurring a small penalty or loss. Moreover, due to the nature of the facilities and the problem to address, hitting a wall or moving in the wrong direction is discouraged, whereas reaching the goal gives a reward.
At each step and location (state), the robot can try one of four possible actions: move up, move down, move right, or move left.
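As a rough Python sketch of such an environment (the grid layout follows the figure above, while the specific reward and penalty numbers are assumptions chosen only for illustration):

```python
# A toy 3x3 warehouse grid. Moving off the grid keeps the robot in place and is
# penalized; reaching the goal yields a positive reward. Numbers are assumptions.

GRID = [["A", "B", "C"],
        ["D", "E", "F"],
        ["G", "H", "Goal"]]

STATES = [s for row in GRID for s in row]
ACTIONS = ["up", "down", "left", "right"]

def step(state, action):
    """Return (next_state, reward) for taking `action` in `state`."""
    row, col = next((r, c) for r, line in enumerate(GRID)
                    for c, s in enumerate(line) if s == state)
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[action]
    new_row, new_col = row + dr, col + dc
    if not (0 <= new_row < 3 and 0 <= new_col < 3):
        return state, -1.0          # bumped into a wall: stay put, pay a penalty
    next_state = GRID[new_row][new_col]
    if next_state == "Goal":
        return next_state, 10.0     # reached the goal: positive reward
    return next_state, -0.1         # ordinary move: small time penalty
```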
A crucial element in Q-learning is a “lookup table,” resembling a memory notebook, in which the robot keeps track of the estimated value of each possible action at each state. These values are numerical: the higher, the better. Furthermore, they are dynamic: the robot iteratively updates or fine-tunes them based on its experience. Let’s suppose that after a few trials, the robot has learned the following about the actions and states it has experienced so far:
| Location | Move Right | Move Down | Move Left | Move Up |
|----------|------------|-----------|-----------|---------|
| A        | 0.1        | 0.3       | —         | —       |
| B        | 0.0        | 0.1       | 0.2       | —       |
| E        | 0.4        | 0.7       | 0.2       | 0.1     |
| H        | 1.0        | —         | 0.5       | 0.3     |
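In code, this lookup table is often just a nested dictionary (or a 2-D array) indexed by state and action. Here is a sketch matching the partial table above, reusing the hypothetical STATES and ACTIONS from the earlier snippet and keeping unseen entries at zero:

```python
# Q-table as a dict of dicts: Q[state][action] -> learned value estimate.
# Entries not yet experienced (the dashes in the table) simply stay at 0.0.
Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}

Q["A"].update({"right": 0.1, "down": 0.3})
Q["B"].update({"right": 0.0, "down": 0.1, "left": 0.2})
Q["E"].update({"right": 0.4, "down": 0.7, "left": 0.2, "up": 0.1})
Q["H"].update({"right": 1.0, "left": 0.5, "up": 0.3})
```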
It is important to clarify that, at the beginning, the robot doesn’t know anything: all values in the table default to zero or to some other initial value. It must start by trying actions at random and seeing what happens before it can build an approximate view of the environment:
- Suppose it starts at A and tries going down, ending up in D. If that route is busy and full of obstacles, the move takes extra time, so it was probably not the best immediate action.
- If it later tries moving right from A to B, then down to E, then to H, and finally reaches the Goal in a reasonable time, it updates the values in the table to mark these state-action choices as good ones. In Q-learning, not only is the short-term effect of the immediately chosen action taken into account, but also, to some extent, the propagated effect of subsequent actions.
In sum, every time the robot (agent) tries a path, it slightly updates the values in its table, gradually calibrating them toward what has worked better so far.
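Putting the pieces together, one training episode might look like the sketch below. It reuses the hypothetical step and q_learning_update functions from the earlier snippets and adds a simple epsilon-greedy rule, so the robot mostly exploits what it knows but still explores at random from time to time (the exploration rate and episode count are assumed values).

```python
import random

def run_episode(Q, epsilon=0.2, max_steps=50):
    """Run one episode from location A, updating the Q-table along the way."""
    state = "A"
    for _ in range(max_steps):
        # Epsilon-greedy: usually pick the best-known action, sometimes explore.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(Q[state], key=Q[state].get)
        next_state, reward = step(state, action)
        q_learning_update(Q, state, action, reward, next_state)
        state = next_state
        if state == "Goal":
            break

# Repeated episodes gradually calibrate the table toward the better routes.
for _ in range(500):
    run_episode(Q)
```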
In the long run, by applying this behavior, the agent ends up learning from its own experience, updating the so-called Q-table to reflect courses of action that yielded better outcomes. Not only does it learn the best route(s) from the initial position, but it also learns what to avoid, e.g., bouncing into walls, going into corners, etc. All without a full knowledge representation of the environment or, put another way, without a detailed map of the warehouse.
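Once the values have settled, the learned behavior can be read straight off the Q-table by picking the highest-valued action in each state, as in this short sketch based on the structures assumed above:

```python
# Read the greedy policy out of the learned Q-table.
policy = {s: max(Q[s], key=Q[s].get) for s in STATES if s != "Goal"}
print(policy)   # e.g., something like {"A": "right", "B": "down", ...}
```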
Concluding Remarks
Q-learning is much like learning to play a game in which choices must be made continuously: you play many times, remember what yielded better results, and gradually turn initially random choices into more intelligent ones. This article provided a gentle, math-free introduction to this area of reinforcement learning, which was one of the field’s breakthroughs when it was introduced.