Reinforcement Learning: An Introduction
A comprehensive foundational textbook on reinforcement learning, covering key algorithms such as Q-learning, Sarsa, and TD-learning, while bridging the gap between tabular methods and function approximation.
Course Overview
📚 Content Summary
Master the definitive science of goal-directed learning from interaction.
Authors: Richard S. Sutton and Andrew G. Barto
Acknowledgments: Supported by the Air Force Office of Scientific Research, the National Science Foundation, and GTE Laboratories.
🎯 Learning Objectives
- Define Reinforcement Learning and distinguish between immediate rewards and long-term value functions.
- Identify and describe the four sub-elements of a reinforcement learning system.
- Apply the Temporal-Difference (TD) update rule to a state-value lookup table.
- Implement incremental update rules for both stationary and nonstationary reward distributions.
- Evaluate the effectiveness of Optimistic Initial Values and Upper-Confidence-Bound (UCB) action selection in promoting exploration.
- Explain the mechanics of Gradient Bandit algorithms using numerical preferences and softmax distributions.
- Define the agent–environment interface and model complex tasks using the MDP framework.
- Calculate expected returns for both episodic and continuing tasks using discounting and unified notation.
- Distinguish between state-value (v_\pi) and action-value (q_\pi) functions and derive them using Bellman equations.
- Perform Policy Evaluation to compute state-value functions for arbitrary policies using iterative methods.
🔹 Lesson 1: The Reinforcement Learning Problem (Ch. 1)
Overview: This lesson defines Reinforcement Learning (RL) as a computational approach to learning from interaction, focusing on goal-directed agents. It details the four essential elements of RL systems—policy, reward, value function, and model—and demonstrates these concepts through a Tic-Tac-Toe example using the Temporal-Difference (TD) learning update rule. Finally, it traces the historical roots of RL through the convergence of optimal control (Dynamic Programming) and the psychology of trial-and-error learning (The Law of Effect).
Learning Outcomes:
- Define Reinforcement Learning and distinguish between immediate rewards and long-term value functions.
- Identify and describe the four sub-elements of a reinforcement learning system.
- Apply the Temporal-Difference (TD) update rule to a state-value lookup table.
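The tabular TD update used in the Tic-Tac-Toe example can be sketched in a few lines. This is a minimal illustration, not the book's full self-play program; the state names and step size are illustrative assumptions.

```python
# Tabular TD update V(s) <- V(s) + alpha * (V(s') - V(s)),
# the form used in the Tic-Tac-Toe example, where there are no
# intermediate rewards and terminal values are fixed.

def td_update(V, state, next_state, alpha=0.1):
    """Move V[state] a fraction alpha toward V[next_state]."""
    V[state] = V[state] + alpha * (V[next_state] - V[state])
    return V

# A tiny lookup table: one non-terminal position and a winning (terminal) one.
V = {"mid_game": 0.5, "win": 1.0}
td_update(V, "mid_game", "win", alpha=0.1)
print(V["mid_game"])  # 0.5 + 0.1 * (1.0 - 0.5) = 0.55
```

Repeated over many games, this backup propagates the value of winning positions to the states that precede them.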
🔹 Lesson 2: Multi-armed Bandits (Ch. 2)
Overview: This lesson covers advanced strategies for solving the multi-armed bandit problem, focusing on how agents can efficiently estimate action values and balance exploration with exploitation. We transition from simple sample-average methods to incremental implementations that can track nonstationary problems, and explore sophisticated selection mechanisms like UCB, Optimistic Initial Values, and Gradient Bandits. Finally, we introduce the concept of Associative Search (Contextual Bandits) as a bridge toward full Reinforcement Learning.
Learning Outcomes:
- Implement incremental update rules for both stationary and nonstationary reward distributions.
- Evaluate the effectiveness of Optimistic Initial Values and Upper-Confidence-Bound (UCB) action selection in promoting exploration.
- Explain the mechanics of Gradient Bandit algorithms using numerical preferences and softmax distributions.
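The two incremental update rules from this lesson can be sketched side by side. The reward values below are made up for illustration.

```python
# Incremental action-value updates from Chapter 2: sample averages
# (step size 1/n) suit stationary problems, while a constant step
# size alpha weights recent rewards more, tracking nonstationarity.

def update_sample_average(q, n, reward):
    """Q_{n+1} = Q_n + (1/n)(R_n - Q_n); n counts rewards including this one."""
    return q + (reward - q) / n

def update_constant_step(q, reward, alpha=0.1):
    """Q_{n+1} = Q_n + alpha (R_n - Q_n): a recency-weighted average."""
    return q + alpha * (reward - q)

rewards = [1.0, 0.0, 1.0]
q_avg = 0.0
for n, r in enumerate(rewards, start=1):
    q_avg = update_sample_average(q_avg, n, r)
print(q_avg)  # exactly the sample average of the three rewards: 2/3
```

Both rules have the same "old estimate plus step size times error" shape; only the step-size schedule differs.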
🔹 Lesson 3: Finite Markov Decision Processes (Ch. 3)
Overview: This lesson introduces the formal mathematical framework of Finite Markov Decision Processes (MDPs), the standard way to describe Reinforcement Learning problems. It explores the interaction between an agent and its environment, defines the reward hypothesis, and establishes the fundamental Bellman equations used to calculate value functions and identify optimal policies.
Learning Outcomes:
- Define the agent–environment interface and model complex tasks using the MDP framework.
- Calculate expected returns for both episodic and continuing tasks using discounting and unified notation.
- Distinguish between state-value (v_\pi) and action-value (q_\pi) functions and derive them using Bellman equations.
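The discounted return at the heart of these definitions can be computed with a single backward pass over an episode's rewards. The reward sequence and discount rate below are illustrative.

```python
# Discounted return from Chapter 3:
# G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
# computed backwards via the recursion G_t = R_{t+1} + gamma * G_{t+1}.

def discounted_return(rewards, gamma=0.9):
    """Fold from the end of the episode toward the front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The same recursion, taken in expectation over policy and dynamics, is what the Bellman equation for v_pi expresses.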
🔹 Lesson 4: Dynamic Programming (Ch. 4)
Overview: This lesson covers Dynamic Programming (DP) as a collection of algorithms used to compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP). It explores the iterative processes of Policy Evaluation and Improvement, the unification of these processes into Policy and Value Iteration, and the flexible framework of Generalized Policy Iteration (GPI), including the computational efficiency and asynchronous variations of these methods.
Learning Outcomes:
- Perform Policy Evaluation to compute state-value functions for arbitrary policies using iterative methods.
- Implement and differentiate between Policy Iteration and Value Iteration to find optimal policies and value functions.
- Understand the mechanics of Asynchronous Dynamic Programming and the conceptual framework of Generalized Policy Iteration (GPI).
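Iterative policy evaluation can be sketched on a toy problem. The two-state chain below (with deterministic dynamics and an undiscounted return) is an illustrative assumption, not an example from the book.

```python
# Iterative policy evaluation (Chapter 4): sweep the states, replacing
# each V(s) with its one-step Bellman backup, until the largest change
# in a sweep falls below a threshold theta.

def policy_evaluation(gamma=1.0, theta=1e-8):
    # transitions[s] = (reward, next_state, next_is_terminal)
    # State 0 leads to state 1 with reward 0; state 1 ends with reward 1.
    transitions = {0: (0.0, 1, False), 1: (1.0, None, True)}
    V = {0: 0.0, 1: 0.0}
    while True:
        delta = 0.0
        for s, (r, s2, terminal) in transitions.items():
            v_new = r + (0.0 if terminal else gamma * V[s2])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    return V

V = policy_evaluation()
print(V)  # value 1.0 propagates back from state 1 to state 0
```

Value Iteration differs only in that the backup takes a max over actions instead of following a fixed policy.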
🔹 Lesson 5: Monte Carlo Methods (Ch. 5)
Overview: This lesson focuses on Monte Carlo (MC) methods for estimating value functions and discovering optimal policies through sampled experience. We explore the transition from policy evaluation to control, comparing on-policy methods (including epsilon-greedy strategies) and off-policy methods that utilize importance sampling to learn about a target policy from a different behavior policy.
Learning Outcomes:
- Design and implement Monte Carlo Control algorithms with and without Exploring Starts.
- Contrast on-policy and off-policy learning architectures, specifically the role of the "coverage" assumption.
- Calculate and apply Ordinary and Weighted Importance Sampling ratios to adjust for policy divergence.
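The two importance-sampling estimators can be sketched for a single state. Each sample pairs an episode return G with the importance ratio rho (the product of pi(a|s)/b(a|s) along the episode); the numbers below are illustrative.

```python
# Ordinary vs. weighted importance sampling (Chapter 5) for off-policy
# Monte Carlo prediction at one state.

def ordinary_is(samples):
    """Mean of rho * G over episodes: unbiased but can have high variance."""
    return sum(rho * g for g, rho in samples) / len(samples)

def weighted_is(samples):
    """Sum of rho * G divided by the sum of rhos: biased but lower variance."""
    total_rho = sum(rho for _, rho in samples)
    return sum(rho * g for g, rho in samples) / total_rho

samples = [(1.0, 2.0), (0.0, 0.5)]  # (return G, importance ratio rho)
print(ordinary_is(samples))  # (2.0 * 1.0 + 0.5 * 0.0) / 2 = 1.0
print(weighted_is(samples))  # 2.0 / (2.0 + 0.5) = 0.8
```

The divergence between the two estimates on the same data illustrates the bias–variance trade-off discussed in the chapter.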
🔹 Lesson 6: Temporal-Difference Learning (Ch. 6)
Overview: This lesson covers Temporal-Difference (TD) learning, a central and novel idea in reinforcement learning. TD learning combines Monte Carlo (MC) methods, which learn from experience, with Dynamic Programming (DP) methods, which bootstrap (update estimates based on other learned estimates). We will explore TD prediction, its optimality in batch settings, and its application to control via Sarsa (on-policy), Q-learning (off-policy), and Expected Sarsa.
Learning Outcomes:
- Define TD(0) prediction and explain the concept of "bootstrapping" in the context of the TD target and error.
- Compare and contrast the optimality of TD(0) and the concept of Certainty Equivalence in batch training.
- Implement and differentiate between Sarsa (on-policy) and Q-learning (off-policy) control algorithms.
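The on-policy/off-policy distinction shows up directly in the update rules, which can be sketched as follows. The states, actions, and values below are illustrative.

```python
# One-step Sarsa vs. Q-learning (Chapter 6). Both move Q(s,a) toward a
# TD target, but Sarsa bootstraps from the action actually taken next,
# while Q-learning bootstraps from the greedy (max-value) action.

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=1.0):
    """On-policy target: r + gamma * Q[s2][a2] for the chosen next action a2."""
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha=0.5, gamma=1.0):
    """Off-policy target: r + gamma * max over a' of Q[s2][a']."""
    Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])

Q = {"s": {"left": 0.0, "right": 0.0}, "s2": {"left": 1.0, "right": 2.0}}
q_learning_update(Q, "s", "right", r=0.0, s2="s2")
print(Q["s"]["right"])  # 0 + 0.5 * (0 + 2.0 - 0) = 1.0
```

Expected Sarsa replaces both targets with the expectation of Q[s2][a'] under the current policy.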
🔹 Lesson 7: Multi-step Bootstrapping and Eligibility Traces (Ch. 7)
Overview: This lesson explores the bridge between n-step TD methods and Monte Carlo techniques through the framework of eligibility traces. It introduces the lambda-return as a theoretical bridge (the Forward View) and Eligibility Traces as a mechanistic implementation (the Backward View), culminating in the study of different trace update rules such as Accumulating, Replacing, and Dutch traces.
Learning Outcomes:
- Define the lambda-return and explain how it averages n-step returns using exponential weighting.
- Contrast the Forward View (theoretical/look-ahead) with the Backward View (mechanistic/look-back) of TD(lambda).
- Implement and compare different eligibility trace mechanisms: Accumulating, Replacing, and Dutch traces.
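The backward-view bookkeeping can be sketched by tracking a trace for one repeatedly visited state. The lambda, gamma, and visit sequence are illustrative; Dutch traces, which interpolate between the two rules shown, are omitted for brevity.

```python
# Eligibility-trace mechanics (Chapter 7): every step all traces decay
# by gamma * lambda; on a visit, an accumulating trace adds 1 while a
# replacing trace resets to 1.

def decay_traces(e, gamma=0.9, lam=0.8):
    """Fade every state's trace by gamma * lambda."""
    for s in e:
        e[s] *= gamma * lam

def visit(e, state, kind="accumulating"):
    """Bump the visited state's trace according to the trace type."""
    if kind == "accumulating":
        e[state] = e.get(state, 0.0) + 1.0
    else:  # replacing
        e[state] = 1.0

e_acc, e_rep = {}, {}
for _ in range(2):  # visit the same state on two consecutive steps
    visit(e_acc, "A", "accumulating")
    visit(e_rep, "A", "replacing")
    decay_traces(e_acc)
    decay_traces(e_rep)
print(e_acc["A"], e_rep["A"])  # accumulating exceeds replacing: 1.2384 vs 0.72
```

On revisits, accumulating traces can grow above 1 (and over-weight frequently visited states), which is exactly the behavior replacing traces were introduced to curb.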
🔹 Lesson 8: Planning and Learning with Tabular Methods (Ch. 8)
Overview: This lesson explores the integration of model-based reinforcement learning (planning) and model-free reinforcement learning (learning). It details how agents can use a model of the environment to generate simulated experience for value updates, primarily through the Dyna-Q architecture. Furthermore, the lesson compares different backup strategies (full vs. sample) and examines specialized search techniques such as Prioritized Sweeping, Trajectory Sampling, and Monte Carlo Tree Search to optimize computational efficiency.
Learning Outcomes:
- Differentiate between direct reinforcement learning and model-based planning within the Dyna architecture.
- Evaluate the computational trade-offs between full backups and sample backups based on branching factors.
- Explain how Prioritized Sweeping and Trajectory Sampling focus computation on the most informative state-action updates.
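The Dyna-Q loop can be sketched for a single real transition. The deterministic table model and the toy two-action world are illustrative assumptions.

```python
# Dyna-Q (Chapter 8): after each real step, (1) apply direct RL,
# (2) record (s, a) -> (r, s') in a table model, and (3) replay n
# simulated transitions from the model with the same Q-learning update.

import random

def q_learning_update(Q, s, a, r, s2, alpha=0.5, gamma=0.9):
    target = r + gamma * max(Q.get((s2, b), 0.0) for b in ("L", "R"))
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

Q, model = {}, {}

# One real experience: action "R" in state 0 yields reward 1, leads to state 1.
s, a, r, s2 = 0, "R", 1.0, 1
q_learning_update(Q, s, a, r, s2)   # direct reinforcement learning
model[(s, a)] = (r, s2)             # model learning

# Planning: replay model transitions (here the only one stored).
for _ in range(5):
    (ps, pa), (pr, ps2) = random.choice(list(model.items()))
    q_learning_update(Q, ps, pa, pr, ps2)

print(Q[(0, "R")])  # six identical backups: 1 - 0.5**6 = 0.984375
```

Prioritized Sweeping improves on the uniform `random.choice` here by replaying the transitions whose predicted value change is largest.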
🔹 Lesson 9: Function Approximation and Policy Gradient Methods (Ch. 9-11)
Overview: This lesson explores how Reinforcement Learning scales to large or continuous state spaces by transitioning from tabular representations to Function Approximation. We cover value prediction using gradient-descent methods, various linear feature construction techniques (like Tile Coding and RBFs), and extend these concepts to control via Actor–Critic architectures and the Average-Reward (R-Learning) setting.
Learning Outcomes:
- Define the objective of value prediction using function approximation and the role of the on-policy distribution.
- Compare and implement linear feature construction methods, including Tile Coding, Radial Basis Functions (RBFs), and Kanerva Coding.
- Design and analyze Actor–Critic architectures utilizing eligibility traces for policy and value updates.
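The move from tables to function approximation can be sketched with a linear value estimate and a semi-gradient TD(0) update. The one-hot features below stand in for a real construction such as tile coding, and all values are illustrative.

```python
# Gradient-descent value prediction with a linear approximator (Chapter 9):
# v_hat(s, w) = w . x(s), updated by the semi-gradient TD(0) rule
# w <- w + alpha * (r + gamma * v_hat(s') - v_hat(s)) * x(s).

def v_hat(w, x):
    """Linear value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def semi_gradient_td0(w, x, r, x2, alpha=0.5, gamma=1.0):
    """TD error times the feature vector is the (semi-)gradient step."""
    delta = r + gamma * v_hat(w, x2) - v_hat(w, x)
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
x_s, x_s2 = [1.0, 0.0], [0.0, 1.0]  # one-hot features for two states
w = semi_gradient_td0(w, x_s, r=1.0, x2=x_s2)
print(w)  # TD error is 1.0, so only the active feature's weight moves: [0.5, 0.0]
```

With one-hot features this reduces exactly to the tabular TD(0) update; richer features like tiles or RBFs let many states share each weight, which is what makes the method scale.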
🔹 Lesson 10: Biological Foundations, Applications, and Prospects (Ch. 12-15)
Overview: This lesson explores the interdisciplinary convergence of Reinforcement Learning (RL), bridging the gap between biological intelligence (psychology and neuroscience) and complex engineering applications. Students will analyze how temporal-difference (TD) methods scale from classic games like Backgammon to high-dimensional industrial challenges like elevator dispatching, telecommunications, and job-shop scheduling, concluding with a vision for the future of the field.
Learning Outcomes:
- Identify the parallels between RL algorithms and biological learning mechanisms in psychology and neuroscience.
- Analyze the architecture and representation strategies used in classic RL successes such as TD-Gammon and the Acrobot task.
- Formulate real-world stochastic control problems using Semi-Markov Decision Processes (SMDP) in continuous-time environments.