
Commit beca9a0

Add section for Reinforcement Learning (#200)
This PR adds a new category for Reinforcement Learning with articles on PPO, DQN, Key Concepts, and a taxonomy of RL algorithms. It also includes the necessary navigation and index file updates.

Co-authored-by: Nevin Valsaraj <nevin.valsaraj32@gmail.com>
1 parent 83d0b9f commit beca9a0

9 files changed

Lines changed: 586 additions & 0 deletions

_data/navigation.yml

Lines changed: 11 additions & 0 deletions
@@ -203,6 +203,17 @@ wiki:
       url: /wiki/machine-learning/distributed-training-with-pytorch/
     - title: Imitation Learning With a Focus on Humanoids
       url: /wiki/machine-learning/imitation-learning/
+  - title: Reinforcement Learning
+    url: /wiki/reinforcement-learning/
+    children:
+      - title: Key Concepts in Reinforcement Learning (RL)
+        url: /wiki/reinforcement-learning/key-concepts-in-rl/
+      - title: Reinforcement Learning Algorithms
+        url: /wiki/reinforcement-learning/reinforcement-learning-algorithms/
+      - title: Policy Gradient Methods
+        url: /wiki/reinforcement-learning/intro-to-policy-gradient-methods/
+      - title: Foundation of Value-Based Reinforcement Learning
+        url: /wiki/reinforcement-learning/value-based-reinforcement-learning/
   - title: State Estimation
     url: /wiki/state-estimation/
     children:

assets/images/autpware_blevel.png

-676 KB
Binary file not shown.

Two new image files added (32.6 KB and 13.8 KB); previews not shown.
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
---
date: 2026-04-16
title: Reinforcement Learning
---

Reinforcement Learning (RL) is a subfield of machine learning where agents learn to make decisions by interacting with an environment to maximize cumulative rewards. In robotics, RL is particularly powerful for complex control tasks, navigation, and manipulation where traditional control methods may be difficult to design.

This section provides a comprehensive guide to reinforcement learning, covering fundamental concepts, a taxonomy of popular algorithms, and in-depth tutorials on specific methods such as Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO).

## Key Subsections and Highlights

- **[Key Concepts in Reinforcement Learning (RL)](/wiki/reinforcement-learning/key-concepts-in-rl/)**
  An introduction to the core components of RL, including agents, environments, states, actions, policies, and rewards. Essential for anyone starting with RL.

- **[A Taxonomy of Reinforcement Learning Algorithms](/wiki/reinforcement-learning/reinforcement-learning-algorithms/)**
  A high-level overview of the RL landscape, categorizing algorithms as model-free vs. model-based and policy optimization vs. Q-learning.

- **[Proximal Policy Optimization (PPO)](/wiki/reinforcement-learning/intro-to-policy-gradient-methods/)**
  A detailed look at PPO, one of the most popular and stable policy gradient methods used in modern robotics and simulation.

- **[Deep Q-Networks (DQN)](/wiki/reinforcement-learning/value-based-reinforcement-learning/)**
  Explores the foundations of value-based reinforcement learning, focusing on the integration of Q-learning with deep neural networks.

## See Also
- [Introduction to Reinforcement Learning](/wiki/machine-learning/intro-to-rl/)
- [Python Libraries for Reinforcement Learning](/wiki/machine-learning/python-libraries-for-reinforcement-learning/)

## Further Reading
- [Spinning Up in Deep RL – OpenAI](https://spinningup.openai.com/en/latest/)
Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
---
date: 2025-05-04
title: "Proximal Policy Optimization (PPO): Concepts, Theory, and Insights"
---

Proximal Policy Optimization (PPO) is one of the most widely used algorithms in modern reinforcement learning. It combines the benefits of policy gradient methods with a set of improvements that make training more stable, sample-efficient, and easy to implement. PPO is used extensively in robotics, gaming, and simulated environments like MuJoCo and OpenAI Gym. This article explains PPO from the ground up: its motivation, theory, algorithmic structure, and practical considerations.

## Motivation

Traditional policy gradient methods suffer from instability due to large, unconstrained policy updates. While they optimize the expected return directly, updates can be so large that they lead to catastrophic performance collapse.

Trust Region Policy Optimization (TRPO) addresses this by constraining the size of each policy update with a KL-divergence bound. However, TRPO is relatively complex to implement because it requires solving a constrained optimization problem using second-order methods.

PPO was designed to simplify this by introducing a clipped surrogate objective that effectively limits how much the policy can change during each update, while retaining the benefits of trust-region-like behavior.

## PPO Objective

Let the old policy be $\pi_{\theta_{\text{old}}}$ and the new policy be $\pi_\theta$. PPO maximizes the following clipped surrogate objective:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[
\min\left(
r_t(\theta) \hat{A}_t,\;
\text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t
\right)
\right]
$$

where:

- $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio,
- $\hat{A}_t$ is the advantage estimate at time step $t$,
- $\epsilon$ is a small hyperparameter (e.g., 0.1 or 0.2).
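
A minimal PyTorch sketch of this objective (a sketch, not a library API; tensor names such as `logp_new`, `logp_old`, and `adv` are illustrative):

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, adv, eps=0.2):
    """Negative L^CLIP, suitable for gradient descent.

    logp_new: log pi_theta(a_t|s_t) under the current policy
    logp_old: log pi_theta_old(a_t|s_t), detached from the graph
    adv:      advantage estimates A_hat_t
    """
    ratio = torch.exp(logp_new - logp_old)          # r_t(theta)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)  # restrict to [1-eps, 1+eps]
    # Elementwise min of unclipped and clipped terms, averaged over the batch;
    # negated because optimizers minimize.
    return -torch.min(ratio * adv, clipped * adv).mean()
```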

### Why Clipping?

Without clipping, large changes in the policy could lead to very large or very small values of $r_t(\theta)$, resulting in destructive updates. The **clip** operation ensures that updates do not push the new policy too far from the old one.

This introduces a **soft trust region**: when $r_t(\theta)$ is within $[1 - \epsilon, 1 + \epsilon]$, the update proceeds normally. If $r_t(\theta)$ exceeds this range, the objective is "flattened", preventing further change.

## Full PPO Objective

In practice, PPO combines multiple objectives:

- **Clipped policy loss** (as above)
- **Value function loss**: typically a mean squared error between the predicted value and the empirical return
- **Entropy bonus**: to encourage exploration

The full loss function is:

$$
L^{\text{PPO}}(\theta) =
\mathbb{E}_t \left[
L^{\text{CLIP}}(\theta)
- c_1 \cdot (V_\theta(s_t) - \hat{V}_t)^2
+ c_2 \cdot \mathcal{H}[\pi_\theta](s_t)
\right]
$$

where:

- $c_1$ and $c_2$ are weighting coefficients,
- $\hat{V}_t$ is an empirical return (or bootstrapped target),
- $\mathcal{H}[\pi_\theta]$ is the entropy of the policy at state $s_t$.
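
A sketch of how the three terms combine, reusing `clipped_surrogate_loss` from the sketch above; `values`, `returns`, and `dist` (the current action distribution) are illustrative names, not from a particular library:

```python
def ppo_loss(logp_new, logp_old, adv, values, returns, dist,
             eps=0.2, c1=0.5, c2=0.01):
    policy_loss = clipped_surrogate_loss(logp_new, logp_old, adv, eps)
    value_loss = ((values - returns) ** 2).mean()  # (V_theta(s_t) - V_hat_t)^2
    entropy = dist.entropy().mean()                # H[pi_theta](s_t)
    # Note the signs: we minimize, so the entropy bonus enters negatively.
    return policy_loss + c1 * value_loss - c2 * entropy
```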
65+
66+
## Advantage Estimation
67+
68+
PPO relies on high-quality advantage estimates $\hat{A}_t$ to guide policy updates. The most popular technique is **Generalized Advantage Estimation (GAE)**:
69+
70+
$$
71+
\hat{A}_t = \sum_{l=0}^{T - t - 1} (\gamma \lambda)^l \delta_{t+l}
72+
$$
73+
74+
with:
75+
76+
$$
77+
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
78+
$$
79+
80+
GAE balances the bias-variance trade-off via the $\lambda$ parameter (typically 0.95).
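
A small sketch of GAE over a single rollout, computed as the usual backward recursion over the TD residuals (episode-boundary masking omitted for brevity; function and argument names are illustrative):

```python
import torch

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE over one rollout of length T.

    rewards:    tensor of shape (T,)
    values:     V(s_0..s_{T-1}), shape (T,)
    last_value: bootstrap value V(s_T) for the state after the rollout
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    next_value, gae = last_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        gae = delta + gamma * lam * gae                      # recursive GAE sum
        adv[t] = gae
        next_value = values[t]
    return adv
```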
81+
82+
## PPO Training Loop Overview
83+
84+
At a high level, PPO training proceeds in the following way:
85+
86+
1. **Collect rollouts** using the current policy for a fixed number of steps.
87+
2. **Compute advantages** using GAE.
88+
3. **Compute returns** for value function targets.
89+
4. **Optimize the PPO objective** with multiple minibatch updates (typically using Adam).
90+
5. **Update the old policy** to match the new one.
91+
92+
Unlike TRPO, PPO allows **multiple passes through the same data**, improving sample efficiency.
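
A skeleton of steps 1–5 under stated assumptions: `policy` (an actor-critic module returning an action distribution and value estimates), `collect_rollouts`, and the `batch` fields are hypothetical names, while `compute_gae` and `ppo_loss` refer to the sketches above.

```python
import torch

num_iterations = 500           # illustrative
num_steps, minibatch = 2048, 64
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

for it in range(num_iterations):
    batch = collect_rollouts(policy, num_steps)                       # step 1
    adv = compute_gae(batch.rewards, batch.values, batch.last_value)  # step 2
    returns = adv + batch.values                                      # step 3
    logp_old = batch.logp.detach()   # freeze pi_old for the ratio
    for epoch in range(10):                                           # step 4
        for idx in torch.randperm(num_steps).split(minibatch):
            dist, values = policy(batch.obs[idx])
            loss = ppo_loss(dist.log_prob(batch.actions[idx]),
                            logp_old[idx], adv[idx],
                            values, returns[idx], dist)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Step 5 is implicit: logp_old stays fixed during the epochs, so the
    # next rollout is collected with the freshly updated policy.
    # (Rollout tensors are assumed to be stored without gradients.)
```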
93+
94+
## Practical Tips
95+
96+
- **Clip epsilon**: Usually 0.1 or 0.2. Too large allows harmful updates; too small restricts learning.
97+
- **Number of epochs**: PPO uses multiple SGD epochs (3–10) per batch.
98+
- **Batch size**: Typical values range from 2048 to 8192.
99+
- **Value/policy loss scales**: The constants $c_1$ and $c_2$ are often 0.5 and 0.01 respectively.
100+
- **Normalize advantages**: Empirically improves stability.
101+
102+
> **Entropy Bonus**: Without sufficient entropy, the policy may prematurely converge to a suboptimal deterministic strategy.
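
These tips, collected into one illustrative configuration dictionary (values taken from the list above; `gamma` is a common default rather than a value from this article, and the key names are arbitrary):

```python
ppo_config = {
    "clip_epsilon": 0.2,       # clip range; 0.1-0.2 typical
    "epochs_per_batch": 10,    # SGD passes over each rollout (3-10)
    "batch_size": 2048,        # rollout length (2048-8192)
    "value_coef": 0.5,         # c1, value-loss weight
    "entropy_coef": 0.01,      # c2, entropy-bonus weight
    "normalize_advantages": True,
    "gae_lambda": 0.95,        # lambda from the GAE section
    "gamma": 0.99,             # assumed common default discount
}
```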

## Why PPO Works Well

- **Stable updates**: Clipping constrains updates to a trust region without expensive computations.
- **On-policy training**: Leads to high-quality updates at the cost of sample reuse.
- **Good performance across domains**: PPO performs well in continuous control, discrete games, and real-world robotics.
- **Simplicity**: Easy to implement and debug compared to TRPO, ACER, or DDPG.

## PPO vs TRPO

| Feature                   | PPO                             | TRPO                            |
|---------------------------|---------------------------------|---------------------------------|
| Optimizer                 | First-order (SGD/Adam)          | Second-order (constrained step) |
| Trust region enforcement  | Clipping                        | Explicit KL constraint          |
| Sample efficiency         | Moderate                        | Low                             |
| Stability                 | High                            | Very high                       |
| Implementation            | Simple                          | Complex                         |

## Limitations

- **On-policy nature** means PPO discards data after each update.
- **Entropy decay** can hurt long-term exploration unless tuned carefully.
- **Not optimal for sparse-reward environments** without additional exploration strategies (e.g., curiosity, count-based bonuses).

## PPO in Robotics

PPO has become a standard in sim-to-real training workflows:

- Robust to partial observability
- Easy to stabilize on real robots
- Compatible with parallel simulation (e.g., Isaac Gym, MuJoCo)

## Summary

PPO offers a clean and reliable solution for training RL agents with policy gradient methods. Its clipped objective balances learning speed against policy stability. PPO is widely regarded as a default choice for continuous control tasks and has proven effective across a broad range of applications.

## Further Reading
- [Proximal Policy Optimization Algorithms – Schulman et al. (2017)](https://arxiv.org/abs/1707.06347)
- [Spinning Up PPO Overview – OpenAI](https://spinningup.openai.com/en/latest/algorithms/ppo.html)
- [CleanRL PPO Implementation](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_continuous_action.py)
- [RL Course Lecture on PPO – UC Berkeley CS285](https://rail.eecs.berkeley.edu/deeprlcourse/)
- [OpenAI Baselines PPO2 Examples](https://github.com/openai/baselines/tree/master/baselines/ppo2)
- [Generalized Advantage Estimation (GAE) – Schulman et al. (2015)](https://arxiv.org/abs/1506.02438)
- [DeepRL-Agents – Arthur Juliani](https://github.com/awjuliani/DeepRL-Agents)
- [Deep Reinforcement Learning Hands-On (PPO chapter)](https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On)
- [Stable Baselines3 PPO Documentation](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html)
- [Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO – OpenReview](https://openreview.net/forum?id=r1etN1rtPB)
- [Deep Reinforcement Learning: An Overview – Li (2017)](https://arxiv.org/abs/1701.07274)
- [Hugging Face Deep RL Course – PPO Unit](https://huggingface.co/learn/deep-rl-course/unit2/ppo)
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
---
date: 2025-03-11 # YYYY-MM-DD
title: Key Concepts of Reinforcement Learning
---

This tutorial provides an introduction to the fundamental concepts of Reinforcement Learning (RL). RL involves an agent interacting with an environment to learn optimal behaviors through trial and feedback. The main objective of RL is to maximize cumulative rewards over time.

## Main Components of Reinforcement Learning

### Agent and Environment
The agent is the learner or decision-maker, while the environment represents everything the agent interacts with. The agent receives observations from the environment and takes actions that influence the environment's state.

![Humanoid robot agent interacting with its environment](/assets/images/humanoid-robot.drawio.png)
{: .full}

### States and Observations
- A **state** ($s$) fully describes the world at a given moment.
- An **observation** ($o$) is a partial view of the state.
- Environments can be **fully observed** (complete information) or **partially observed** (limited information).

### Action Spaces
- The **action space** defines all possible actions an agent can take; a short example follows this list.
- **Discrete action spaces** (e.g., Atari, Go) have a finite number of actions.
- **Continuous action spaces** (e.g., robotics control) allow real-valued actions.
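
A brief illustration using Gymnasium (assuming the `gymnasium` package is installed):

```python
import gymnasium as gym

discrete_env = gym.make("CartPole-v1")    # finite actions: push left or right
print(discrete_env.action_space)          # Discrete(2)

continuous_env = gym.make("Pendulum-v1")  # real-valued torque command
print(continuous_env.action_space)        # Box(-2.0, 2.0, (1,), float32)
```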
25+
26+
## Policies
27+
A **policy** determines how an agent selects actions based on states:
28+
29+
- **Deterministic policy**: Always selects the same action for a given state.
30+
31+
$a_t = \mu(s_t)$
32+
33+
- **Stochastic policy**: Samples actions from a probability distribution.
34+
35+
$a_t \sim \pi(\cdot | s_t)$
36+
37+
38+
### Example: Deterministic Policy in PyTorch
39+
```python
40+
import torch.nn as nn
41+
42+
pi_net = nn.Sequential(
43+
nn.Linear(obs_dim, 64),
44+
nn.Tanh(),
45+
nn.Linear(64, 64),
46+
nn.Tanh(),
47+
nn.Linear(64, act_dim)
48+
)
49+
```
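
A stochastic counterpart is sketched below: the same network now produces logits for a categorical distribution over discrete actions, from which actions are sampled (`pi_net` and `obs_dim` come from the example above):

```python
import torch
from torch.distributions import Categorical

obs = torch.randn(1, obs_dim)      # a dummy observation
logits = pi_net(obs)               # network outputs action scores
dist = Categorical(logits=logits)  # pi(.|s_t) as a distribution
action = dist.sample()             # a_t ~ pi(.|s_t)
log_prob = dist.log_prob(action)   # useful later for policy gradients
```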

## Trajectories
A **trajectory** ($\tau$) is a sequence of states and actions:

$$
\tau = (s_0, a_0, s_1, a_1, \dots)
$$

State transitions follow deterministic or stochastic rules:

$$
s_{t+1} = f(s_t, a_t)
$$

or

$$
s_{t+1} \sim P(\cdot|s_t, a_t)
$$
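
A short sketch of collecting one trajectory in Gymnasium (any environment with the standard `reset`/`step` API would work; a random policy is used purely for illustration):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()

trajectory = []                         # will hold the (s_t, a_t) sequence
done = False
while not done:
    action = env.action_space.sample()  # random action, stands in for a policy
    next_obs, reward, terminated, truncated, info = env.step(action)
    trajectory.append((obs, action))
    obs = next_obs
    done = terminated or truncated
```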
64+
65+
## Reward and Return
66+
The **reward function ($R$)** determines the agent's objective:
67+
$$
68+
r_t = R(s_t, a_t, s_{t+1})
69+
$$
70+
### Types of Return
71+
1. **Finite-horizon undiscounted return**:
72+
$$
73+
R(\tau) = \sum_{t=0}^T r_t
74+
$$
75+
2. **Infinite-horizon discounted return**:
76+
$$
77+
R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t
78+
$$
79+
where $\gamma$ (discount factor) balances immediate vs. future rewards.
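
Both returns expressed as small Python functions (a sketch; `gamma=0.99` is just an example value):

```python
def undiscounted_return(rewards):
    # Finite-horizon: plain sum of rewards along the trajectory
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    # Sum of gamma^t * r_t; gamma < 1 weights early rewards more heavily
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```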

## Summary
This tutorial introduced fundamental RL concepts, including agents, environments, policies, action spaces, trajectories, and rewards. These components are essential for designing RL algorithms.

## Further Reading
- Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction*.

## References
- [Reinforcement Learning – Wikipedia](https://en.wikipedia.org/wiki/Reinforcement_learning)
