Materials (based on practical_rl course)
- Slides
- Video lecture by D. Silver - https://www.youtube.com/watch?v=KHZVXao4qXs
- Our lecture, seminar
- Alternative lecture by J. Schulman part 1 - https://www.youtube.com/watch?v=BB-BhTn6DCM
- Alternative lecture by J. Schulman part 2 - https://www.youtube.com/watch?v=Wnl-Qh2UHGg
- Part 0 (not graded) - intro to the gym(nasium) interface
- Part 1 (5 points) - implement REINFORCE with a neural network agent
- Part 2 (5-10 points) - optional advanced homework: implement either A2C OR PPO.
  - A2C aka Advantage Actor Critic (5 points) - a2c-optional.ipynb
  - PPO aka Proximal Policy Optimization (10 points) - ppo.ipynb
If you choose to do PPO, you don't need to submit A2C; submitting it as well awards no extra points, since PPO extends A2C. So do either (reinforce -> a2c) for up to 10 points OR (reinforce -> ppo) for up to 15 points.
If you choose PPO, we recommend additional materials; pick one of:
- Text materials (English): https://spinningup.openai.com/en/latest/algorithms/ppo.html
- Our videos (Russian): lecture, seminar (PyTorch)
- A full-term course on reinforcement learning - practical_rl
- Actually proving the policy gradient for discounted rewards - article
- On the variance of policy gradient and optimal baselines: article, another article
- Generalized Advantage Estimation - a way you can speed up training for homework_*.ipynb - article
- Generalizing the log-derivative trick - url
- Combining policy gradient and q-learning - arxiv
- Bayesian perspective on why reparameterization & log-derivative tricks matter (Vetrov's take) - pdf
- Adversarial review of policy gradient - blog
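
As a rough illustration of the Generalized Advantage Estimation item above, here is a minimal pure-Python sketch of the usual backward recursion over one trajectory. The function name `compute_gae`, its arguments, and the default `gamma` and `lam` values are assumptions made for this example and are not taken from the homework notebooks; it also assumes the trajectory contains no episode boundary, with `last_value` being the critic's estimate for the state after the final step (0.0 if the episode ended there).

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Hypothetical helper: GAE advantages for a single trajectory.

    rewards[t]  - reward received after step t
    values[t]   - critic estimate V(s_t)
    last_value  - critic estimate for the state after the last step
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk the trajectory backwards, accumulating discounted TD errors.
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    # Toy numbers, chosen only to exercise the function.
    print(compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=0.5, lam=1.0))
```

With `lam=1` the estimate reduces to the discounted return minus the value baseline, and with `lam=0` it reduces to the one-step TD error; intermediate values trade bias against variance.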