- Slides (pdf): click here
- Russian materials: lecture, hw overview
- English materials: LLM efficiency overview | LLM speculative decoding
Note to HSE students: this week's assignment is more advanced (read "harder") than usual in terms of the engineering required. You don't have to pass every assignment as long as your point total is enough for the grade.
- Max Ryabinin's sister course on efficient DL: https://github.com/mryab/efficient-dl-systems
- Efficient training (mixed precision, distributed training, etc.): https://www.youtube.com/watch?v=UVX7SYGCKkA
- A rather detailed overview of DL efficiency: https://alexzhang13.github.io/blog/2024/efficient-dl/
- GPU MODE Lecture 14: Practitioners Guide to Triton
- Flash-Decoding for long-context inference
- Deep Dive on the Hopper TMA Unit for FP8 GEMMs
- Persistent Matmul
- Matrix Multiplication Background User's Guide
- Deep Dive on CUTLASS Ping-Pong GEMM Kernel
- Accelerating 2D Dynamic Block Quantized Float8 GEMMs in Triton
- SmoothQuant paper
- SmoothQuant repo