week12_inference

Slides (pdf): click here
Russian materials: lecture, hw overview
English materials: LLM efficiency overview | LLM speculative decoding

Practice

./practice.ipynb :

Note to HSE students: this week's assignment is more advanced (read: harder) than usual in terms of the engineering required. You don't have to pass every assignment, as long as your point total is enough for the grade.

Extra materials:

Max Ryabinin's sister course on efficient DL: https://github.com/mryab/efficient-dl-systems
Efficient training: mixed precision, distributed, etc: https://www.youtube.com/watch?v=UVX7SYGCKkA
A rather detailed overview of DL efficiency: https://alexzhang13.github.io/blog/2024/efficient-dl/
GPU MODE Lecture 14: Practitioners Guide to Triton
Flash-Decoding for long-context inference
Deep Dive on the Hopper TMA Unit for FP8 GEMMs
Persistent Matmul
Matrix Multiplication Background User's Guide
Deep Dive on CUTLASS Ping-Pong GEMM Kernel
Accelerating 2D Dynamic Block Quantized Float8 GEMMs in Triton
SmoothQuant paper
SmoothQuant repo
