This tutorial focuses on the increasingly important topic of Efficient Inference for LLMs and aims to provide a systematic understanding of the key facts and methodologies from a designer's perspective. We start by introducing the fundamental concepts and mechanisms of modern LLMs, along with the relevant software and hardware. Following this, we formally define the efficiency optimization problem. To equip the audience with a designer's mindset, we will explain how to diagnose efficiency bottlenecks for a given workload on specific hardware. In particular, we will demonstrate how to use the theoretical roofline model and the NVIDIA toolchain to identify these bottlenecks.
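As a taste of this diagnosis step, the minimal sketch below compares a workload's arithmetic intensity against the hardware ridge point to decide whether it is compute- or memory-bound. The peak compute and bandwidth numbers are illustrative assumptions (roughly an NVIDIA A100), not values from the tutorial itself.

```python
# A minimal roofline-model sketch. The default peak numbers are illustrative assumptions
# (roughly an NVIDIA A100: ~312 TFLOPS FP16 compute, ~2.0 TB/s HBM bandwidth); substitute
# the specs of your own hardware and the FLOP/byte counts of your own workload.

def roofline_check(flops: float, bytes_moved: float,
                   peak_flops: float = 312e12, peak_bandwidth: float = 2.0e12) -> str:
    """Classify a workload as compute- or memory-bound under the roofline model."""
    intensity = flops / bytes_moved                    # arithmetic intensity (FLOPs per byte)
    ridge_point = peak_flops / peak_bandwidth          # intensity where the two roofs meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    bound = "compute-bound" if intensity >= ridge_point else "memory-bound"
    return f"intensity {intensity:.2f} FLOPs/B, attainable {attainable / 1e12:.1f} TFLOPS, {bound}"

# Example: a single-token (batch-1) decode step through one 4096x4096 FP16 weight matrix
# performs ~2*4096*4096 FLOPs while reading ~2*4096*4096 bytes of weights -> memory-bound.
print(roofline_check(flops=2 * 4096 * 4096, bytes_moved=2 * 4096 * 4096))
```

In practice, the same question can also be answered empirically with profilers from the NVIDIA toolchain, such as Nsight Systems and Nsight Compute.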
With these tools at our disposal, we will begin with a conceptual analysis of the key factors contributing to inefficiency, namely the autoregressive sampling scheme, the model size, and the core attention operator. Next, we will introduce our full-stack taxonomy of efficient inference methods for LLMs, which classifies them into algorithm-, model-, and system-level methods. (1) Algorithm-level optimization includes efficient decoding methods, input compression methods, and alternative generative paradigms beyond the autoregressive model. (2) Model-level optimization designs efficient model structures or reduces model-level redundancy, either statically or dynamically. (3) System-level optimization improves the inference engine or the serving system without altering the model's computation graph. We will walk through each category, using one to three representative methods as examples for each leaf subcategory, and elaborate on the design logic behind each method and the inefficiency factors it primarily addresses. Finally, we will wrap up with a few demonstrations, a takeaway summary, and future research directions.
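To make the first of these inefficiency factors concrete, the sketch below shows why autoregressive decoding is inherently sequential: each new token requires one more forward pass conditioned on everything generated so far. The `model` and `sample` callables are hypothetical placeholders, not any specific library's API.

```python
# A minimal, framework-agnostic sketch of autoregressive decoding; `model` and `sample`
# are hypothetical placeholders standing in for a real LLM forward pass and a sampler.

def generate(model, sample, prompt_ids, max_new_tokens, eos_id):
    tokens = list(prompt_ids)
    kv_cache = None                          # cached keys/values of all previous tokens
    for _ in range(max_new_tokens):
        # Only the newest token is fed in; the KV cache supplies the earlier context,
        # but the cache itself grows linearly with the sequence length.
        logits, kv_cache = model(tokens[-1:], kv_cache)
        next_token = sample(logits)          # e.g., greedy, top-k, or top-p sampling
        tokens.append(next_token)            # each step depends on the previous one,
        if next_token == eos_id:             # so the steps cannot be parallelized
            break
    return tokens
```

This loop is what many of the optimizations in the taxonomy target: reducing the number of sequential steps, shrinking the per-step weight and KV-cache traffic, or serving many such loops efficiently at once.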
Our tutorial will be held in A104–A105, 09:00–12:30, Saturday, November 8, 2025.
| Time | Section | Presenter |
|---|---|---|
| 9:00–9:45 (45 min) | Part I: Background, Preliminary, Problem Definition and Analysis, Practical Pipeline | Ning, Xuefei |
| 9:45–10:30 (45 min) | Part II: Model-Level Optimization | Hou, Lu |
| 10:30–11:00 (30 min) | Coffee Break | |
| 11:00–11:40 (40 min) | Part III: System-Level Optimization | Dai, Guohao |
| 11:40–12:20 (40 min) | Part IV: Algorithm-Level Optimization, Conclusion | Bai, Haoli |
| 12:20–12:30 (10 min) | Q&A Session | |
A more comprehensive list of papers is provided in our survey (A Survey on Efficient Inference for Large Language Models); below, we list the papers that are referenced or discussed in this tutorial.