This tutorial focuses on the increasingly important topic of Efficient Inference for LLMs and aims to provide a systematic understanding of the key facts and methodologies from a designer's perspective. We start by introducing the fundamental concepts and mechanisms of modern LLMs, along with the relevant software and hardware. Following this, we formally define the efficiency optimization problem. To equip the audience with a designer's mindset, we will explain how to diagnose efficiency bottlenecks for a given workload on specific hardware. In particular, we will demonstrate how to use the theoretical roofline model and the NVIDIA toolchain to identify these bottlenecks.
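As a taste of this diagnosis step, the minimal sketch below compares a workload's arithmetic intensity against the hardware ridge point to decide whether it is compute- or memory-bound. The peak compute and bandwidth numbers are illustrative assumptions (roughly an NVIDIA A100), not values from the tutorial itself.

```python
# A minimal roofline-model sketch. The default peak numbers are illustrative assumptions
# (roughly an NVIDIA A100: ~312 TFLOPS FP16 compute, ~2.0 TB/s HBM bandwidth); substitute
# the specs of your own hardware and the FLOP/byte counts of your own workload.

def roofline_check(flops: float, bytes_moved: float,
                   peak_flops: float = 312e12, peak_bandwidth: float = 2.0e12) -> str:
    """Classify a workload as compute- or memory-bound under the roofline model."""
    intensity = flops / bytes_moved                    # arithmetic intensity (FLOPs per byte)
    ridge_point = peak_flops / peak_bandwidth          # intensity where the two roofs meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    bound = "compute-bound" if intensity >= ridge_point else "memory-bound"
    return f"intensity {intensity:.2f} FLOPs/B, attainable {attainable / 1e12:.1f} TFLOPS, {bound}"

# Example: a single-token (batch-1) decode step through one 4096x4096 FP16 weight matrix
# performs ~2*4096*4096 FLOPs while reading ~2*4096*4096 bytes of weights -> memory-bound.
print(roofline_check(flops=2 * 4096 * 4096, bytes_moved=2 * 4096 * 4096))
```

In practice, the same question can also be answered empirically with profilers from the NVIDIA toolchain, such as Nsight Systems and Nsight Compute.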
With these tools at our disposal, we will begin with a conceptual analysis of the key factors contributing to inefficiency, namely the autoregressive sampling scheme, the model size, and the core attention operator. Next, we will introduce our full-stack taxonomy of efficient inference methods for LLMs, which classifies them into algorithm-, model-, and system-level methods. (1) Algorithm-level optimization includes efficient decoding methods, input compression methods, and alternative generative paradigms beyond the autoregressive model. (2) Model-level optimization designs efficient model structures or reduces model-level redundancy, either statically or dynamically. (3) System-level optimization improves the inference engine or the serving system without altering the model's computation graph. We will walk through each category, using one to three representative methods as examples for each leaf subcategory, and elaborate on the design logic behind each method and the inefficiency factors it primarily addresses. Finally, we will wrap up with a few demonstrations, a takeaway summary, and future research directions.
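To make the first of these inefficiency factors concrete, the sketch below shows why autoregressive decoding is inherently sequential: each new token requires one more forward pass conditioned on everything generated so far. The `model` and `sample` callables are hypothetical placeholders, not any specific library's API.

```python
# A minimal, framework-agnostic sketch of autoregressive decoding; `model` and `sample`
# are hypothetical placeholders standing in for a real LLM forward pass and a sampler.

def generate(model, sample, prompt_ids, max_new_tokens, eos_id):
    tokens = list(prompt_ids)
    kv_cache = None                          # cached keys/values of all previous tokens
    for _ in range(max_new_tokens):
        # Only the newest token is fed in; the KV cache supplies the earlier context,
        # but the cache itself grows linearly with the sequence length.
        logits, kv_cache = model(tokens[-1:], kv_cache)
        next_token = sample(logits)          # e.g., greedy, top-k, or top-p sampling
        tokens.append(next_token)            # each step depends on the previous one,
        if next_token == eos_id:             # so the steps cannot be parallelized
            break
    return tokens
```

This loop is what many of the optimizations in the taxonomy target: reducing the number of sequential steps, shrinking the per-step weight and KV-cache traffic, or serving many such loops efficiently at once.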
Our tutorial will be held in A104–A105, 09:00–12:30, Saturday, November 8, 2025.
| Time | Section | Presenter |
|---|---|---|
| 9:00–9:45 (45 min) | Part I: Background, Preliminary, Problem Definition and Analysis, Practical Pipeline | Ning, Xuefei |
| 9:45–10:30 (45 min) | Part II: Model-Level Optimization | Hou, Lu |
| 10:30–11:00 (30 min) | Coffee Break | |
| 11:00–11:40 (40 min) | Part III: System-Level Optimization | Dai, Guohao |
| 11:40–12:20 (40 min) | Part IV: Algorithm-Level Optimization, Conclusion | Bai, Haoli |
| 12:20–12:30 (10 min) | Q&A Session | |
A more comprehensive list of papers is provided in our survey (A Survey on Efficient Inference for Large Language Models); below, we list the papers that are referenced or discussed in this tutorial.