EMNLP 2025 Tutorial:
Efficient Inference for Large Language Models – Algorithm, Model, and System

Tsinghua University, Shanghai Jiao Tong University, Huawei Technologies, Infinigence AI

09:00 - 12:30 Tutorial 1

Saturday, November 8th, 2025.

Room : A104–A105
Slides for this tutorial: [Slides].

About this tutorial

This tutorial focuses on the increasingly important topic of Efficient Inference for LLMs and aims to provide a systematic understanding of the key facts and methodologies from a designer's perspective. We start by introducing the fundamental concepts and mechanisms of modern LLMs, along with the relevant software and hardware. We then formally define the efficiency optimization problem. To equip the audience with a designer's mindset, we explain how to diagnose efficiency bottlenecks for a given workload on specific hardware. In particular, we demonstrate how to use the theoretical roofline model and the NVIDIA toolchain to identify these bottlenecks.
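As a concrete preview of this diagnosis step, below is a minimal sketch that applies the roofline model to a single weight matrix multiplication. The peak-compute and bandwidth numbers are assumptions (roughly an NVIDIA A100-80GB with FP16 tensor cores); substitute the specifications of your own hardware.

```python
# Minimal roofline check for a weight matrix multiplication in LLM inference.
# The hardware numbers are ASSUMPTIONS (roughly an NVIDIA A100-80GB SXM with
# FP16 tensor cores); replace them with your own GPU's specifications.

PEAK_FLOPS = 312e12                  # assumed peak FP16 throughput, FLOP/s
PEAK_BW = 2.0e12                     # assumed peak HBM bandwidth, byte/s
RIDGE = PEAK_FLOPS / PEAK_BW         # ridge point, FLOP/byte (here: 156)

def roofline(flops: float, bytes_moved: float) -> str:
    """Classify an operator by its arithmetic intensity (FLOP per byte)."""
    intensity = flops / bytes_moved
    bound = "memory-bound" if intensity < RIDGE else "compute-bound"
    attainable = min(PEAK_FLOPS, intensity * PEAK_BW)   # roofline upper bound
    return f"{intensity:6.1f} FLOP/B, {bound}, <= {attainable / 1e12:.0f} TFLOP/s"

d = 4096                             # hidden size of a hypothetical model
for batch in (1, 256):               # batch-1 decoding vs. large-batch prefill
    flops = 2 * batch * d * d                          # one (b x d) @ (d x d) GEMM
    bytes_moved = 2 * (batch * d + d * d + batch * d)  # FP16: 2 bytes/element
    print(f"batch={batch:4d}: {roofline(flops, bytes_moved)}")
```

Under these assumed numbers, batch-1 decoding lands at about 1 FLOP/byte, far below the ridge point of 156 FLOP/byte (memory-bound), whereas the batch-256 case crosses it (compute-bound); the tutorial shows how to confirm such diagnoses with the NVIDIA profiling toolchain.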

With these tools at our disposal, we begin with a conceptual analysis of the key factors contributing to inefficiency, namely the autoregressive sampling scheme, the model size, and the core attention operator. Next, we introduce our full-stack taxonomy of efficient inference methods for LLMs, which classifies them into algorithm-, model-, and system-level methods. (1) Algorithm-level optimization includes efficient decoding methods, input compression methods, and alternative generative paradigms beyond the autoregressive model. (2) Model-level optimization designs efficient model structures or cuts down model-level redundancy, either statically or dynamically. (3) System-level optimization improves the inference engine or the serving system without altering the model's computation graph. We walk through each category, using one to three representative methods as examples for each leaf subcategory and elaborating on the design logic behind each method and which inefficiency factors it primarily addresses. Finally, we wrap up with a few demonstrations, a takeaway summary, and future research directions.
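To make the first two inefficiency factors tangible, here is a minimal toy sketch of one autoregressive decode step with a KV cache (PyTorch, single head, no batching; all names are illustrative, not a real model): every generated token requires re-reading the full weight matrices, and the cached keys and values, and hence the attention cost, grow linearly with the sequence length.

```python
import torch

def decode_step(x, W_qkv, W_o, cache):
    """One toy autoregressive decode step (single head, no batching).

    The full weight matrices are read from memory to emit a single token,
    and the KV cache (and hence the attention cost) grows with every step.
    """
    q, k, v = (x @ W_qkv).chunk(3, dim=-1)
    cache["k"].append(k)                        # cache grows linearly in length
    cache["v"].append(v)
    K = torch.stack(cache["k"])                 # (t, d)
    V = torch.stack(cache["v"])                 # (t, d)
    attn = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1) @ V
    return attn @ W_o                           # stand-in for the next input

d = 8                                           # toy hidden size
W_qkv, W_o = torch.randn(d, 3 * d), torch.randn(d, d)
cache = {"k": [], "v": []}
x = torch.randn(d)
for _ in range(5):                              # five sequential decode steps
    x = decode_step(x, W_qkv, W_o, cache)       # steps cannot be parallelized
```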

Schedule

Our tutorial will be held at A104–A105, 09:00 - 12:30, Saturday, November 8, 2025.

Time Section Presenter
9:00 - 9:45 (45 min) Part I: Background, Preliminary, Problem Definition and Analysis, Practical Pipeline Ning, Xuefei
9:45 - 10:30 (45 min) Part II: Model-Level Optimization Hou, Lu
10:30 - 11:00 (30 min) Coffee Break
11:00 - 11:40 (40 min) Part III: System-Level Optimization Dai, Guohao
11:40 - 12:20 (40 min) Part IV: Algorithm-Level Optimization, Conclusion Bai, Haoli
12:20 - 12:30 (10 min) Q&A Session

Reading List

A more comprehensive list of papers is provided in our survey (A Survey on Efficient Inference for Large Language Models). Below, we list the papers that are referenced or discussed in this tutorial.


Part I: Preliminary


Part II: Model-Level Optimization


Part III: System-Level Optimization


Part IV: Algorithm-Level Optimization