AI Serving Infrastructure

Multi-Model GPU Serving

Multi-model GPU serving enables the concurrent execution of hundreds of fine-tuned adapters (LoRAs) over a single, shared base model footprint.

Published June 1, 2026 · By MortalApps · 5 min read · ~929 words

TL;DR

Multi-model GPU serving runs hundreds of fine-tuned LoRA adapters concurrently on top of a single shared copy of the base model.
Its purpose is to minimize VRAM overhead and maximize compute density for heavily customized LLM SaaS deployments.
The two main optimizations are paged memory management for adapter weights (S-LoRA) and Segmented Gather Matrix-Vector (SGMV) kernels (Punica).
The key engineering insight is that the correct LoRA weights can be gathered per request inside one batched matrix multiplication, so requests that use different adapters never have to be un-batched.


Why This Matters

Hosting a dedicated instance of a 70B-parameter model for every customer's fine-tune is prohibitively expensive. Low-Rank Adaptation (LoRA) adds only a small fraction of the base model's weights (often tens of megabytes, depending on the rank and on which layers are adapted), so a multi-model serving architecture can hold the base model in VRAM exactly once. The system then selects the customer-specific adapter on a per-request basis, which means the marginal memory cost of each additional tenant is one small adapter rather than a full model replica. That shift is what makes the unit economics of per-customer fine-tunes workable for AI SaaS.
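The memory argument above can be checked with rough arithmetic. The sketch below is illustrative only: the layer count, hidden size, rank, number of adapted projections, and tenant count are all assumptions chosen for the example, and it treats every adapted projection as square, which real 70B architectures do not strictly satisfy.

```python
# Back-of-the-envelope VRAM comparison. Every size here is an assumption
# picked for illustration, not a measurement of any specific model.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (d_in x rank), B is (rank x d_out)."""
    return rank * (d_in + d_out)


def adapter_bytes(hidden: int, layers: int, rank: int,
                  matrices_per_layer: int = 2, bytes_per_param: int = 2) -> int:
    """Adapter size if `matrices_per_layer` square projections are adapted."""
    per_matrix = lora_params(hidden, hidden, rank)
    return per_matrix * matrices_per_layer * layers * bytes_per_param


GIB = 1024 ** 3
MIB = 1024 ** 2

base_bytes = 70_000_000_000 * 2                     # 70B parameters in fp16
one_adapter = adapter_bytes(hidden=8192, layers=80, rank=8)
tenants = 100                                       # hypothetical tenant count

dedicated = tenants * (base_bytes + one_adapter)    # full replica per tenant
shared = base_bytes + tenants * one_adapter         # one base, many adapters

print(f"one adapter:        {one_adapter / MIB:.1f} MiB")
print(f"dedicated replicas: {dedicated / GIB:.0f} GiB")
print(f"shared base model:  {shared / GIB:.0f} GiB")
```

Under these assumptions the adapter memory grows with the tenant count, but the base model term does not, so the total stays close to the size of one replica.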

Core Intuition

Think of the base LLM as an automobile engine block and the LoRA adapters as different electronic tuning profiles. In conventional serving, the profile is flashed into the block (the adapter is merged into the base weights), so the engine can run only one profile at a time. S-LoRA and Punica remove this restriction, which is a software limitation rather than a hardware one. A single large batch of inputs, coming from different users who want different tuning profiles, passes through the base model together. A specialized CUDA kernel then applies each sequence's own LoRA correction to its hidden states within that same batch, without ever splitting the batch apart.
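The analogy rests on a simple identity: applying a merged weight matrix W + AB to an input gives the same result as applying W and then adding the low-rank correction (xA)B. The pure-Python sketch below checks that identity for a small mixed batch. The function names, adapter names, and dimensions are hypothetical, and the per-row loop only models the arithmetic; a real server would run the base multiplication as one batched GEMM on the GPU.

```python
import random


def matvec(x, M):
    """Row vector x (length n) times matrix M (n x m), giving length m."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def merged_forward(x, W, A, B):
    """'Flashed' profile: fold the adapter into the weights, W' = W + A @ B."""
    W_merged = [[W[i][j] + sum(A[i][k] * B[k][j] for k in range(len(B)))
                 for j in range(len(W[0]))] for i in range(len(W))]
    return matvec(x, W_merged)


def batched_forward(X, W, adapters, adapter_ids):
    """Unmerged: shared base output plus a per-row low-rank correction."""
    out = []
    for x, aid in zip(X, adapter_ids):
        base = matvec(x, W)               # the same W serves every row
        A, B = adapters[aid]              # gather this row's adapter
        delta = matvec(matvec(x, A), B)   # (x @ A) @ B, rank-sized intermediate
        out.append([b + d for b, d in zip(base, delta)])
    return out


rng = random.Random(0)
d_in, d_out, rank = 6, 5, 2
W = rand_matrix(d_in, d_out, rng)
adapters = {name: (rand_matrix(d_in, rank, rng), rand_matrix(rank, d_out, rng))
            for name in ("profile-a", "profile-b", "profile-c")}
adapter_ids = ["profile-b", "profile-a", "profile-b", "profile-c"]
X = rand_matrix(len(adapter_ids), d_in, rng)

mixed = batched_forward(X, W, adapters, adapter_ids)
expected = [merged_forward(x, W, *adapters[aid]) for x, aid in zip(X, adapter_ids)]
max_err = max(abs(m - e) for mr, er in zip(mixed, expected) for m, e in zip(mr, er))
print(f"max difference vs merged weights: {max_err:.2e}")
```

Because the base weights are never modified, rows that use different adapters can share one pass through W, which is the property the batching approach depends on.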

Technical Deep Dive

High-performance multi-LoRA serving rests on two architectural pillars.

The first is memory paging (S-LoRA). Rather than storing every adapter contiguously in VRAM, the system manages adapter weights in pages drawn from a LoraPagePool. Adapters are loaded from host CPU memory into GPU VRAM, and evicted again, as the mix of incoming requests changes, much as a paged KV cache manages attention state.

The second is the SGMV kernel (Punica). The kernel takes a single input tensor holding the hidden states for the entire batch, together with a segment vector that records which contiguous rows of the batch use which adapter. In one kernel launch it gathers the LoRA matrices for each segment and computes the low-rank multiplications for the whole batch, which keeps the GPU's streaming multiprocessors (SMs) busy instead of issuing many small per-adapter launches.
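The two pillars can be sketched together as a single scheduling step: make the batch's adapters resident, build the segment boundaries, then compute the low-rank corrections segment by segment. This is a minimal pure-Python model under stated assumptions, not the S-LoRA or Punica implementation. `AdapterPagePool`, `build_segments`, and `sgmv_reference` are hypothetical names, eviction is plain LRU over whole adapters, and the "kernel" is an ordinary loop that only reproduces the semantics of a segmented gather.

```python
from collections import OrderedDict


class AdapterPagePool:
    """Toy GPU-side pool: a fixed page budget with LRU eviction of adapters."""

    def __init__(self, total_pages, host_store, page_elems):
        self.total_pages = total_pages
        self.host_store = host_store     # adapter_id -> (A, B) in "CPU memory"
        self.page_elems = page_elems     # elements that fit in one page
        self.resident = OrderedDict()    # adapter_id -> pages held, LRU first

    def pages_needed(self, adapter_id):
        A, B = self.host_store[adapter_id]
        elems = len(A) * len(A[0]) + len(B) * len(B[0])
        return -(-elems // self.page_elems)          # ceiling division

    def ensure_resident(self, adapter_id, pinned=()):
        """Load an adapter, evicting LRU adapters that are not in `pinned`."""
        if adapter_id in self.resident:
            self.resident.move_to_end(adapter_id)    # mark as recently used
            return
        need = self.pages_needed(adapter_id)
        while sum(self.resident.values()) + need > self.total_pages:
            victim = next((a for a in self.resident if a not in pinned), None)
            if victim is None:
                raise MemoryError("batch does not fit in the adapter pool")
            del self.resident[victim]
        self.resident[adapter_id] = need


def build_segments(adapter_ids):
    """Group rows by adapter; return (row order, boundaries, adapter per segment)."""
    order = sorted(range(len(adapter_ids)), key=lambda i: adapter_ids[i])
    starts, seg_adapters = [], []
    for pos, row in enumerate(order):
        if not seg_adapters or seg_adapters[-1] != adapter_ids[row]:
            starts.append(pos)
            seg_adapters.append(adapter_ids[row])
    starts.append(len(order))                        # closing boundary
    return order, starts, seg_adapters


def matvec(x, M):
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def sgmv_reference(X, order, starts, seg_adapters, pool):
    """Semantics only: each segment gathers one adapter and applies it to its rows."""
    delta = [None] * len(X)
    for s, aid in enumerate(seg_adapters):
        assert aid in pool.resident, "adapter must be paged in before launch"
        A, B = pool.host_store[aid]      # a real kernel would read GPU pages
        for pos in range(starts[s], starts[s + 1]):
            row = order[pos]
            delta[row] = matvec(matvec(X[row], A), B)
    return delta


host = {
    "tenant-a": ([[1.0], [0.0]], [[1.0, 1.0]]),      # A is 2x1, B is 1x2
    "tenant-b": ([[0.0], [2.0]], [[1.0, -1.0]]),
    "tenant-c": ([[1.0], [1.0]], [[0.5, 0.5]]),
}
pool = AdapterPagePool(total_pages=2, host_store=host, page_elems=4)
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
adapter_ids = ["tenant-b", "tenant-a", "tenant-b"]

for aid in sorted(set(adapter_ids)):
    pool.ensure_resident(aid, pinned=set(adapter_ids))
order, starts, seg_adapters = build_segments(adapter_ids)
delta = sgmv_reference(X, order, starts, seg_adapters, pool)

pool.ensure_resident("tenant-c")   # a later batch: evicts the LRU adapter
```

Sorting rows by adapter is one simple way to obtain contiguous segments. The corrections are written back in the original row order, so the caller can add them directly to the base model's output for the same batch.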

Key Takeaways

Multi-LoRA serving successfully decouples the massive base model memory footprint from the customization footprint.

S-LoRA manages adapter weights using highly efficient page pools, directly extending the paradigm of PagedAttention.

Punica SGMV kernels fuse the low-rank multiplications for completely heterogeneous requests into a single, high-efficiency GPU kernel launch.

Joint compression and intelligent clustering of LoRAs can further reduce serving costs and memory overhead, maximizing tenant density.
