Multi-Model GPU Serving
Multi-Model GPU serving enables the concurrent execution of hundreds of fine-tuned adapters (LoRAs) over a single, shared base model footprint.
Source: mortalapps.com
- Its core purpose is to minimize VRAM overhead and maximize compute density for highly customized LLM SaaS deployments.
- The primary optimization idea combines paged adapter memory (S-LoRA) with specialized Segmented Gather Matrix-Vector (SGMV) kernels (Punica).
- The most important engineering insight is that the correct LoRA weights can be gathered for each individual request inside a single heavily batched matrix multiplication, so requests never need to be un-batched.
Why This Matters
Hosting a dedicated instance of a 70B-parameter model for every customer's fine-tune is prohibitively expensive. Because a Low-Rank Adaptation (LoRA) adapter adds only a tiny fraction of the total weights (often less than 50 MB) on top of the base model, a multi-model serving architecture lets an infrastructure provider hold the base model in VRAM exactly once. The system can then swap in customer-specific adapters on a per-request basis, which can turn a loss-leading AI SaaS offering into a profitable one.
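As a rough, back-of-the-envelope illustration of the saving, the sketch below compares dedicated instances against one shared base model plus adapters. The 2 bytes per parameter (FP16/BF16 weights), the 50 MB adapter size, and the figure of 100 customers are illustrative assumptions, and the estimate covers weights only (no KV cache or activation memory).

```python
# Back-of-the-envelope VRAM estimate; every input here is an illustrative assumption.
BYTES_PER_PARAM = 2        # assumes FP16/BF16 weights
BASE_PARAMS = 70e9         # 70B-parameter base model
ADAPTER_BYTES = 50e6       # assumes roughly 50 MB per LoRA adapter
NUM_CUSTOMERS = 100        # hypothetical number of customer fine-tunes

GB = 1e9
base_gb = BASE_PARAMS * BYTES_PER_PARAM / GB

# One full copy of the base model per customer.
dedicated_gb = NUM_CUSTOMERS * base_gb
# One shared base model plus one small adapter per customer.
shared_gb = base_gb + NUM_CUSTOMERS * ADAPTER_BYTES / GB

print(f"base model weights:  {base_gb:,.0f} GB")
print(f"dedicated instances: {dedicated_gb:,.0f} GB")
print(f"shared base + LoRAs: {shared_gb:,.0f} GB")
```

Under these assumptions the weights alone drop from about 14,000 GB to about 145 GB, which is why the adapters, not the base model, become the unit that scales with customer count.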
Core Intuition
Think of the base LLM as an automobile engine block and the LoRA adapters as different electronic tuning profiles. In standard execution, you flash one profile into the block (that is, you merge the adapter into the base weights), so the engine runs one specific profile at a time. S-LoRA and Punica remove this restriction, which comes from merging the weights rather than from the hardware itself. They allow a single large batch of inputs, originating from different users who want different tuning profiles, to hit the base model together. A specialized CUDA kernel then applies the matching LoRA correction to each sequence's hidden states within that same batch, without ever separating them.
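The intuition can be checked numerically with a minimal NumPy sketch. It is not the S-LoRA or Punica implementation: the names `batched_lora` and `adapter_ids`, the tiny dimensions, and the omission of the usual LoRA scaling factor are all assumptions made for illustration. It shows that one shared base matmul plus a per-row gathered low-rank correction gives the same result as running each row alone through merged weights.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, rank, num_adapters = 16, 4, 3   # illustrative sizes only

W = rng.standard_normal((hidden, hidden))               # shared base weight
A = rng.standard_normal((num_adapters, hidden, rank))   # LoRA down-projections
B = rng.standard_normal((num_adapters, rank, hidden))   # LoRA up-projections

# One batch of 5 rows; adapter_ids[i] names the adapter that row i wants.
x = rng.standard_normal((5, hidden))
adapter_ids = np.array([2, 0, 0, 1, 2])

def batched_lora(x, adapter_ids):
    """Base matmul for the whole batch plus a per-row LoRA correction."""
    base = x @ W                                    # one dense matmul, shared by all rows
    down = np.einsum("bh,bhr->br", x, A[adapter_ids])       # x_i @ A_i per row
    delta = np.einsum("br,brh->bh", down, B[adapter_ids])   # (x_i @ A_i) @ B_i per row
    return base + delta

y = batched_lora(x, adapter_ids)

# Reference: handle each row separately with merged weights W + A_i @ B_i.
ref = np.stack([x[i] @ (W + A[a] @ B[a]) for i, a in enumerate(adapter_ids)])
assert np.allclose(y, ref)
```

The gather `A[adapter_ids]` materializes one adapter copy per row, which is wasteful; a fused kernel avoids that copy, but the arithmetic is the same.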
Technical Deep Dive
High-performance Multi-LoRA serving relies on two architectural pillars.

The first is Memory Paging (S-LoRA). Instead of attempting to store all adapters contiguously in VRAM, adapter weights are managed logically in a LoraPagePool. This allows adapters to be loaded from host CPU memory into GPU VRAM, and evicted again, according to the flow of incoming requests, much as Paged KV Caching does for attention state.

The second pillar is the execution of SGMV Kernels (Punica). The Punica kernel accepts a single large input tensor representing the hidden states for the entire batch. It uses a "segment" vector that marks which contiguous rows of the batch use which adapter, so rows from requests that share an adapter fall into the same segment. In a single kernel launch, it gathers the LoRA matrices for each segment and computes the low-rank multiplication across the entire batch, keeping SM utilization high.
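The two pillars can be sketched together in plain Python and NumPy. This is a simplified reference under stated assumptions, not the S-LoRA or Punica code: `AdapterPool` is a hypothetical LRU cache that stands in for paged adapter memory (real systems page fixed-size blocks and pin adapters that are in use), `sgmv_reference` is a hypothetical name, and the Python loop over segments stands in for what the real kernel does in one launch.

```python
from collections import OrderedDict
import numpy as np

class AdapterPool:
    """Hypothetical LRU pool standing in for paged adapter memory."""

    def __init__(self, host_store, capacity):
        self.host_store = host_store    # adapter_id -> (A, B), the "CPU memory" copy
        self.capacity = capacity        # max adapters resident on the "GPU"
        self.resident = OrderedDict()   # adapter_id -> (A, B), least recently used first

    def get(self, adapter_id):
        if adapter_id in self.resident:
            self.resident.move_to_end(adapter_id)       # mark as recently used
        else:
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)       # evict least recently used
            # Stands in for a host-to-device copy.
            self.resident[adapter_id] = self.host_store[adapter_id]
        return self.resident[adapter_id]

def sgmv_reference(x, seg_starts, seg_adapter_ids, pool):
    """Compute the LoRA delta for each contiguous segment of rows.

    Rows seg_starts[i]:seg_starts[i + 1] all use adapter seg_adapter_ids[i].
    """
    out = np.zeros_like(x)
    for i, adapter_id in enumerate(seg_adapter_ids):
        lo, hi = seg_starts[i], seg_starts[i + 1]
        A, B = pool.get(adapter_id)
        out[lo:hi] = x[lo:hi] @ A @ B       # low-rank product for this segment only
    return out

rng = np.random.default_rng(0)
hidden, rank = 8, 2                         # illustrative sizes only
host_store = {
    name: (rng.standard_normal((hidden, rank)), rng.standard_normal((rank, hidden)))
    for name in ["alice", "bob", "carol"]   # hypothetical customer adapters
}
pool = AdapterPool(host_store, capacity=2)

x = rng.standard_normal((6, hidden))        # 6 rows of hidden states in one batch
seg_starts = [0, 2, 6]                      # rows 0-1 use "bob", rows 2-5 use "alice"
seg_adapter_ids = ["bob", "alice"]
delta = sgmv_reference(x, seg_starts, seg_adapter_ids, pool)

pool.get("carol")                           # pool is full, so "bob" (least recent) is evicted
print(list(pool.resident))                  # ['alice', 'carol']
```

The returned `delta` is only the adapter correction; a serving stack would add it to the output of the shared base-model matmul for the same batch.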