Ray Serve and Distributed Serving
Ray Serve orchestrates complex, distributed inference graphs, specifically designed for multi-node and multi-GPU deployments.
Source: mortalapps.com
- Its core purpose is enabling seamless horizontal scaling and the precise topological placement of serving actors across a diverse cluster.
- The primary optimization idea relies on "Placement Groups" to logically reserve and strictly colocate GPU and CPU resources before deploying models.
- The key engineering concern is avoiding resource allocation conflicts when an underlying engine (such as vLLM) tries to spawn its own distributed actors inside a reservation that Ray already holds.
Why This Matters
When an LLM exceeds the VRAM capacity of a single GPU, it has to be split with Tensor Parallelism (TP) or Pipeline Parallelism (PP) across multiple GPUs, or even multiple nodes. Orchestrating this topology at scale requires a higher-level engine that can schedule components, manage ingress routing, and hold to strict Service Level Agreements (SLAs). The same applies to large mixture-of-experts (MoE) models such as DeepSeek or Qwen when they are served with Wide Expert Parallelism (Wide-EP). Ray Serve provides this orchestration layer.
Core Intuition
Think of Ray as a distributed Python execution engine, essentially an operating system for cluster computing. Ray Serve layers API ingress, traffic routing, and autoscaling logic on top of this foundation. To keep a distributed LLM running efficiently without latency spikes, Ray uses placement groups. A placement group acts as a binding contract: the required resources (for example, four GPUs spread across two nodes) are reserved and scheduled together, so either every part of the group gets its resources or none does. If any individual actor within the group fails, the placement group design can ensure that the entire coordinated group restarts together, which prevents split-brain scenarios where only part of the model is running.
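The all-or-nothing behaviour described above can be sketched with a toy model in plain Python. This is not Ray's API: Node, reserve_group, and the node names are hypothetical, and the first-fit placement is a simplifying assumption rather than Ray's actual scheduling policy. The only point is that a group is committed as a whole or not at all.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical cluster node with logical resource counters."""
    name: str
    free_gpus: int
    free_cpus: int


def reserve_group(nodes, bundles):
    """Reserve every bundle or reserve nothing.

    Each bundle is a dict such as {"GPU": 2, "CPU": 4}. Returns the node name
    chosen for each bundle, or None if any bundle cannot be placed.
    Placement is greedy first-fit, which is an assumption of this sketch.
    """
    # Work on a scratch copy so a failed attempt never changes cluster state.
    scratch = {n.name: [n.free_gpus, n.free_cpus] for n in nodes}
    placement = []
    for bundle in bundles:
        gpus, cpus = bundle.get("GPU", 0), bundle.get("CPU", 0)
        for name, free in scratch.items():
            if free[0] >= gpus and free[1] >= cpus:
                free[0] -= gpus
                free[1] -= cpus
                placement.append(name)
                break
        else:
            return None  # one bundle did not fit, so the whole group is rejected
    # Every bundle fit: commit the scratch state to the real nodes.
    for n in nodes:
        n.free_gpus, n.free_cpus = scratch[n.name]
    return placement


cluster = [
    Node("node-a", free_gpus=2, free_cpus=8),
    Node("node-b", free_gpus=2, free_cpus=8),
]

# Three 2-GPU bundles need six GPUs but only four exist: nothing is reserved.
too_big = reserve_group(cluster, [{"GPU": 2, "CPU": 4}] * 3)

# Two 2-GPU bundles fit, one per node: four GPUs across two nodes.
placed = reserve_group(cluster, [{"GPU": 2, "CPU": 4}] * 2)
```

The failed three-bundle request leaves both nodes untouched, which is why the later two-bundle request still succeeds. A model shard never ends up holding half of the GPUs it needs.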
Technical Deep Dive
Ray manages cluster resources logically rather than at the hardware driver level. When an engine like vLLM is configured with Ray Serve for Tensor Parallelism (e.g., tensor_parallel_size=2), Ray provisions a primary actor bound to a placement group. However, vLLM's internal EngineCore also tries to spawn its own Ray workers to handle the parallel execution. The result is a nested resource conflict: the outer Ray Serve placement group holds the GPUs, but the inner vLLM engine fails to initialize because it sees no unallocated resources left in the cluster. Resolving this conflict is central to distributed serving stability.
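The nested conflict can be reproduced with a small bookkeeping model in plain Python. This is only an illustration under stated assumptions: LogicalCluster, Reservation, and start_worker are hypothetical names, not Ray or vLLM APIs, and the second half shows one plausible way out (starting the inner workers inside the outer reservation), which the text above does not name as the actual fix.

```python
class Reservation:
    """A hypothetical block of GPUs already carved out of the cluster."""

    def __init__(self, gpus):
        self.free_gpus = gpus

    def start_worker(self, gpus=1):
        """Place a worker inside this reservation instead of asking the cluster."""
        if gpus > self.free_gpus:
            raise RuntimeError("reservation is full")
        self.free_gpus -= gpus


class LogicalCluster:
    """Tracks GPUs as bookkeeping counters, with no knowledge of the hardware."""

    def __init__(self, total_gpus):
        self.unreserved_gpus = total_gpus

    def reserve(self, gpus):
        """Take GPUs from the unreserved pool, or raise if too few remain."""
        if gpus > self.unreserved_gpus:
            raise RuntimeError(
                f"need {gpus} GPU(s), only {self.unreserved_gpus} unreserved"
            )
        self.unreserved_gpus -= gpus
        return Reservation(gpus)


tensor_parallel_size = 2
cluster = LogicalCluster(total_gpus=2)

# The outer serving layer reserves both GPUs for the deployment.
outer = cluster.reserve(tensor_parallel_size)

# The inner engine naively asks the cluster for fresh GPUs and finds none,
# even though the physical GPUs are idle inside the outer reservation.
try:
    cluster.reserve(tensor_parallel_size)
    naive_error = None
except RuntimeError as exc:
    naive_error = str(exc)

# Assumed resolution for this sketch: run the inner workers inside the
# reservation that the outer layer already holds.
for _ in range(tensor_parallel_size):
    outer.start_worker(gpus=1)
```

The naive request fails purely because of the accounting: no hardware is busy, but the logical pool is empty. Placing the workers inside the existing reservation uses the same two GPUs without asking the cluster for more.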