Service Discovery

Service discovery dynamically maps logical service names to the ephemeral IP addresses of instances in cloud environments.

TL;DR
  • Service discovery dynamically maps logical service names to the ephemeral IP addresses of instances in cloud environments.
  • Client-side discovery reduces network hops but couples application code to registry client libraries and health-checking logic.
  • DNS-based discovery is simple but plagued by aggressive OS, JVM, and container-level caching that ignores low TTLs.
  • Rolling deployments require explicit connection draining and graceful deregistration to prevent blackholing user traffic.

The Problem

In modern containerized and cloud-native environments, application instances are highly ephemeral. Autoscaling groups, Kubernetes pods, and spot instances are constantly created, destroyed, and rescheduled with dynamic, unpredictable IP addresses.

If a microservice calls another service using static configuration files or hardcoded IPs, connections fail as soon as the target instances terminate. Conversely, if all traffic is routed through a traditional static load balancer, that load balancer becomes a single point of failure and a performance bottleneck. The system needs a highly available, dynamic mechanism that tracks the location and health of service instances in real time, without manual intervention.

Core System Idea

Service discovery solves this by introducing a dynamic service registry—a highly available database containing the network locations of all active service instances. The architecture operates on three core components: Registration, Health Checking, and Querying.

When an instance boots, it registers its IP, port, and metadata with the service registry. The registry continuously monitors the instance's health via active polling or by requiring the instance to send periodic "heartbeat" keep-alives. If an instance fails a health check or stops heartbeating, the registry marks it as unhealthy and removes it from the active pool.
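The registration and heartbeat cycle described above can be sketched as a toy in-memory registry. This is a minimal illustration, not how any particular product implements it: the `ServiceRegistry` class, its method names, and the 30-second `heartbeat_ttl` are assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class Instance:
    service: str
    host: str
    port: int
    last_heartbeat: float


class ServiceRegistry:
    """Toy registry (hypothetical API): register, heartbeat, query healthy instances."""

    def __init__(self, heartbeat_ttl: float = 30.0, clock=time.monotonic):
        self.heartbeat_ttl = heartbeat_ttl  # seconds of silence before an instance expires
        self.clock = clock                  # injectable so tests can control time
        self._instances: dict[tuple[str, str, int], Instance] = {}

    def register(self, service: str, host: str, port: int) -> None:
        self._instances[(service, host, port)] = Instance(service, host, port, self.clock())

    def heartbeat(self, service: str, host: str, port: int) -> bool:
        inst = self._instances.get((service, host, port))
        if inst is None:
            return False  # unknown or already expired: the caller should re-register
        inst.last_heartbeat = self.clock()
        return True

    def deregister(self, service: str, host: str, port: int) -> None:
        self._instances.pop((service, host, port), None)

    def healthy(self, service: str) -> list[tuple[str, int]]:
        now = self.clock()
        # Evict every instance whose last heartbeat is older than the TTL.
        expired = [key for key, inst in self._instances.items()
                   if now - inst.last_heartbeat > self.heartbeat_ttl]
        for key in expired:
            del self._instances[key]
        return sorted((inst.host, inst.port)
                      for inst in self._instances.values()
                      if inst.service == service)
```

A production registry would also replicate this state across nodes and often combines heartbeats with active polling; the sketch only shows the expiry rule.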

When a client service needs to call a downstream dependency, it queries the registry to obtain a list of healthy IP addresses. In Client-Side Discovery, the client queries the registry directly, caches the results locally, and uses a client-side load balancing library to distribute requests. In Server-Side Discovery, the client routes requests through a dedicated proxy (like an API gateway or internal load balancer) which queries the registry and forwards the traffic.
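As a rough sketch of the client-side variant, the class below caches the registry's answer for a short time and rotates through the returned endpoints. The `DiscoveryClient` name, the `lookup` callable (standing in for a real registry query), and the 5-second `cache_ttl` are all hypothetical choices for illustration.

```python
import time
from typing import Callable

Endpoint = tuple[str, int]


class DiscoveryClient:
    """Client-side discovery sketch: cached registry lookups plus round-robin selection."""

    def __init__(self, lookup: Callable[[str], list[Endpoint]],
                 cache_ttl: float = 5.0, clock=time.monotonic):
        self._lookup = lookup        # stand-in for a real registry query
        self._cache_ttl = cache_ttl  # how long a cached endpoint list is trusted
        self._clock = clock
        self._cache: dict[str, tuple[float, list[Endpoint]]] = {}
        self._counters: dict[str, int] = {}

    def _endpoints(self, service: str) -> list[Endpoint]:
        now = self._clock()
        cached = self._cache.get(service)
        if cached is None or now - cached[0] >= self._cache_ttl:
            endpoints = self._lookup(service)  # only hit the registry on a cache miss
            self._cache[service] = (now, endpoints)
            return endpoints
        return cached[1]

    def pick(self, service: str) -> Endpoint:
        endpoints = self._endpoints(service)
        if not endpoints:
            raise LookupError(f"no healthy instances for {service!r}")
        n = self._counters.get(service, 0)
        self._counters[service] = n + 1
        return endpoints[n % len(endpoints)]  # simple round-robin
```

In the server-side variant the same lookup and selection logic would live in the proxy, and the application would only know the proxy's address.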

System Flow

flowchart TD
    A[Service Instance] -->|"1. Register IP & Port"| B(Service Registry)
    B -->|2. Active Health Check| A
    C[Client Service] -->|3. Query Healthy IPs| B
    B -->|4. Return IP List| C
    C -->|"5. Cache & Load Balance"| C
    C -->|"6. Direct HTTP/gRPC Call"| A

In a client-side service discovery pattern, the client queries the registry for healthy IPs, caches them, and routes traffic directly to the target instance.

Real-World Examples (Indicative)

HashiCorp Consul: gossip convergence in 2s on a 500-node cluster, 10s health-check interval with deregistration after 3 failed checks

Consul uses the Serf gossip protocol for cluster membership and failure propagation. Each node gossips to a random subset of about 3 peers every 200 ms. In HashiCorp's 2023 production configuration for their Consul cloud service, a 500-node cluster reaches full cluster-state convergence within 2 seconds of a node failure. Health checks run every 10 seconds with a 3-check deregistration threshold, so an instance is removed about 30 seconds after it starts failing. During AWS us-east-1 availability zone failures in 2022, Consul's failure detection removed unhealthy nodes from the service catalog within 35 seconds (3 failed checks at a 10s interval, plus ~5s of gossip propagation), which kept load balancers from routing traffic to dead instances during the degraded AZ period.
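The timing figures above can be sanity-checked with some back-of-the-envelope arithmetic. The helper names below are hypothetical, and the gossip model is deliberately idealized (no packet loss, no duplicate targets), so it gives an optimistic lower bound rather than a prediction of real Serf behavior.

```python
import math


def gossip_rounds(nodes: int, fanout: int) -> int:
    """Idealized rounds until all nodes have heard an update.

    Assumes every informed node reaches `fanout` uninformed peers per round,
    so the informed set grows by a factor of (fanout + 1) each round.
    """
    return math.ceil(math.log(nodes) / math.log(fanout + 1))


def removal_latency_s(check_interval_s: float, failed_checks: int,
                      propagation_s: float) -> float:
    """Time from first failure to catalog removal: failed checks plus propagation."""
    return check_interval_s * failed_checks + propagation_s


rounds = gossip_rounds(nodes=500, fanout=3)       # 5 rounds in the idealized model
ideal_convergence_s = rounds * 0.2                # 200 ms per round, about 1 second
removal_s = removal_latency_s(10, 3, 5)           # 3 x 10s + 5s = 35 seconds
```

The idealized one-second figure sits comfortably under the quoted 2-second convergence, which is what one would expect once real-world message loss and redundant gossip targets are factored in.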

Kubernetes CoreDNS: kube-proxy iptables DNAT, JVM DNS cache can lag endpoint updates by up to 30s

Kubernetes implements server-side service discovery using CoreDNS as the in-cluster DNS server. When a Pod resolves payment-service.default.svc.cluster.local, CoreDNS returns the ClusterIP, a stable virtual IP that does not change across Pod restarts. In iptables mode, kube-proxy maintains DNAT rules on every node that translate the ClusterIP to a ready Pod IP chosen at random, and it updates those rules within seconds of Pod termination. The critical operational pitfall is client-side DNS caching, which bites when clients resolve Pod IPs directly (for example through a headless Service) rather than the stable ClusterIP. The JVM caches successful lookups according to networkaddress.cache.ttl, which typically defaults to 30 seconds when no security manager is installed, so Java services can keep connecting to terminated Pod IPs for up to 30 seconds after the endpoints have changed. During rolling deployments this shows up as connection refused errors until the JVM cache expires. This is not a CoreDNS issue but a client-side cache problem, and it is fixed by lowering the JVM cache TTL explicitly, for example with -Dsun.net.inetaddr.ttl=0 in production JVM configurations.

Netflix Eureka: self-preservation mode below an 85% heartbeat threshold prevents mass deregistration

Eureka implements an AP service registry in which each server instance operates independently, accepting registrations and serving queries even when it cannot communicate with peer servers. Netflix's most critical production safeguard is Eureka's self-preservation mode: if the Eureka server stops receiving heartbeats from more than 15% of registered instances within a 90-second window, it assumes a network partition (not a mass instance failure) and freezes the registry, refusing to deregister any instances. In 2012, Netflix discovered that, without self-preservation mode, a 90-second network partition between Eureka and its clients caused the registry to deregister all 1,200+ microservice instances, blackholing 100% of inter-service traffic. Self-preservation mode was introduced specifically to prevent this failure mode, trading consistency (stale registrations) for availability (no mass blackout).
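The decision rule behind self-preservation can be sketched in a few lines. This is a simplified model that assumes one expected heartbeat per instance per window; the `SelfPreservation` class and its field names are hypothetical, and Eureka's actual bookkeeping differs in detail.

```python
from dataclasses import dataclass


@dataclass
class SelfPreservation:
    """Simplified self-preservation check (illustrative, not Eureka's real code)."""

    registered_instances: int
    renewal_threshold: float = 0.85  # fraction of expected heartbeats that must arrive

    def may_evict(self, heartbeats_received: int) -> bool:
        """Return True if expired instances may be evicted in this window.

        Returns False when too few heartbeats arrived, which is treated as a
        probable network partition: the registry then keeps every registration.
        """
        if self.registered_instances == 0:
            return True
        # Assumption: each instance is expected to send one heartbeat per window.
        expected = self.registered_instances
        return heartbeats_received >= self.renewal_threshold * expected


guard = SelfPreservation(registered_instances=1200)
normal_churn = guard.may_evict(heartbeats_received=1100)   # True: evict the silent 100
likely_partition = guard.may_evict(heartbeats_received=900)  # False: freeze the registry
```

The asymmetry is the point of the design: losing a few instances is treated as real failure, while losing many at once is treated as the registry's own connectivity problem.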

Anti-Patterns

Relying on Default JVM DNS Caching

Failing to override the Java Virtual Machine's default DNS caching policy, which historically cached successful resolutions indefinitely (and still does when a security manager is installed), completely defeating DNS-based service discovery.

Hard-Killing Instances During Deploys

Terminating container instances abruptly without sending a deregistration signal to the registry, causing clients to route traffic to dead IPs for several seconds (blackholing).

Gossip Protocol Overload

Deploying thousands of nodes in a single flat gossip network without partitioning, causing control-plane network traffic to consume significant CPU and bandwidth.

Using Registry Queries as a Hard Dependency

Querying the central service registry synchronously on every single incoming API request instead of utilizing local client-side caching with background updates.
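One way to avoid the hard dependency is to serve every lookup from a local snapshot that a background loop refreshes, keeping the last good snapshot when the registry is unreachable. The sketch below assumes a hypothetical `fetch` callable that returns the whole catalog as a dict; the class and method names are illustrative only.

```python
import threading
from typing import Callable


class CachedCatalog:
    """Local snapshot of the registry; request paths never call the registry directly."""

    def __init__(self, fetch: Callable[[], dict[str, list[str]]]):
        self._fetch = fetch  # stand-in for a real registry call
        self._lock = threading.Lock()
        self._snapshot: dict[str, list[str]] = {}

    def refresh(self) -> bool:
        """Pull a fresh snapshot; on failure keep serving the previous one."""
        try:
            fresh = self._fetch()
        except Exception:
            return False  # registry unreachable: stale data beats no data
        with self._lock:
            self._snapshot = fresh
        return True

    def endpoints(self, service: str) -> list[str]:
        """Answer from memory only, so the request path has no registry round trip."""
        with self._lock:
            return list(self._snapshot.get(service, []))

    def run_forever(self, interval_s: float, stop: threading.Event) -> None:
        """Background loop: refresh every `interval_s` seconds until `stop` is set."""
        while not stop.wait(interval_s):
            self.refresh()
```

In practice `run_forever` would be started on a daemon thread at boot, for example `threading.Thread(target=catalog.run_forever, args=(5.0, stop), daemon=True).start()`, with an initial `refresh()` before the service accepts traffic.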

Design Tradeoffs

| Dimension | Client-Side Discovery | Server-Side Discovery |
|---|---|---|
| Network hops per request | One; the client queries the registry once, caches results, and routes directly to the target service instance | Two; every request routes through a load balancer or proxy layer that internally queries the registry before forwarding |
| Client coupling | High; every service must embed a registry-specific SDK, health-check logic, and load-balancing library for each programming language | Low; clients use standard HTTP/gRPC; all routing intelligence is centralized in the proxy layer, transparent to application code |
| Registry outage resilience | High; clients continue routing using their local in-memory cached IP lists during full registry downtime | Low; if the proxy or load balancer layer fails, all inter-service communication halts regardless of registry health |

Best Practices

Implement Graceful Shutdown Hooks

Ensure application containers catch termination signals (SIGTERM), immediately deregister themselves from the service registry, and drain active connections before exiting.

Configure Low DNS TTLs

If using DNS-based discovery, set DNS Time-To-Live (TTL) values to 0 or a few seconds, and configure client runtimes to respect these limits.

Use Local Caching with Push Updates

Design clients to cache the service registry locally and subscribe to push-based updates (e.g., via long-polling or gRPC streams) to receive immediate notifications of topology changes.

Isolate Control Plane Traffic

Run service registry traffic on a dedicated network interface or prioritize it using QoS rules to prevent application data traffic from starving health checks.
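To make the graceful-shutdown practice concrete, here is a minimal sketch of a shutdown coordinator. The `GracefulShutdown` class, the `deregister` callback (a stand-in for a real registry call), and the 10-second drain timeout are assumptions for illustration. The signal handler only sets a flag, and the main thread does the actual work, which avoids taking locks inside the handler.

```python
import signal
import threading
import time


class GracefulShutdown:
    """Deregister first, refuse new requests, then wait for in-flight requests to finish."""

    def __init__(self, deregister, drain_timeout: float = 10.0):
        self._deregister = deregister        # hypothetical callback into the registry
        self._drain_timeout = drain_timeout  # max seconds to wait for in-flight work
        self._in_flight = 0
        self._lock = threading.Lock()
        self._stopping = False
        self.signalled = threading.Event()

    def install(self) -> None:
        # The handler only sets an event; call wait_for_signal() from the main thread.
        signal.signal(signal.SIGTERM, lambda signum, frame: self.signalled.set())

    def request_started(self) -> bool:
        with self._lock:
            if self._stopping:
                return False  # caller should reject the request (e.g. HTTP 503)
            self._in_flight += 1
            return True

    def request_finished(self) -> None:
        with self._lock:
            self._in_flight -= 1

    def shutdown(self) -> bool:
        """Run the shutdown sequence; returns True if all requests drained in time."""
        self._deregister()  # stop new traffic at the source first
        with self._lock:
            self._stopping = True
        deadline = time.monotonic() + self._drain_timeout
        while time.monotonic() < deadline:
            with self._lock:
                if self._in_flight == 0:
                    return True
            time.sleep(0.05)
        with self._lock:
            return self._in_flight == 0

    def wait_for_signal(self) -> bool:
        self.signalled.wait()
        return self.shutdown()
```

A server's main thread would call `install()` at startup and then block in `wait_for_signal()`, while request handlers wrap their work in `request_started()` and `request_finished()`.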

When to Use / Avoid

| Use When | Avoid When |
|---|---|
| You are running highly dynamic, containerized microservices on orchestrators where instances scale up and down frequently. | You are running a static, monolithic application on a fixed set of virtual machines with permanent IP addresses. |
| You need advanced routing capabilities, such as canary deployments, blue-green deploys, or latency-based routing. | The network architecture is simple, and a standard cloud load balancer (e.g., AWS ALB) can easily handle the traffic volume. |
| You are implementing polyglot microservices where different teams use different programming languages (use sidecar proxies or a service mesh). | The overhead of managing a dedicated consensus-based registry cluster (like Consul or etcd) exceeds your team's operational capacity. |