
Configuration Management at Scale

TL;DR
  • Centralized configuration stores must be backed by local disk caching to survive control plane outages.
  • Push-based propagation minimizes lag but can trigger cascading outages if bad config is pushed globally.
  • Feature flags are operational levers that must be decoupled from code deployments and audited continuously.
  • Automated configuration drift detection is critical to ensure running instances match the declared target state.

The Problem

In a large-scale distributed system with thousands of microservice instances, managing configuration files by hand on each server is impractical. If configuration is baked directly into container images, even a simple change (such as updating a database connection string or toggling a feature flag) requires a full, slow CI/CD deployment cycle.

Conversely, if instances fetch their configuration dynamically from a central database on every request, that database becomes a single point of failure and a massive performance bottleneck. Propagation lag adds a second risk: while a configuration change is rolling out, different instances run with mismatched settings at the same time, which can lead to routing errors, data corruption, or cascading system failures.

Core System Idea

Scalable configuration management requires a centralized authority with decentralized execution. The architecture separates the Control Plane (where configurations are defined and stored) from the Data Plane (the running application instances).

The core pattern involves a highly available, consensus-backed configuration store (like etcd or Consul) coupled with local configuration agents running on each application host. When an operator updates a configuration, the central store generates a new versioned snapshot.

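Instead of applications polling the central store directly, the local agents subscribe to change notifications from the store (for example via long-polling, Server-Sent Events, or WebSockets), so updates arrive as they happen rather than on a fixed polling schedule.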
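As a rough illustration of the subscription model, the sketch below simulates a versioned store with a blocking "wait for a version newer than the one I have" call, which is the essence of long-polling. `VersionedStore`, `commit`, `wait_for_change`, and `agent_loop` are hypothetical names for this example, not the API of etcd or Consul, and the store lives in process memory rather than behind a network.

```python
import threading


class VersionedStore:
    """In-memory stand-in for a central config store (hypothetical API)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._version = 0
        self._snapshot = {}

    def commit(self, config):
        """Store a new snapshot and wake every waiting agent."""
        with self._cond:
            self._version += 1
            self._snapshot = dict(config)
            self._cond.notify_all()
            return self._version

    def wait_for_change(self, known_version, timeout):
        """Block until a version newer than known_version exists, or time out.

        Returns (version, snapshot); version equals known_version on timeout.
        """
        with self._cond:
            self._cond.wait_for(lambda: self._version > known_version, timeout)
            return self._version, dict(self._snapshot)


def agent_loop(store, seen, rounds):
    """Long-poll loop: each call returns only on a change or a timeout."""
    version = 0
    for _ in range(rounds):
        new_version, snapshot = store.wait_for_change(version, timeout=1.0)
        if new_version > version:
            version = new_version
            seen.append((version, snapshot))
```

Because the agent passes the last version it saw, a change committed while the agent was disconnected is still delivered on the next call instead of being missed.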

Upon receiving an update, the agent validates the new configuration against its schema, writes it to a local disk cache, and then signals the application to reload the settings in memory (e.g., via a file watcher or an internal API endpoint). If the central configuration store goes down, the application continues to run on the last known good configuration in its local disk cache, preserving availability.
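The agent's update path can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a production agent: the schema in `REQUIRED_KEYS`, the JSON cache format, and the function names are all hypothetical. It validates before writing, writes the cache atomically so the application never reads a half-written file, and falls back to the cache when the store is unreachable.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"db_url": str, "timeout_ms": int}  # hypothetical schema


def validate(config):
    """Raise ValueError if a required key is missing or has the wrong type."""
    for key, expected in REQUIRED_KEYS.items():
        if not isinstance(config.get(key), expected):
            raise ValueError(f"invalid or missing key: {key}")


def write_cache_atomically(config, cache_path):
    """Write to a temp file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(config, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, cache_path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def apply_update(config, cache_path, reload_app):
    """Validate first; only a valid config reaches the cache and the app."""
    validate(config)
    write_cache_atomically(config, cache_path)
    reload_app(config)


def load_config(fetch_remote, cache_path):
    """Prefer the central store; fall back to the disk cache if it is down."""
    try:
        config = fetch_remote()
        validate(config)
        write_cache_atomically(config, cache_path)
        return config
    except (ConnectionError, ValueError):
        with open(cache_path) as f:
            return json.load(f)
```

Note that `load_config` also falls back to the cache when the store returns an invalid config, so a bad push does not replace the last known good copy.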

System Flow

flowchart TD
    A[Operator] -->|1. Commit Config Change| B(Central Config Store)
    B -->|2. Push Change Event| C(Local Agent)
    C -->|3. Fetch New Config| B
    C -->|4. Validate Schema| C
    C -->|5. Write to Local Cache| D(Local Disk Cache)
    C -->|6. Signal Reload| E(Application Instance)
    E -->|7. Read Config| D

The local agent fetches, validates, and caches configuration changes locally before signaling the application to reload, isolating the application from central store outages.

Real-World Examples (Indicative)

LaunchDarkly — SSE streaming, <200ms flag propagation to 10M+ SDK instances, 20B+ evaluations/day

LaunchDarkly's SDK architecture uses Server-Sent Events (SSE) to push feature flag updates to server-side SDKs in real time. When an operator toggles a flag in the LaunchDarkly dashboard, the change propagates to LaunchDarkly's edge network and streams to all connected SDK instances within 200ms (P99). Each SDK maintains an in-memory flag store populated by the SSE stream, with a local disk cache as a fallback for resuming after an SSE reconnection. At its 2023 scale of 20B+ flag evaluations/day across 10M+ connected SDK instances, LaunchDarkly uses a Redis-backed Pub/Sub fanout layer to broadcast flag changes from a single operator action to all regional streaming servers, which keeps the central configuration database from becoming the broadcast bottleneck during large customer deployments.
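The idea of an in-memory flag store fed by a stream can be sketched as below. This is an illustrative assumption about how such a store might behave, not LaunchDarkly's SDK API: `FlagStore`, `apply_event`, and `evaluate` are hypothetical names, and the per-flag version number is an assumed mechanism for discarding stale events replayed after a reconnect.

```python
import threading


class FlagStore:
    """In-memory flag store fed by a stream of versioned update events."""

    def __init__(self):
        self._lock = threading.Lock()
        self._flags = {}  # key -> (version, value)

    def apply_event(self, key, version, value):
        """Apply an update unless the same or a newer version is already stored."""
        with self._lock:
            current = self._flags.get(key)
            if current is not None and current[0] >= version:
                return False  # stale or duplicate event, e.g. after a reconnect
            self._flags[key] = (version, value)
            return True

    def evaluate(self, key, default):
        """Evaluate locally; no network call is made on the request path."""
        with self._lock:
            entry = self._flags.get(key)
        return default if entry is None else entry[1]
```

Evaluation reads only local memory, which is why the number of evaluations per day can be far larger than the number of flag changes streamed to each instance.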

Netflix Archaius — dynamic Hystrix circuit breaker threshold tuning at runtime, 60s DynamoDB polling across 700+ instances

Netflix Archaius is a configuration library that polls configuration sources (AWS DynamoDB, local properties files, and Netflix's internal "Persisted Properties" service) on a configurable interval (default 60 seconds). During the 2015 Netflix Christmas Day outage, engineers used Archaius to dynamically reduce Hystrix circuit breaker requestVolumeThreshold values at runtime — from 20 requests per 10 seconds to 5 — without any code deployment. This caused overloaded microservices to trip circuit breakers faster, shedding cascading load. The Archaius property change propagated to 700+ microservice instances within 60 seconds via DynamoDB polling, a critical operational lever unavailable to teams that hard-coded Hystrix configuration in YAML files requiring full deployments to change.
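To see why lowering the request volume threshold makes a breaker trip sooner, consider the simplified sketch below. It is not Hystrix or Archaius code: `DynamicProperty` and `CircuitBreaker` are hypothetical classes, the rolling time window is omitted, and the 50% error threshold is an assumed value. The point it illustrates is that the breaker reads the threshold on every check, so a runtime change takes effect without a restart.

```python
class DynamicProperty:
    """A value that can be changed at runtime (hypothetical, not the Archaius API)."""

    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value

    def set(self, value):
        self._value = value


class CircuitBreaker:
    """Trips when the window has enough requests AND too many of them failed."""

    def __init__(self, volume_threshold, error_percent_threshold=50):
        self.volume_threshold = volume_threshold  # a DynamicProperty
        self.error_percent_threshold = error_percent_threshold
        self.successes = 0
        self.failures = 0

    def record(self, ok):
        """Count one request outcome in the current window."""
        if ok:
            self.successes += 1
        else:
            self.failures += 1

    def is_open(self):
        total = self.successes + self.failures
        # The threshold is read on every check, so a runtime change applies at once.
        if total < self.volume_threshold.get():
            return False
        return 100 * self.failures / total >= self.error_percent_threshold
```

With a threshold of 20, six consecutive failures are not enough traffic to trip the breaker; after the threshold is set to 5 at runtime, the same six failures open it immediately.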

Meta Configerator — Thrift-defined config schemas, BitTorrent-style distribution to 2M+ servers, canary-then-fleet rollout

Meta's Configerator system defines configuration schemas in Thrift, enforcing strict type safety and preventing the silent corruption that free-form YAML allows. When an engineer commits a config change, Configerator's CI pipeline validates the compiled config against a test cluster before any production push. The validated config distributes to Meta's 2M+ production servers using a BitTorrent-inspired peer-to-peer protocol: the central store seeds to ~100 "super-peer" servers per datacenter, which distribute to all local servers within 60 seconds. This avoids the bandwidth bottleneck of 2M+ simultaneous direct fetches from a central source. All high-risk changes use a two-phase rollout (1% canary, then full fleet) with automated rollback triggered if the error rate rises above a threshold.
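A two-phase rollout with automated rollback can be sketched in a few lines. This is a generic illustration, not Configerator's implementation: `staged_rollout`, the `apply` and `error_rate` callbacks, and the 5% error threshold are hypothetical, and a real system would wait and observe metrics between stages rather than checking immediately.

```python
def staged_rollout(hosts, new_config, apply, error_rate,
                   canary_fraction=0.01, max_error_rate=0.05):
    """Apply new_config to a canary slice, then the rest; roll back on errors.

    apply(host, config) pushes a config to one host and returns the previous one.
    error_rate(hosts) returns the observed error rate for those hosts.
    Returns True if the rollout completed, False if it was rolled back.
    """
    canary_count = max(1, int(len(hosts) * canary_fraction))
    stages = [hosts[:canary_count], hosts[canary_count:]]
    previous = {}
    for stage in stages:
        for host in stage:
            previous[host] = apply(host, new_config)
        if stage and error_rate(stage) > max_error_rate:
            for host, old_config in previous.items():
                apply(host, old_config)  # roll back every host touched so far
            return False
    return True
```

If the canary stage fails, only the canary hosts ever saw the bad config, which is the blast-radius limit the rollout is designed to provide.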

Anti-Patterns

Hard-Restarting Apps for Config Changes

Requiring a full process restart to apply minor configuration updates, which causes unnecessary traffic drops and service instability.
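One common alternative to restarting is to hold the configuration as an immutable snapshot and swap the reference when a reload is signaled. The sketch below assumes a hypothetical `ConfigHolder` class; it shows the swap and a change listener, and leaves out how the reload signal arrives.

```python
import threading
from types import MappingProxyType


class ConfigHolder:
    """Holds the current config as a read-only snapshot that is swapped atomically."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._snapshot = MappingProxyType(dict(initial))
        self._listeners = []

    def get(self):
        """Return the current snapshot; callers never observe a half-applied update."""
        return self._snapshot

    def on_change(self, listener):
        """Register a callback invoked as listener(old_snapshot, new_snapshot)."""
        self._listeners.append(listener)

    def reload(self, new_config):
        """Swap in a new snapshot and notify listeners (e.g. to resize a pool)."""
        new_snapshot = MappingProxyType(dict(new_config))
        with self._lock:
            old, self._snapshot = self._snapshot, new_snapshot
        for listener in self._listeners:
            listener(old, new_snapshot)
```

A request that fetched a snapshot before the reload keeps using a consistent set of values until it finishes, while new requests pick up the new snapshot.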

Pushing Unvalidated Configurations

Allowing raw JSON or YAML files to be pushed to production without automated schema validation, leading to application crashes due to typos or missing keys.
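A pre-commit check for this can be quite small. The sketch below uses a hand-rolled schema for illustration; the `SCHEMA` contents and `validate_config` are hypothetical, and a real pipeline would more likely use JSON Schema or Protocol Buffer definitions. It catches the two failure modes named above: a missing key and a misspelled one.

```python
SCHEMA = {  # hypothetical schema: key -> (type, min, max)
    "timeout_ms": (int, 1, 60_000),
    "pool_size": (int, 1, 500),
    "db_host": (str, None, None),
}


def validate_config(config):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for key, (expected_type, low, high) in SCHEMA.items():
        if key not in config:
            errors.append(f"missing key: {key}")
            continue
        value = config[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            errors.append(f"{key}: expected {expected_type.__name__}")
        elif low is not None and not (low <= value <= high):
            errors.append(f"{key}: {value} outside [{low}, {high}]")
    for key in sorted(config.keys() - SCHEMA.keys()):
        errors.append(f"unknown key: {key} (possible typo)")
    return errors
```

Rejecting unknown keys is a deliberate choice here: a typo such as `timout_ms` would otherwise be silently ignored and the application would fall back to a default.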

Global Instant Rollouts

Propagating a configuration change to 100% of production instances simultaneously. If the configuration is bad, it will trigger a catastrophic, global outage instantly.

Baking Secrets into Config Files

Storing plaintext passwords, API keys, or database credentials in standard configuration repositories instead of using dedicated secret managers (like HashiCorp Vault).
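One way to keep secrets out of the config repository is to store only a reference and resolve it at load time. In the sketch below, the `secret://` prefix and `resolve_secrets` are hypothetical conventions for illustration, and `fetch_secret` stands in for a call to whatever secret manager client is in use; no real Vault API is shown.

```python
SECRET_PREFIX = "secret://"  # hypothetical reference scheme


def resolve_secrets(config, fetch_secret):
    """Return a copy of config with secret references replaced by real values.

    fetch_secret(path) stands in for a call to a secret manager client.
    """
    resolved = {}
    for key, value in config.items():
        if isinstance(value, dict):
            resolved[key] = resolve_secrets(value, fetch_secret)
        elif isinstance(value, str) and value.startswith(SECRET_PREFIX):
            resolved[key] = fetch_secret(value[len(SECRET_PREFIX):])
        else:
            resolved[key] = value
    return resolved
```

The config committed to the repository then contains only `"password": "secret://db/password"`, and the resolved value exists only in the memory of the running process.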

Design Tradeoffs

| Dimension | Push-Based Propagation | Pull-Based Polling |
| --- | --- | --- |
| Propagation latency | Near-instantaneous; the entire fleet receives configuration updates within seconds of a commit to the central store | Delayed; each instance only receives updates when its polling interval elapses, typically 30 seconds to 5 minutes |
| Control-plane complexity | High; the config store must maintain persistent connections (SSE/WebSockets) to thousands of agent instances | Low; agents make standard stateless HTTP requests on a schedule, with no persistent connection management required |
| Bad-config blast radius | High; a misconfiguration is pushed to the entire fleet within seconds, potentially crashing all instances simultaneously | Low; changes propagate gradually as instances poll, so early monitoring can catch bad configs before they reach the full fleet |
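The gradual spread under polling can be estimated with a simple model. The function below is a back-of-the-envelope sketch, not a measurement: it assumes every instance polls on the same interval with a uniformly random phase and that fetching and applying a change takes no time. `fraction_updated` is a hypothetical helper name.

```python
def fraction_updated(seconds_since_commit, poll_interval):
    """Share of a polling fleet that has picked up a change.

    Assumes each instance polls every poll_interval seconds with a uniformly
    random phase, and that fetching and applying the change takes no time.
    """
    if poll_interval <= 0:
        raise ValueError("poll_interval must be positive")
    return max(0.0, min(1.0, seconds_since_commit / poll_interval))
```

Under these assumptions, with a 60-second interval, `fraction_updated(15, 60)` is 0.25: about a quarter of the fleet is running the new config 15 seconds after the commit, which is the window in which monitoring can catch a bad change before it reaches everyone.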

Best Practices

Implement Multi-Stage Rollouts

Treat configuration changes exactly like code deployments. Roll out configuration updates progressively: first to a canary instance, then to a single cluster, and finally to the global fleet.

Enforce Strict Schema Validation

Define strict JSON Schema or Protocol Buffer definitions for all configurations. Validate every change at the build/commit stage and reject invalid configurations before they reach the store.

Decouple Feature Flags from Config

Use lightweight, dedicated feature flagging systems for business logic toggles, and reserve heavy configuration management systems for infrastructure settings (IPs, timeouts, pool sizes).

Automate Drift Detection

Run background processes to continuously compare the running configuration of active instances against the declared target state in the central repository, alerting on any manual, out-of-band changes.
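Drift detection reduces to comparing a fingerprint of each instance's running config against the fingerprint of the target. The sketch below assumes each instance can report its effective config as a JSON-compatible dictionary; `fingerprint` and `find_drift` are hypothetical helper names, and how the reports are collected is out of scope.

```python
import hashlib
import json


def fingerprint(config):
    """Stable hash of a config: key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(target_config, running_configs):
    """Return the sorted names of instances whose config differs from the target.

    running_configs maps instance name -> the config that instance reports.
    """
    expected = fingerprint(target_config)
    return sorted(
        name for name, config in running_configs.items()
        if fingerprint(config) != expected
    )
```

Comparing hashes rather than full documents keeps the check cheap enough to run continuously; an instance can report just its fingerprint, and the full config is fetched only when a mismatch needs investigating.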

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Managing large-scale microservice architectures where timeouts, rate limits, and routing tables must be adjusted dynamically without code deploys. | Running a small, monolithic application with a handful of instances where standard environment variables are sufficient. |
| Implementing continuous delivery pipelines that rely heavily on feature flags to decouple code releases from feature activation. | The network environment is highly restricted, and instances cannot maintain outbound connections to a central configuration authority. |
| Operating in highly dynamic cloud environments where downstream dependencies change their IP addresses and ports frequently. | Configuration settings are completely static and only change during major, scheduled software release windows. |