
Zero Trust Architecture

Zero Trust replaces perimeter-based security with a "never trust, always verify" model for every request.

TL;DR
  • Mutual TLS (mTLS) secures inter-service communication, verifying both client and server identities.
  • Policy Enforcement Points (PEPs) must be decoupled from application code to ensure consistent authorization checks.
  • Automated certificate rotation and dynamic identity management are critical to managing the operational overhead of Zero Trust.

The Problem

Traditional security relies on a "castle-and-moat" model: once a request passes the external firewall, everything inside the internal network is trusted. If an attacker breaches the perimeter—via a compromised dependency, a phishing attack, or an SSRF vulnerability—they gain unrestricted access to the entire internal network. They can move laterally, query database instances, and call internal APIs completely unchecked, resulting in massive data breaches.

Core System Idea

Zero Trust Architecture operates on the assumption that the internal network is hostile. No user, device, or service is trusted by default, regardless of its physical or logical location. Every single request must be authenticated, authorized, and encrypted before access is granted.

At the service level, this is achieved using Mutual TLS (mTLS). Unlike standard TLS where only the client verifies the server, mTLS requires both the client and server to present cryptographically signed certificates to verify each other's identity.
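The difference between one-way TLS and mTLS comes down to one setting on the server side: whether it demands and verifies a client certificate. A minimal sketch using Python's standard `ssl` module follows; the function names and the optional file-path parameters are illustrative assumptions, and a real deployment would get its certificates from an automated issuer rather than static files.

```python
import ssl
from typing import Optional


def make_server_context(cert: Optional[str] = None, key: Optional[str] = None,
                        ca: Optional[str] = None) -> ssl.SSLContext:
    """Server side of mTLS: present our own certificate AND require one from the client."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Plain TLS servers default to CERT_NONE; CERT_REQUIRED is what makes it mutual.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca:
        ctx.load_verify_locations(cafile=ca)  # CA that signs client certificates
    if cert and key:
        ctx.load_cert_chain(certfile=cert, keyfile=key)
    return ctx


def make_client_context(cert: Optional[str] = None, key: Optional[str] = None,
                        ca: Optional[str] = None) -> ssl.SSLContext:
    """Client side of mTLS: verify the server and present our own certificate."""
    # PROTOCOL_TLS_CLIENT already enables CERT_REQUIRED and hostname checking.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if ca:
        ctx.load_verify_locations(cafile=ca)  # CA that signs server certificates
    if cert and key:
        ctx.load_cert_chain(certfile=cert, keyfile=key)  # our identity
    return ctx
```

With a server context built this way, the handshake fails for any client that cannot present a certificate chained to the configured CA, before any application data is exchanged.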

Authorization is handled by separating the Policy Decision Point (PDP) from the Policy Enforcement Point (PEP). The application or its sidecar proxy (PEP) intercepts the request, queries a centralized engine (PDP) to check if Service A is allowed to call Service B with specific parameters, and enforces the decision.
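The PDP/PEP split can be sketched in a few lines. This is a toy in-process model, not a real policy engine: the class names, the tuple-based rule table, and the status codes are assumptions chosen for illustration. In practice the PDP would usually be a separate service that the PEP queries over the network.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple


@dataclass(frozen=True)
class Request:
    caller: str  # authenticated identity of the calling service
    target: str  # service being called
    method: str
    path: str


class PolicyDecisionPoint:
    """Central engine: answers allow/deny and never touches traffic itself."""

    def __init__(self, allowed: FrozenSet[Tuple[str, str, str, str]]):
        self._allowed = allowed

    def decide(self, req: Request) -> bool:
        # Default deny: anything not explicitly listed is rejected.
        return (req.caller, req.target, req.method, req.path) in self._allowed


class PolicyEnforcementPoint:
    """Sits in the request path (e.g. a sidecar): asks the PDP, then forwards or rejects."""

    def __init__(self, pdp: PolicyDecisionPoint, upstream: Callable[[Request], str]):
        self._pdp = pdp
        self._upstream = upstream

    def handle(self, req: Request) -> Tuple[int, str]:
        if not self._pdp.decide(req):
            return 403, "denied by policy"
        return 200, self._upstream(req)
```

Because the enforcement point only knows how to ask and obey, policy can change in one place without redeploying the services behind it.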

System Flow

flowchart TD
    ServiceA[Service A] -->|1. Outbound request| SidecarA[Sidecar Proxy A]
    SidecarA -->|2. mTLS handshake, encrypted tunnel| SidecarB[Sidecar Proxy B]
    SidecarB -->|3. Authz Request| PDP[Policy Decision Point]
    PDP -->|4. Allow or Deny| SidecarB
    SidecarB -->|5. If Allowed| ServiceB[Service B]

Sidecar proxies intercept inter-service traffic, establish an encrypted mTLS tunnel, and validate authorization decisions against a Policy Decision Point.

Real-World Examples (Indicative)

Google BeyondProd

Every Google inter-service RPC carries a LOAS (Low Overhead Authentication System) token—a 10-minute cryptographically signed credential representing the calling service's workload identity. The rpc-security sidecar validates the token against a policy table before forwarding the call. Even traffic on Google's internal fiber backbone requires mTLS, and chain-of-custody headers in each RPC capture privilege-escalation paths for audit. BeyondProd eliminated the concept of a "trusted internal zone" entirely—no service is granted access by network location alone.

Cloudflare Access

In 2019, Cloudflare replaced its corporate VPN for 1,500+ employees using its own Access product. Every request to an internal application is checked against Okta SSO identity, Cloudflare Gateway device posture (OS version, disk encryption status), and geolocation. Authenticated requests carry a signed CF-Access-JWT-Assertion header that internal apps verify locally—no VPN tunnel is required. Remote access is ~25% faster than the previous VPN due to Anycast routing to the nearest Cloudflare PoP rather than backhauling traffic to a central VPN server.

HashiCorp Vault Dynamic Credentials

Vault's database secrets engine issues PostgreSQL credentials with a 1-hour TTL. When the lease expires, Vault revokes the credentials at the database level by running the role's configured revocation statements. The PKI secrets engine issues TLS certificates with 24-hour TTLs, rotated automatically by the Vault Agent sidecar. Applications never store long-lived credentials: the sidecar delivers short-lived secrets to the application at startup (for example as environment variables or rendered files) and handles renewal transparently, with revocation propagating to all consumers within seconds.
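The lease lifecycle described here (issue with a TTL, reject after expiry, revoke on a sweep) can be modelled without any real secrets engine. The sketch below is a toy stand-in and deliberately not Vault's API: `ToySecretsEngine`, `Lease`, and the caller-supplied clock value are all hypothetical names introduced for illustration.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Lease:
    username: str
    password: str
    expires_at: float  # seconds on the caller-supplied clock


class ToySecretsEngine:
    """Toy model of a dynamic-secrets engine; NOT Vault's API."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._active: Dict[str, Lease] = {}

    def issue(self, role: str, now: float) -> Lease:
        # Each caller gets a unique, random credential instead of a shared password.
        lease = Lease(
            username=f"{role}-{secrets.token_hex(4)}",
            password=secrets.token_urlsafe(24),
            expires_at=now + self.ttl,
        )
        self._active[lease.username] = lease
        return lease

    def is_valid(self, username: str, now: float) -> bool:
        lease = self._active.get(username)
        return lease is not None and now < lease.expires_at

    def sweep(self, now: float) -> List[str]:
        """Revoke every expired lease; a real engine would also remove the DB user here."""
        expired = [u for u, lease in self._active.items() if now >= lease.expires_at]
        for username in expired:
            del self._active[username]
        return expired
```

Passing `now` explicitly keeps the model deterministic; a leaked credential from this engine is useful to an attacker for at most one TTL.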

Anti-Patterns

IP-Based Authorization

Restricting access to internal services based on IP addresses or CIDR blocks, which are easily spoofed and highly fragile in dynamic, auto-scaling cloud environments.
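The alternative to IP allowlists is to authorize on the identity proven during the mTLS handshake. The sketch below assumes workload identities are carried as SPIFFE-style URIs in the certificate's Subject Alternative Name, and that the certificate is available as a dict shaped like the return value of Python's `ssl.SSLSocket.getpeercert()`. The function names and example IDs are hypothetical.

```python
from typing import Mapping, Optional, Set


def spiffe_id(peer_cert: Mapping) -> Optional[str]:
    """Return the first spiffe:// URI SAN from a getpeercert()-style dict, if any."""
    for kind, value in peer_cert.get("subjectAltName", ()):
        if kind == "URI" and value.startswith("spiffe://"):
            return value
    return None


def is_authorized(peer_cert: Mapping, allowed_ids: Set[str]) -> bool:
    """Authorize on the verified workload identity, never on the source IP."""
    identity = spiffe_id(peer_cert)
    return identity is not None and identity in allowed_ids


# Example: the caller's IP can change on every reschedule; its identity does not.
cert = {"subjectAltName": (("DNS", "billing.internal"),
                           ("URI", "spiffe://example.org/ns/prod/sa/billing"))}
allowed = {"spiffe://example.org/ns/prod/sa/billing"}
decision = is_authorized(cert, allowed)
```

This check is only meaningful when the certificate has already been validated against a trusted CA by the TLS layer.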

Manual Certificate Management

Managing mTLS certificates by hand or setting long expiration times (e.g., 1 year), which makes revocation slow and unreliable and lengthens the window in which a leaked private key can be abused.
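Automated rotation usually reduces to a simple rule that an agent evaluates on a timer: renew once a fixed fraction of the certificate's lifetime has elapsed, well before expiry. A minimal sketch, where the function name and the 50% default threshold are assumptions rather than any particular tool's behaviour:

```python
from datetime import datetime, timedelta, timezone


def should_rotate(not_before: datetime, not_after: datetime, now: datetime,
                  renew_fraction: float = 0.5) -> bool:
    """True once `renew_fraction` of the certificate's validity period has elapsed."""
    lifetime = not_after - not_before
    return now >= not_before + lifetime * renew_fraction


# Example: a 24-hour certificate becomes due for renewal 12 hours after issuance.
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(hours=24)
due_at_11h = should_rotate(issued, expires, issued + timedelta(hours=11))
due_at_12h = should_rotate(issued, expires, issued + timedelta(hours=12))
```

Renewing early leaves the remaining lifetime as a buffer, so a temporary outage of the issuing CA does not immediately take services offline.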

Hardcoded Credentials

Storing API keys, database passwords, or private keys in source code or configuration files instead of using dynamic, short-lived credentials from a secrets engine.

Ignoring Lateral Movement

Securing the external API Gateway but leaving internal databases and message queues completely unauthenticated and unencrypted.

Design Tradeoffs

| Dimension | Zero Trust (mTLS + Policy Engine) | Perimeter Security (VPN + Firewall) |
| --- | --- | --- |
| Blast radius | Compromised service is isolated; workload-scoped mTLS certificates prevent lateral movement to other services | Single perimeter breach exposes the entire internal network; attackers move freely between services |
| Latency overhead | 1-5 ms added when a connection is first established (mTLS handshake plus an uncached policy evaluation); ~0.5 ms per RPC thereafter, with policy results cached | Near-zero internal latency; trusted packets are routed directly without per-request authentication checks |
| Operational cost | Requires automated CA (SPIRE), policy engine (OPA), and certificate rotation pipelines across every service | Simple firewall rules and a VPN gateway; standard networking tooling suffices |
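The latency figures are easier to reason about as a one-time setup cost amortized over the RPCs that reuse a connection. The back-of-envelope sketch below assumes the 1-5 ms is paid once per connection and the ~0.5 ms on every RPC; that reading, and the function name, are assumptions rather than measured behaviour.

```python
def amortized_overhead_ms(setup_ms: float, per_rpc_ms: float,
                          rpcs_per_connection: int) -> float:
    """Average added latency per RPC when one handshake is shared by many RPCs."""
    if rpcs_per_connection < 1:
        raise ValueError("rpcs_per_connection must be at least 1")
    return setup_ms / rpcs_per_connection + per_rpc_ms


# Worst-case 5 ms setup, 0.5 ms ongoing:
single = amortized_overhead_ms(5.0, 0.5, 1)      # no connection reuse
pooled = amortized_overhead_ms(5.0, 0.5, 100)    # 100 RPCs per connection
```

Under these assumptions, connection reuse matters far more than handshake tuning: the setup cost all but disappears once a connection carries a few dozen RPCs.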

Best Practices

Automate Certificate Rotation

Use SPIFFE/SPIRE or HashiCorp Vault to issue short-lived certificates (valid for hours, not years) and rotate them automatically without service downtime.

Decouple Policy from Code

Use Open Policy Agent (OPA) or service mesh authorization policies to define access rules declaratively, keeping application code free of security logic.

Enforce Least Privilege

Define strict, granular ACLs, e.g., Service A can only perform GET on /users/:id of Service B, and nothing else.

Encrypt Data in Transit and at Rest

Ensure all inter-service traffic is encrypted via mTLS, and all databases use transparent data encryption (TDE).

Log and Audit Everything

Collect detailed telemetry on all access attempts, both successful and denied, and feed them into a centralized SIEM for real-time anomaly detection.
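The least-privilege example above (Service A may only GET /users/:id on Service B) can be expressed as a small default-deny rule check. This is a sketch under assumptions: the rule tuple layout, the `:name` placeholder semantics, and the function names are illustrative, and a production setup would normally express the same rule in a policy engine or service-mesh policy rather than in application code.

```python
from typing import Iterable, Tuple

Rule = Tuple[str, str, str]  # (caller identity, HTTP method, path pattern)


def path_matches(pattern: str, path: str) -> bool:
    """Match '/users/:id' against '/users/42'; a ':name' segment matches one non-empty segment."""
    want = pattern.strip("/").split("/")
    got = path.strip("/").split("/")
    if len(want) != len(got):
        return False
    return all(w == g or (w.startswith(":") and g != "") for w, g in zip(want, got))


def is_allowed(rules: Iterable[Rule], caller: str, method: str, path: str) -> bool:
    """Default deny: a request passes only if some rule matches all three fields."""
    return any(c == caller and m == method and path_matches(p, path)
               for c, m, p in rules)


rules = [("service-a", "GET", "/users/:id")]
```

Each call to `is_allowed`, whether it returns True or False, is also a natural point to emit an audit record with the caller, method, path, and decision.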

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| You operate in highly regulated industries (finance, healthcare) with strict compliance requirements. | You are a seed-stage startup focused on rapid prototyping and finding product-market fit. |
| You run a highly distributed microservices architecture across multiple cloud providers or hybrid environments. | You run a simple monolithic application inside a single, secure VPC with minimal external integrations. |
| You have a large, remote workforce accessing internal systems from various devices and networks. | Your application has ultra-low latency requirements (e.g., high-frequency trading) where cryptographic overhead is unacceptable. |