Load Balancing and Horizontal Scaling for Interviews
L4 vs L7 load balancers, round-robin vs consistent hashing, health checks, auto-scaling, and how to explain scaling a stateless API tier in system design interviews.
Almost every system design ends with "we put a load balancer in front." Interviewers want to know you understand what that actually does — and when horizontal scaling stops helping because state is stuck on one machine. This article covers the vocabulary and trade-offs you need, tied to designs like our rate limiter and URL shortener.
Vertical vs horizontal scaling
| Approach | How | Limit |
|---|---|---|
| Vertical (scale up) | Bigger CPU/RAM on one server | Hardware ceiling, single point of failure |
| Horizontal (scale out) | More identical servers behind a load balancer | Requires stateless app tier or shared state layer |
Start vertical for simplicity. Scale horizontally when CPU load or connection count exceeds what one box can handle. Databases scale differently (read replicas, sharding), as covered briefly below.
Load balancer responsibilities
- Distribute incoming requests across healthy backends.
- Terminate TLS (optional — also done at CDN edge).
- Health checks: remove unhealthy instances from the pool.
- Sticky sessions when needed (WebSocket, session in memory — avoid if possible).
- Single DNS entry for clients: api.example.com → LB → N app servers.
Layer 4 vs Layer 7
| Layer | Operates on | Use when |
|---|---|---|
| L4 (transport) | IP + TCP port | Raw throughput, WebSocket TCP proxy, gaming |
| L7 (application) | HTTP headers, path, cookies | Route /api to API fleet, /static to CDN, canary by header |
Most REST APIs use L7 (nginx, HAProxy, AWS ALB). High-connection chat gateways may combine L4 for connection distribution with L7 routing for HTTP APIs.
Load balancing algorithms
| Algorithm | Behaviour | Notes |
|---|---|---|
| Round robin | Each server gets next request | Ignores server load |
| Weighted round robin | More traffic to bigger instances | Good for mixed instance sizes |
| Least connections | Send to server with fewest open connections | Better for long-lived requests |
| Consistent hashing | Same key → same server | Used for caches and sharded data; minimizes remapping on node add/remove |
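The first three algorithms in the table can be sketched in a few lines. This is a minimal illustration rather than how any particular load balancer implements them: `Backend`, its `weight` and `open_connections` fields, and the naive (non-interleaved) weighted rotation are assumptions made for the example.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Backend:
    # Hypothetical backend record; field names are illustrative only.
    name: str
    weight: int = 1            # relative capacity, used by weighted round robin
    open_connections: int = 0  # current load, used by least connections


def round_robin(backends):
    """Rotate through backends in a fixed order, ignoring their load."""
    return itertools.cycle(backends)


def weighted_round_robin(backends):
    """Naive weighting: each backend appears `weight` times per rotation."""
    expanded = [b for b in backends for _ in range(b.weight)]
    return itertools.cycle(expanded)


def least_connections(backends):
    """Pick the backend with the fewest open connections right now."""
    return min(backends, key=lambda b: b.open_connections)


pool = [Backend("a", weight=1), Backend("b", weight=3)]

rr = round_robin(pool)
rr_order = [next(rr).name for _ in range(4)]

wrr = weighted_round_robin(pool)
wrr_order = [next(wrr).name for _ in range(4)]

pool[0].open_connections = 5
pool[1].open_connections = 2
lc_choice = least_connections(pool).name

print(rr_order, wrr_order, lc_choice)
```

Production balancers usually interleave weighted picks more smoothly than this, but the trade-off is the same: round robin is cheap and blind to load, while least connections needs per-backend state.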
Where consistent hashing fits
When you shard a cache or rate limiter across Redis nodes, consistent hashing keeps most keys on the same node when you add a shard. Mention this when discussing distributed rate limiting.
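A small hash ring makes the "most keys stay put" claim concrete. The sketch below assumes MD5 as the ring hash, 100 virtual nodes per shard, and made-up node names such as `redis-1`. None of these come from a specific Redis client.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring (MD5 is an arbitrary choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Illustrative consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted list of (ring position, node name)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node is placed at `vnodes` pseudo-random positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._points, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._points = [p for p in self._points if p[1] != node]

    def node_for(self, key):
        """Return the first node clockwise from the key's ring position."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect(self._points, (_hash(key), ""))
        return self._points[idx % len(self._points)][1]


ring = HashRing(["redis-1", "redis-2", "redis-3"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add("redis-4")
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding a shard")
```

Every key that moves lands on the new shard, and the expected share is roughly one quarter when going from three nodes to four. With plain `hash(key) % N`, most keys would change owner instead.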
Stateless application tier
To scale app servers horizontally, session data must not live only in server RAM. Store sessions in Redis, use JWTs, or pass user context from the API gateway. Then any server can handle any request. This is the pattern behind most microservice fleets.
- Client → DNS → Load Balancer → App Server (any instance).
- App server reads/writes shared DB or cache.
- No affinity required unless WebSocket (then sticky by user_id or dedicated connection registry).
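The flow above can be sketched with two app servers that share one session store. `SessionStore` is a plain dict standing in for Redis, and `AppServer`, `login`, and `whoami` are hypothetical names chosen for the illustration.

```python
class SessionStore:
    """Stand-in for a shared store such as Redis; here just a dict."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id)

    def set(self, session_id, value):
        self._data[session_id] = value


class AppServer:
    """An app instance that keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # shared across all instances

    def login(self, session_id, user):
        self.store.set(session_id, {"user": user})
        return f"{self.name}: logged in {user}"

    def whoami(self, session_id):
        session = self.store.get(session_id)
        if session is None:
            return f"{self.name}: anonymous"
        return f"{self.name}: hello {session['user']}"


store = SessionStore()
app_1 = AppServer("app-1", store)
app_2 = AppServer("app-2", store)

app_1.login("sid-1", "ana")          # request 1 lands on app-1
reply = app_2.whoami("sid-1")        # request 2 lands on app-2
print(reply)
```

Because the session lives in the shared store, the load balancer is free to send the second request to a different instance.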
Auto-scaling
Cloud auto-scaling groups add/remove instances based on CPU, request rate, or queue depth. Scale-out is fast (minutes); scale-in should drain connections first. Mention cold start: new instances need warmup before taking full traffic — use health checks that verify app readiness, not just TCP listen.
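One common way to reason about a scaling decision is a proportional rule: grow or shrink the fleet by the ratio of the observed metric to its target. The sketch below is an assumption for illustration, not any specific cloud provider's algorithm. The function name and the `min_size`/`max_size` bounds are hypothetical.

```python
import math


def desired_instances(current, metric, target, min_size=2, max_size=50):
    """Proportional scaling: keep the per-instance metric near `target`.

    current: instances running now
    metric:  observed average (e.g. CPU percent or requests/sec per instance)
    target:  value the fleet should hover around
    """
    raw = math.ceil(current * metric / target)
    return max(min_size, min(max_size, raw))


scale_out = desired_instances(current=10, metric=90, target=60)   # CPU too hot
scale_in = desired_instances(current=10, metric=30, target=60)    # CPU idle
floor = desired_instances(current=4, metric=10, target=60)        # clamped to min
ceiling = desired_instances(current=40, metric=120, target=60)    # clamped to max

print(scale_out, scale_in, floor, ceiling)
```

The minimum bound keeps redundancy during quiet periods, and the maximum bound caps cost if the metric misbehaves.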
Scaling the data layer
- Read replicas: scale reads; writes still go to primary.
- Sharding: partition data by user_id hash across DB nodes.
- Connection pooling: PgBouncer between thousands of app servers and limited DB connections.
- CDN: scale static assets and cacheable GET responses at the edge.
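Sharding by user_id hash, from the list above, reduces to a stable hash followed by a modulo. This is a minimal sketch: SHA-256 and the `shard_for` name are assumptions, and a stable hash is used because Python's built-in `hash()` on strings varies between processes.

```python
import hashlib


def shard_for(user_id: str, shard_count: int) -> int:
    """Return the shard index in [0, shard_count) that owns this user."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


shards_used = {shard_for(f"user-{i}", 8) for i in range(1000)}
print(sorted(shards_used))
```

Modulo sharding is simple, but changing `shard_count` remaps most users, which is why resharding is expensive and why ring-based schemes are used where nodes change often.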
Failure modes
| Scenario | Mitigation |
|---|---|
| Single LB dies | DNS to multiple LBs or cloud-managed redundant LB |
| Thundering herd on scale-out | Jittered backoff, cache warming, gradual traffic shift |
| Uneven shard load | Resharding, virtual nodes in consistent hashing |
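The jittered backoff in the table can be sketched as exponential backoff with a random delay below the ceiling, often called "full jitter". The `base` and `cap` values here are illustrative assumptions, not recommendations.

```python
import random


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles per attempt up to `cap`; the actual delay is drawn
    uniformly below it so that clients do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


rng = random.Random(0)  # seeded only to make the demo repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(10)]
print([round(d, 3) for d in delays])
```

Spreading retries out this way keeps a fleet of clients or freshly started instances from hitting a cold cache or database at the same instant.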
DNS and the entry point
Clients resolve api.example.com via DNS. DNS may return a single load balancer VIP, several A records for redundancy, or different records per region for geo routing. TTL matters: a low TTL enables faster failover; a high TTL reduces DNS load. The load balancer is the traffic cop; DNS is the address book. Mention both when drawing the first box in a diagram.
Health checks in depth
Liveness checks: is the process up? Readiness checks: can this instance accept traffic (DB connected, cache warmed)? Fail the readiness check on purpose to drain an instance before a deploy. Active health probes hit an endpoint such as GET /health on a fixed interval (for example, every 5s); passive checks mark an instance unhealthy after N consecutive 5xx responses. Together these prevent routing to a server that listens on port 80 but cannot serve requests.
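A passive check is essentially a counter of consecutive failures. The sketch below assumes a failure threshold of 3 and adds a recovery threshold of 2 consecutive successes. The recovery rule and the `PassiveHealth` name are assumptions for illustration, since the text above only describes the failure side.

```python
class PassiveHealth:
    """Track one backend's health from the responses it returns."""

    def __init__(self, fail_threshold=3, recover_threshold=2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.healthy = True
        self._fails = 0  # consecutive 5xx responses
        self._oks = 0    # consecutive non-5xx responses

    def record(self, status_code):
        """Record one response and return the resulting health state."""
        if status_code >= 500:
            self._fails += 1
            self._oks = 0
            if self._fails >= self.fail_threshold:
                self.healthy = False
        else:
            self._oks += 1
            self._fails = 0
            if self._oks >= self.recover_threshold:
                self.healthy = True
        return self.healthy


backend = PassiveHealth()
history = [backend.record(code) for code in (500, 500, 200, 500, 500, 500, 200, 200)]
print(history)
```

Requiring consecutive failures avoids ejecting a backend over a single bad response, and requiring consecutive successes avoids flapping when it comes back.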
Deploying without downtime
- Rolling deploy: replace instances one at a time behind the LB.
- Blue-green: switch LB target group from old fleet to new fleet atomically.
- Canary: route 5% of traffic to new version; watch error rate before full cutover.
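Canary routing is often done by hashing a stable identifier into 100 buckets and sending the lowest buckets to the new version. The sketch below assumes SHA-256 and a hypothetical `pick_version` helper. Real load balancers expose this as weighted target groups or header rules rather than application code.

```python
import hashlib


def pick_version(request_key: str, canary_percent: int = 5) -> str:
    """Deterministically route a key (e.g. a user ID) to 'canary' or 'stable'."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return "canary" if bucket < canary_percent else "stable"


choices = [pick_version(f"user-{i}") for i in range(10_000)]
canary_share = choices.count("canary") / len(choices)
print(f"canary share: {canary_share:.1%}")
```

Hashing on a user ID instead of picking randomly per request keeps each user on one version, which makes error reports easier to attribute.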
These patterns apply to the stateless tier in our news feed and chat designs — WebSocket gateways need connection draining on scale-in, which is harder than stateless HTTP.
When horizontal scaling does not help
Adding app servers does not fix a saturated primary database or a single hot Redis key. Identify the bottleneck first: CPU on app tier → scale out; disk I/O on DB → replicas, sharding, or cache; one hot partition key → resharding. Say this explicitly in interviews to avoid the "just add more servers" trap.
Worked example: scaling the URL shortener
Start with one app server and PostgreSQL. Traffic grows → add an L7 load balancer and N stateless app instances. Read QPS still spikes on redirects → add Redis cache-aside. DB reads saturate → add read replicas. Write QPS exceeds primary capacity → shard url_mappings by hash of short_code. Each step solves a measured bottleneck, and that narrative is what interviewers want.
| Bottleneck signal | Scale lever |
|---|---|
| App CPU > 70% sustained | Horizontal scale behind LB |
| DB read latency on redirects | Redis + read replicas |
| DB write TPS on create | Sharding or async write queue |
| Single Redis memory limit | Redis Cluster with consistent hashing |
| Global latency | CDN for redirects + regional caches |
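The cache-aside step for redirects can be sketched as follows. Dicts stand in for Redis and PostgreSQL, and `RedirectService`, `resolve`, and the `db_reads` counter are hypothetical names used only to show the read path.

```python
class RedirectService:
    """Cache-aside lookup: check the cache, fall back to the DB, then fill the cache."""

    def __init__(self, db):
        self.db = db        # stand-in for PostgreSQL: short_code -> long_url
        self.cache = {}     # stand-in for Redis
        self.db_reads = 0   # counts how often the DB is actually hit

    def resolve(self, short_code):
        url = self.cache.get(short_code)
        if url is not None:
            return url                      # cache hit: no DB work
        self.db_reads += 1
        url = self.db.get(short_code)       # cache miss: read from the DB
        if url is not None:
            self.cache[short_code] = url    # populate for later requests
        return url


service = RedirectService({"abc123": "https://example.com/long/path"})
first = service.resolve("abc123")   # miss, reads DB
second = service.resolve("abc123")  # hit, served from cache
print(first, service.db_reads)
```

This sketch omits TTLs, invalidation, and caching of misses. An unknown short code hits the DB on every request, which is worth calling out if abuse is a concern.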
API gateway vs load balancer
A load balancer distributes traffic to identical app servers. An API gateway adds cross-cutting concerns: authentication, rate limiting, request validation, TLS termination, and version routing such as /v1 vs /v2. In diagrams, the gateway sits between the LB and the services, or the two roles are drawn as one managed entry point (on AWS, for example, an ALB paired with API Gateway). Say "gateway enforces auth and rate limits; LB balances healthy backends."
Sticky sessions and WebSockets
Plain HTTP requests should be stateless: any server handles any request. WebSocket connections are long-lived, stateful TCP connections. Options: (1) sticky sessions keyed by a user_id cookie so the same user lands on the same chat server, (2) a connection registry in Redis so any server can find which node holds any user's socket, as in our chat design. Prefer (2) at scale: sticky routing breaks when an instance dies mid-connection, while with a registry the client reconnects to any node and re-registers.
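Option (2) can be sketched with a registry that maps each user to the gateway holding their socket. A dict stands in for Redis, a list stands in for each socket, and `ConnectionRegistry` and `Gateway` are hypothetical names for the illustration.

```python
class ConnectionRegistry:
    """Stand-in for a Redis map of user_id -> gateway that owns the socket."""

    def __init__(self):
        self._owner = {}

    def register(self, user_id, gateway):
        self._owner[user_id] = gateway

    def unregister(self, user_id, gateway):
        # Only remove the entry if this gateway still owns the user.
        if self._owner.get(user_id) is gateway:
            del self._owner[user_id]

    def locate(self, user_id):
        return self._owner.get(user_id)


class Gateway:
    def __init__(self, name, registry):
        self.name = name
        self.registry = registry
        self.sockets = {}  # user_id -> delivered messages (stand-in for a socket)

    def connect(self, user_id):
        self.sockets[user_id] = []
        self.registry.register(user_id, self)

    def disconnect(self, user_id):
        self.sockets.pop(user_id, None)
        self.registry.unregister(user_id, self)

    def send(self, to_user, message):
        owner = self.registry.locate(to_user)
        if owner is None:
            return False                          # recipient is offline
        owner.sockets[to_user].append(message)    # a network hop in a real system
        return True


registry = ConnectionRegistry()
gw_1, gw_2 = Gateway("gw-1", registry), Gateway("gw-2", registry)
gw_1.connect("alice")
gw_2.connect("bob")
delivered = gw_1.send("bob", "hi")  # alice's gateway finds bob on gw-2
print(delivered, gw_2.sockets["bob"])
```

If a gateway dies, its users reconnect to any other node and re-register, so routing does not depend on the load balancer remembering an affinity.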
Capacity back-of-envelope
One modest app server handles ~1,000–5,000 simple HTTP req/sec depending on work per request. If peak is 50,000 req/sec and each server sustains 2,500, you need 50,000 / 2,500 = 20 instances, plus headroom for deploys and failures. WebSocket servers are connection-bound: 50K concurrent sockets per box is a planning number, so 10 million users online at once means about 200 gateway nodes. State these assumptions aloud in interviews.
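The arithmetic is simple enough to script. In the sketch below, the 25% headroom figure and the 10 million concurrent users are assumptions chosen for illustration. The 50,000 req/sec, 2,500 req/sec, and 50K sockets per box come from the paragraph above.

```python
import math


def instances_needed(peak_load, per_instance_capacity, headroom=0.25):
    """Instances required to serve `peak_load` with spare capacity.

    headroom=0.25 means provisioning for 125% of the expected peak.
    """
    return math.ceil(peak_load * (1 + headroom) / per_instance_capacity)


api_bare = instances_needed(50_000, 2_500, headroom=0)      # no spare capacity
api_with_headroom = instances_needed(50_000, 2_500)         # 25% headroom
ws_gateways = instances_needed(10_000_000, 50_000, headroom=0)

print(api_bare, api_with_headroom, ws_gateways)
```

Stating the headroom as an explicit parameter makes it easy to defend the number when the interviewer changes the peak.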
Latency budget through the stack
| Hop | Typical cost |
|---|---|
| DNS lookup (cached) | 0–5ms |
| TLS handshake (first request) | 20–50ms |
| Load balancer | 1–5ms |
| App server logic | 2–20ms |
| Redis round-trip | 0.5–2ms |
| DB replica query (cache miss) | 5–20ms |
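On a warm connection, a cached redirect (DNS, load balancer, app logic, Redis) adds up to at most about 32ms from the table above, so a 50ms p99 target is realistic. The first request also pays the TLS handshake, and uncached paths must budget for the DB query. If your diagram has six network hops, call out which you would eliminate with an edge cache.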
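Summing the ranges in the table gives a rough budget per path. The sketch treats hops as strictly additive and assumes a cache miss still pays the Redis round-trip before the DB query. Both are simplifications, and the dictionary keys are made-up labels.

```python
# (low, high) cost in milliseconds per hop, copied from the table above.
HOPS_MS = {
    "dns_cached": (0, 5),
    "tls_handshake": (20, 50),
    "load_balancer": (1, 5),
    "app_logic": (2, 20),
    "redis": (0.5, 2),
    "db_replica": (5, 20),
}


def budget(path):
    """Return (best case, worst case) latency in ms for a list of hop names."""
    low = sum(HOPS_MS[hop][0] for hop in path)
    high = sum(HOPS_MS[hop][1] for hop in path)
    return low, high


warm_cached = budget(["dns_cached", "load_balancer", "app_logic", "redis"])
cold_uncached = budget(list(HOPS_MS))  # every hop, including TLS and the DB

print("warm cached redirect:", warm_cached)
print("cold uncached request:", cold_uncached)
```

The gap between the two paths is the argument for connection reuse and an edge cache: both remove whole rows from the sum.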
What to say in the last five minutes
Summarise: "Stateless app tier behind L7 LB, auto-scale on CPU or RPS, shared Redis and DB for state, read replicas for read-heavy paths, consistent hashing if we shard caches. WebSockets need a connection registry instead of blind round-robin." That covers most scaling follow-ups.
Mock interview checklist
- Distinguished vertical vs horizontal scaling with a real trigger (CPU, connections).
- Named L7 for HTTP and explained when L4 matters.
- Described stateless tier and where state lives (Redis, DB).
- Mentioned health checks (liveness vs readiness).
- Explained what happens when scaling app servers does not fix DB saturation.
- Referenced a case study: URL shortener, feed, or rate limiter.
Closing summary
Say L7 LB for HTTP APIs, stateless app tier, shared Redis/DB for state, auto-scale on CPU or RPS, and consistent hashing when you shard. That one paragraph satisfies most "how do you scale this?" follow-ups.