DDSA Solutions
Fundamentals6 min read·

Database Sharding and Replication for Interviews

How to shard relational and NoSQL databases: shard keys, consistent hashing, read replicas, failover, and rebalancing in system design interviews.

Single Postgres instance stops scaling around tens of thousands of writes per second. Sharding splits data across machines; replication copies it for read scale and failover. You will need both on almost every large case study — this article explains how to draw them without hand-waving.

Replication basics

  • Primary-replica (leader-follower): writes to primary; replicas async or sync replicate.
  • Read replicas: scale SELECT traffic; replication lag means stale reads.
  • Failover: promote replica to primary on leader death — risk of lost async writes.
  • Multi-primary: writes to any node — conflict resolution required (harder).

Interview default

Say single primary + N read replicas for v1. Mention sync replica for zero data loss on failover if the prompt is financial.

When to shard

Shard when vertical scale (bigger machine) fails or is too expensive. Signals: disk full, write IOPS maxed, backup restore too slow. SQL vs NoSQL: shard relational by tenant or user_id; Cassandra shards by partition key natively.

Choosing a shard key

Shard keyProsCons
user_idUser data colocated; no cross-shard user queries in v1Hot users create hot shards
tenant_idB2B isolation; easy per-customer exportWhale tenant overloads one shard
hash(user_id)Even spreadRange queries across users hard
geo regionData residency complianceUneven population density

Good shard key: high cardinality, even distribution, queries mostly include it. Bad: country alone (India shard huge), boolean flags.

Routing layer

Application or proxy (Vitess, Citus, custom) maps shard_key → physical shard. `shard = hash(user_id) % num_shards`. Store mapping in config service; bump num_shards only with rebalancing plan. Consistent hashing with virtual nodes reduces data movement when adding shards.

Cross-shard operations

  • Avoid in v1 — design around single-shard queries.
  • Global secondary index: index table on each shard or separate index service (expensive).
  • Scatter-gather: query all shards, merge — high latency; use for admin only.
  • Two-phase commit across shards — slow; avoid unless critical.

Worked example: sharded user posts

  1. Shard by user_id hash into 64 Postgres clusters.
  2. Post insert includes user_id — routes to one shard.
  3. Feed read for user A: fan-out table on A's shard only if followers stored per user.
  4. Celebrity with hot shard: split read replicas; cache; or dedicated shard override.

Rebalancing

Adding shards: consistent hashing moves only K/N keys. Dual-write old+new during migration; backfill; cut read traffic; drop old. Plan maintenance window or online migration tools. Never reshuffle without copy — downtime unacceptable at scale.

Read replica lag

Replica lag 100ms–seconds. After user writes profile, read-your-writes: route to primary or sticky session. Analytics on replicas OK with stale data. Monitor replication lag alert — lag spike means replica unfit for failover.

Capacity back-of-envelope

1 TB data, 64 shards ≈ 16 GB each — fits SSD comfortably. 10K writes/sec total ÷ 64 ≈ 156 writes/sec per shard — easy for Postgres. Hot shard problem: 1% users causing 50% traffic — detect via per-shard metrics; split hot key or add cache.

Failure modes

FailureMitigation
Primary downAuto failover to sync replica; brief write outage
Replica lag extremeStop routing reads to that replica
Shard unreachablePartial outage — only users on that shard affected
Wrong shard key choicePainful migration — clarify key early in design

Hot shard detection and mitigation

Advertisement
  • Per-shard QPS metrics — alert on 3× median.
  • Split hot shard into sub-shards (re-hash with salt prefix).
  • Aggressive cache for celebrity user rows.
  • Read replicas on hot shard only.

Tools interviewers recognize

Vitess (YouTube) shards MySQL with routing layer. Citus distributes Postgres. DynamoDB/Cassandra shard by partition key natively. CockroachDB offers geo-distributed SQL with tunable survival goals. Name one if relevant — then explain shard key, not logo dropping.

What to say in the last five minutes

Replication first for read scale; shard when writes or disk break single node. Shard key = user_id hash. No cross-shard joins in v1. Consistent hashing for growth. Read replicas with lag awareness.

Sample opening (first three minutes)

Interviewer: "Your database is the bottleneck." You: "I would start with read replicas for read-heavy traffic and monitor write headroom. If writes exceed single-node capacity, shard by user_id hash with a routing layer. I avoid cross-shard joins in v1 and plan consistent hashing so we can add shards without full reshuffle."

Replication lag math

Primary writes 5K/sec; async replica 100ms lag → replica may be 500 rows behind on read. For social timeline served from replica, user might miss own post for 100ms — fix with read-your-writes routing. For notification inbox, same pattern.

Shard count planning

FactorGuidance
Start small8–16 shards; double when per-shard CPU > 60% sustained
Too many shardsOps overhead; empty shards waste connections
Too fewHot spots; migration pain later
GrowthPlan 2× headroom; consistent hashing eases add

Backup and restore at shard scale

Per-shard backups parallelize restore time. Global logical backup needs coordinated point-in-time — harder. SQL vs NoSQL: Cassandra snapshots per node; Postgres pg_dump per shard. Test restore quarterly — interview mention shows ops maturity.

Global tables and reference data

Small reference tables (country codes, fare rules) replicate to every shard or live in separate global DB with cache. Avoid joining user shard to global on every query — cache reference data in app memory with TTL.

Worked example: [file storage](/system-design/design-file-storage-dropbox) metadata

File metadata (name, ACL) lives on shard by user_id — list folder is single-shard query. Blob bytes in S3 are not sharded with metadata — global object store. Cross-user share: sharee_id looks up file_id in global share index (small table) then routes to owner shard for ACL check. Say this when combining sharding with case studies.

Connection pooling per shard

64 shards × 100 app connections each = 6400 DB connections — use PgBouncer per shard or proxy layer. Connection storms on deploy — ramp slowly. Interviewers at FAANG care about this ops detail after you nail shard key.

Follower reads and staleness SLAs

Define explicit SLAs: "replica reads may lag 500ms" on product pages; "payment reads always from primary." Document in API contracts. Auto-failover promotes replica — verify sync replication or accept small loss window and say so aloud.

Mock interview checklist

  1. Explained primary-replica vs sharding purpose.
  2. Named shard key and justified even distribution.
  3. Mentioned read replica lag and read-your-writes.
  4. Avoided cross-shard transactions in v1.
  5. Described adding shards / rebalancing at high level.

Closing summary

Replication scales reads and adds failover; sharding scales writes and storage. Pick the shard key carefully — it is the hardest thing to change later.

More in this series