Design a Chat / Messaging System (WhatsApp / Slack DM)
System design for real-time chat: WebSockets vs polling, message storage, delivery guarantees, online presence, and group chat scaling.
Chat systems combine a classic CRUD problem (store messages) with a real-time delivery problem (get them to the right device now). Interviewers want to see you separate one-to-one chat from group chat, and to discuss what "delivered" and "read" actually mean. Start with the interview framework — clarify requirements before drawing WebSocket boxes everywhere.
Requirements
- One-to-one and group conversations.
- Send text messages; media as optional v2.
- Delivery states: sent, delivered, read (read receipts).
- Show online / last-seen presence.
- Message history when user opens app (paginated).
- Push notification when recipient is offline.
Clarify scale
Slack-style workplace chat (thousands of users per org) differs from WhatsApp-scale (billions of users). Ask about daily active users, messages per second, and maximum group size. A 500-person group changes fan-out completely.
Capacity estimation
Assume 500M DAU, each sending 40 messages/day → 20B messages/day ≈ 230,000 writes/sec average, ~1M/sec peak (roughly a 4x peak-to-average ratio). Storage: 200 bytes of metadata per message ≈ 4TB/day raw, before replication and compaction. Peak concurrent connections: if 20% of DAU are online at once, that is 100M WebSockets, so plan for roughly 2,000 gateway nodes at ~50K connections each. State these numbers before drawing boxes.
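The arithmetic above can be checked in a few lines of Python. This is a rough sketch: every input is one of the assumptions stated in the paragraph (500M DAU, 40 messages per user per day, 200 bytes per message, 20% online, 50K connections per node), and the variable names are illustrative.

```python
# Back-of-envelope capacity numbers. All inputs are assumptions from the text.
DAU = 500_000_000
MSGS_PER_USER_PER_DAY = 40
BYTES_PER_MSG = 200          # metadata only, before replication
ONLINE_PERCENT = 20          # share of DAU connected at peak
CONNS_PER_GATEWAY = 50_000
SECONDS_PER_DAY = 86_400

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY
avg_writes_per_sec = msgs_per_day / SECONDS_PER_DAY
storage_tb_per_day = msgs_per_day * BYTES_PER_MSG / 1e12
concurrent_conns = DAU * ONLINE_PERCENT // 100
gateway_nodes = -(-concurrent_conns // CONNS_PER_GATEWAY)  # ceiling division

print(f"messages/day:       {msgs_per_day:,}")
print(f"avg writes/sec:     {avg_writes_per_sec:,.0f}")
print(f"storage TB/day:     {storage_tb_per_day:.1f}")
print(f"concurrent sockets: {concurrent_conns:,}")
print(f"gateway nodes:      {gateway_nodes:,}")
```

Under these assumptions, 100M sockets at 50K per node comes to 2,000 gateway nodes, before any headroom for failover or uneven load.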
High-level architecture
- Mobile/web clients connect via WebSocket (or long polling fallback) to Chat Gateway.
- Gateway routes to the correct Chat Server instance based on user_id (sticky sessions).
- Message Service persists to Message DB and publishes to internal queue.
- Delivery Service pushes to recipient's WebSocket if online; else triggers push notification.
- Presence Service tracks online status in Redis with heartbeat TTL.
- Media Service handles uploads to object storage (out of scope for first 20 minutes).
| Component | Technology | Why |
|---|---|---|
| Chat Gateway | WebSocket load balancer | Persistent bidirectional connection |
| Message store | Cassandra or partitioned SQL | High write volume, time-ordered reads |
| Presence | Redis keys with TTL | Fast online checks; heartbeat refreshes expiry |
| Push | FCM / APNs | Offline users |
| Queue | Kafka | Decouple write from delivery fan-out |
One-to-one message flow
- Alice sends message to Bob via WebSocket: { conv_id, text, client_msg_id }.
- Server validates membership, assigns server_msg_id, writes to messages table.
- Server ACKs Alice with server_msg_id (idempotent on client_msg_id retry).
- Delivery Service looks up Bob's connection on Chat Server #7.
- If online: push message over WebSocket; send delivered ACK to Alice when Bob's client ACKs.
- If offline: enqueue push notification; mark pending delivery.
- Bob opens app later: sync API returns messages since last cursor.
Idempotency
Clients retry on flaky networks. Store client_msg_id unique per sender and return the same server_msg_id on duplicate — same pattern as payment APIs.
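A minimal sketch of that uniqueness rule, using an in-memory SQLite table as a stand-in for the real message store. The table layout and the `open_store` and `send_message` helpers are hypothetical names chosen for illustration; the point is only that a unique constraint on (sender_id, client_msg_id) turns a retry into a lookup.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory message table (stand-in for the production store)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE messages (
            server_msg_id INTEGER PRIMARY KEY AUTOINCREMENT,
            conv_id       TEXT NOT NULL,
            sender_id     TEXT NOT NULL,
            client_msg_id TEXT NOT NULL,
            body          TEXT NOT NULL,
            UNIQUE (sender_id, client_msg_id)
        )
        """
    )
    return db


def send_message(db, conv_id, sender_id, client_msg_id, body) -> int:
    """Insert once; a retry with the same client_msg_id returns the original id."""
    try:
        with db:  # commits on success, rolls back on error
            cur = db.execute(
                "INSERT INTO messages (conv_id, sender_id, client_msg_id, body) "
                "VALUES (?, ?, ?, ?)",
                (conv_id, sender_id, client_msg_id, body),
            )
            return cur.lastrowid
    except sqlite3.IntegrityError:
        # Duplicate from this sender: look up the id assigned the first time.
        row = db.execute(
            "SELECT server_msg_id FROM messages "
            "WHERE sender_id = ? AND client_msg_id = ?",
            (sender_id, client_msg_id),
        ).fetchone()
        return row[0]
```

Calling `send_message` twice with the same sender and `client_msg_id` stores one row and returns the same id both times, so the client can retry blindly after a timeout.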
Group chat
For a group of N members, each message creates N-1 deliveries. At 500 members and 10 msg/sec, that is about 5,000 deliveries/sec per active group. Options:
- Store message once per conversation_id; each user tracks read_cursor per conversation.
- Fan-out on read: members pull new messages since their last sync token.
- Fan-out on write for small groups (< 100); pull model for large channels.
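The hybrid option can be sketched as follows. This is an illustration under assumptions, not a reference design: the `Conversation` class, the in-memory `inboxes` dict, and the `post` and `pull` helpers are hypothetical, and the 100-member threshold is the one quoted in the list above.

```python
from dataclasses import dataclass, field

FANOUT_WRITE_MAX_MEMBERS = 100  # threshold from the text; tune per workload


@dataclass
class Conversation:
    conv_id: str
    members: set
    # The message is stored once per conversation: (msg_id, sender, body).
    log: list = field(default_factory=list)


def post(conv, sender, body, inboxes):
    """Append to the conversation log; push to inboxes only for small groups.

    Returns the number of per-recipient deliveries created at write time.
    """
    msg_id = len(conv.log) + 1
    conv.log.append((msg_id, sender, body))
    if len(conv.members) < FANOUT_WRITE_MAX_MEMBERS:
        recipients = conv.members - {sender}
        for member in recipients:
            inboxes.setdefault(member, []).append((conv.conv_id, msg_id))
        return len(recipients)
    return 0  # large channel: members pull on read instead


def pull(conv, after_msg_id):
    """Fan-out on read: return messages newer than the member's sync token."""
    return [m for m in conv.log if m[0] > after_msg_id]
```

In a three-person group `post` creates two inbox entries; in a 500-member channel it creates none, and each member calls `pull` with their last seen `msg_id`.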
Data model
- conversations: conv_id, type (1:1 or group), created_at
- conversation_members: conv_id, user_id, joined_at, last_read_msg_id
- messages: msg_id, conv_id, sender_id, body, created_at (partition by conv_id + time)
- presence: Redis key online:{user_id} with 30s TTL, refreshed by heartbeat
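To make the role of last_read_msg_id concrete, here is a small sketch of the three relational tables in SQLite with an unread-count query. Column types, the `mark_read` and `unread_count` helpers, and the choice to exclude a user's own messages from the count are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE conversations (
    conv_id    TEXT PRIMARY KEY,
    type       TEXT NOT NULL,          -- '1:1' or 'group'
    created_at INTEGER NOT NULL
);
CREATE TABLE conversation_members (
    conv_id          TEXT NOT NULL,
    user_id          TEXT NOT NULL,
    joined_at        INTEGER NOT NULL,
    last_read_msg_id INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (conv_id, user_id)
);
CREATE TABLE messages (
    msg_id     INTEGER NOT NULL,
    conv_id    TEXT NOT NULL,
    sender_id  TEXT NOT NULL,
    body       TEXT NOT NULL,
    created_at INTEGER NOT NULL,
    PRIMARY KEY (conv_id, msg_id)      -- clustered by conversation, then order
);
"""


def mark_read(db, conv_id, user_id, last_read_msg_id):
    """Advance the read cursor; never move it backwards on a stale receipt."""
    with db:
        db.execute(
            "UPDATE conversation_members "
            "SET last_read_msg_id = MAX(last_read_msg_id, ?) "
            "WHERE conv_id = ? AND user_id = ?",
            (last_read_msg_id, conv_id, user_id),
        )


def unread_count(db, conv_id, user_id):
    """Count messages from others that are newer than the member's cursor."""
    row = db.execute(
        "SELECT COUNT(*) FROM messages m "
        "JOIN conversation_members cm ON cm.conv_id = m.conv_id "
        "WHERE cm.conv_id = ? AND cm.user_id = ? "
        "AND m.msg_id > cm.last_read_msg_id AND m.sender_id != cm.user_id",
        (conv_id, user_id),
    ).fetchone()
    return row[0]
```

Because the cursor lives on the membership row, a read receipt is one small update per user per conversation rather than one write per message.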
WebSocket vs polling
WebSockets give true push latency (milliseconds). Long polling works for MVP but wastes connections. In interviews, default to WebSocket + fallback polling for corporate firewalls. Mention connection scaling: millions of concurrent sockets need many gateway nodes and a pub/sub backbone (Redis Pub/Sub or dedicated message bus) so any server can reach any user. Chat gateways use load balancing with sticky sessions or a shared connection registry.
Connection registry
When Alice's message must reach Bob, the delivery service needs to know which chat server holds Bob's socket. Store a mapping in Redis: user_id → { server_id, connection_id } with a TTL refreshed by heartbeat. On disconnect, delete the entry. This decouples delivery from sticky routing: any server can look up where to push.
| Event | Registry action |
|---|---|
| User connects WebSocket | SET user:{id} → server_id, refresh TTL |
| Heartbeat every 15s | EXPIRE user:{id} 30 (seconds) |
| User disconnects | DEL user:{id} |
| Message for offline user | Registry miss → push notification queue |
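The table above maps onto a small in-memory model. This sketch deliberately avoids a real Redis client: `ConnectionRegistry` and `route` are hypothetical names, the dict stands in for Redis keys, and the injectable clock exists only so expiry can be demonstrated without sleeping.

```python
import time


class ConnectionRegistry:
    """In-memory stand-in for the Redis mapping user_id -> server_id with TTL."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # user_id -> (server_id, expires_at)

    def connect(self, user_id, server_id):
        # Like SET with an expiry: register the socket's home server.
        self._entries[user_id] = (server_id, self._clock() + self._ttl)

    def heartbeat(self, user_id):
        # Like EXPIRE: extend the TTL only if the entry is still alive.
        if self.lookup(user_id) is None:
            return False
        server_id, _ = self._entries[user_id]
        self._entries[user_id] = (server_id, self._clock() + self._ttl)
        return True

    def disconnect(self, user_id):
        # Like DEL: a clean close removes the entry immediately.
        self._entries.pop(user_id, None)

    def lookup(self, user_id):
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        server_id, expires_at = entry
        if expires_at <= self._clock():
            del self._entries[user_id]  # stale entry, e.g. after a server crash
            return None
        return server_id


def route(registry, user_id):
    """Registry hit -> push over that server's WebSocket; miss -> push queue."""
    server_id = registry.lookup(user_id)
    return f"websocket:{server_id}" if server_id else "push_queue"
```

With a 15-second heartbeat and a 30-second TTL, one missed heartbeat is tolerated; after two, the entry expires and messages fall through to the push notification queue.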
Multi-device and message ordering
Bob may be on his phone and laptop simultaneously. Register multiple connections per user_id and deliver to all active devices. Message ordering within a conversation uses an increasing server_msg_id: a DB sequence per conv_id gives a strict order, while Snowflake-style IDs are only roughly time-ordered across generator nodes. Clients discard duplicates and sort by server_msg_id. Cross-device sync uses the same paginated history API: GET /v1/conversations/{id}/messages?after=cursor.
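On the client side, "discard duplicates and sort by server_msg_id" is a short merge step. The sketch below assumes messages are plain dicts with a `server_msg_id` field; `merge_messages` and `next_cursor` are hypothetical helper names.

```python
def merge_messages(local, incoming):
    """Merge a sync response into local history, deduplicating by server_msg_id."""
    by_id = {m["server_msg_id"]: m for m in local}
    for m in incoming:
        # A message already held locally (live push plus sync pull) is kept once.
        by_id.setdefault(m["server_msg_id"], m)
    return [by_id[msg_id] for msg_id in sorted(by_id)]


def next_cursor(messages, current=0):
    """The cursor for the next sync call is the highest id seen so far."""
    return max([current] + [m["server_msg_id"] for m in messages])
```

The same merge works whether a message arrived by live WebSocket push, by sync pull, or both, which is what keeps a phone and a laptop showing the same history.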
Storage and partitioning
Partition messages by conversation_id so all messages in one chat live on the same shard and range queries stay local. In Cassandra, use conv_id as the partition key, adding a time bucket for very active conversations so partitions stay bounded; PostgreSQL can use hash partitioning on conv_id. See SQL vs NoSQL for why append-heavy chat logs favor wide-column stores at billion-message scale.
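A sketch of both ideas, hash-based shard selection and a conv_id plus time-bucket partition key. The function names, the use of SHA-256, and the one-week bucket are illustrative assumptions; a real store applies its own partitioner.

```python
import hashlib

SECONDS_PER_WEEK = 7 * 86_400


def shard_for(conv_id, num_shards):
    """Map a conversation to a shard with a stable hash.

    Python's built-in hash() is salted per process, so a cryptographic
    digest is used here to get the same answer on every server.
    """
    digest = hashlib.sha256(conv_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition_key(conv_id, created_at_epoch_s, bucket_seconds=SECONDS_PER_WEEK):
    """Composite key (conv_id, time bucket) that bounds partition size.

    All messages from one conversation in the same bucket share a partition,
    so "latest N messages" usually touches one or two partitions.
    """
    return (conv_id, created_at_epoch_s // bucket_seconds)
```

Note that plain modulo hashing reshuffles most conversations when the shard count changes; consistent hashing is the usual answer if an interviewer pushes on resharding.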
Delivery state machine
Clients show checkmarks based on server-confirmed states — define them precisely:
| State | Meaning | Trigger |
|---|---|---|
| Sent | Server stored message | ACK to sender with server_msg_id |
| Delivered | Recipient device received payload | Client ACK over WebSocket or sync pull |
| Read | User opened conversation | POST /v1/conversations/{id}/read with last_read_msg_id |
Do not mark delivered until the recipient client confirms — server push alone is not enough on flaky mobile networks.
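One way to encode these states is as an ordered enum that only moves forward. This is a sketch under an assumption the table does not state explicitly: that a late or out-of-order ACK (for example, a "delivered" ACK arriving after "read") should be ignored rather than regress the state. `DeliveryState` and `advance` are hypothetical names.

```python
from enum import IntEnum


class DeliveryState(IntEnum):
    SENDING = 0    # client-side only: not yet ACKed by the server
    SENT = 1       # server stored the message
    DELIVERED = 2  # recipient client ACKed receipt
    READ = 3       # recipient posted a read receipt


def advance(current, event):
    """Apply an ACK event; the state never moves backwards."""
    return max(current, event)
```

For instance, `advance(DeliveryState.SENT, DeliveryState.DELIVERED)` moves to DELIVERED, while `advance(DeliveryState.READ, DeliveryState.DELIVERED)` stays at READ.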
Push notification path
- Delivery service finds no WebSocket registry entry for Bob.
- Enqueue push job: { user_id, conv_id, preview_text, badge_count }.
- Push worker calls FCM (Android) or APNs (iOS) with device token from user_devices table.
- Bob taps notification → app cold-starts → WebSocket connect → sync API fetches missed messages.
- Collapse multiple notifications per conversation to avoid notification spam.
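The collapsing step in the last bullet can be sketched as a pass over queued push jobs before they reach the provider. The job shape follows the fields listed above; `collapse_push_jobs` and the added `message_count` field are hypothetical, and the policy of showing the latest preview is an assumption.

```python
def collapse_push_jobs(jobs):
    """Keep one pending notification per (user_id, conv_id).

    The newest preview text wins, and message_count records how many
    messages were folded into the single notification.
    """
    collapsed = {}
    for job in jobs:
        key = (job["user_id"], job["conv_id"])
        previous = collapsed.get(key)
        count = previous["message_count"] + 1 if previous else 1
        collapsed[key] = {**job, "message_count": count}
    return list(collapsed.values())
```

Three queued messages across two conversations therefore produce two notifications, one of them carrying a count of 2.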
API summary
| Endpoint | Purpose | Response / notes |
|---|---|---|
| POST /v1/conversations | Create 1:1 or group | 201 { conv_id } |
| POST /v1/conversations/{id}/messages | Send message | 201 + idempotent client_msg_id |
| GET /v1/conversations/{id}/messages?after=cursor | History sync | Cursor pagination |
| POST /v1/conversations/{id}/read | Read receipt | 204 |
| GET /v1/presence?user_ids=... | Batch online status | 200 { statuses } |
Full REST conventions apply — version prefix, consistent errors, 429 on abuse.
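Cursor pagination on the history endpoint can be illustrated with a handler-shaped function. This is a sketch, assuming the cursor is simply the last server_msg_id the client has seen; `page_messages` and the response field names `next_cursor` and `has_more` are hypothetical.

```python
def page_messages(log, after=0, limit=50):
    """Return up to `limit` messages with server_msg_id greater than `after`.

    `log` is assumed to be sorted by server_msg_id ascending, as a
    per-conversation partition would be.
    """
    newer = [m for m in log if m["server_msg_id"] > after]
    page = newer[:limit]
    return {
        "messages": page,
        # If nothing is new, the client keeps its current cursor.
        "next_cursor": page[-1]["server_msg_id"] if page else after,
        "has_more": len(newer) > limit,
    }
```

A client loops, passing `next_cursor` back as `after`, until `has_more` is false. Unlike offset pagination, the result stays stable when new messages arrive mid-scroll.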
Failure modes
| Failure | User impact | Mitigation |
|---|---|---|
| Chat server crash | Brief disconnect; client reconnects | Exponential backoff reconnect; registry TTL expires stale entries |
| Message DB slow | Send latency spikes | Queue accepts write; async persist with client "sending" state |
| Push provider outage | Offline users miss instant alert | Retry queue; sync on next app open |
| Duplicate client_msg_id retry | Must not show two messages | Unique constraint per sender; return original server_msg_id |
| Group fan-out overload | Large channel lag | Pull model for members; rate limit posts per minute |
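For the reconnect mitigation in the first row, a common approach is exponential backoff with jitter, so that thousands of clients dropped by one crashed server do not reconnect in lockstep. The sketch below uses the "full jitter" variant; the base delay, the cap, and the function name are illustrative assumptions.

```python
import random


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=random.random):
    """Full-jitter exponential backoff.

    The delay before attempt n is a random fraction of
    min(cap, base * 2**n), so the upper bound doubles until it hits the cap.
    """
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]
```

With the defaults, the upper bounds are 0.5s, 1s, 2s, 4s, 8s, and 16s, and later attempts are capped at 30s; each actual delay is a random value below its bound.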
Latency budget for send message
| Step | Target |
|---|---|
| WebSocket receive + validate | < 5ms |
| Persist to DB (async option) | 5–20ms sync, or ACK after queue |
| Registry lookup + push to recipient | < 10ms if online |
| Delivered ACK back to sender | < 30ms end-to-end p99 |
WhatsApp-feel latency requires online delivery over WebSocket, not polling. Offline path optimises for reliability over speed.
Optional v2 features
- Typing indicators: ephemeral events over WebSocket, not persisted.
- End-to-end encryption: keys on device; server stores ciphertext only — major scope addition.
- Message search: Elasticsearch index async from the message queue.
- Media messages: pre-signed S3 upload URL, then message body references object key.
What to say in the last five minutes
Summarise: "WebSocket gateway with connection registry in Redis, durable message store partitioned by conversation, async delivery with push fallback, idempotent client_msg_id on send. Small groups fan-out on write; large channels pull on read. Delivery and read states are explicit ACKs, not guesses." That is a complete interview answer without over-building.
Mock interview checklist
- Clarified 1:1 vs group and max group size.
- Separated write path (persist) from delivery path (push).
- Explained WebSocket scaling (registry or sticky sessions).
- Defined sent / delivered / read semantics.
- Mentioned offline push and history sync API.
- Discussed idempotency on flaky mobile networks.
How this connects to DSA
Message ordering within a conversation is a total order problem. Group delivery is BFS fan-out at scale. Presence TTL is sliding-window expiry — same intuition as rate limiting.
Closing summary
Lead with WebSocket gateway, persistent message store, async delivery, presence in Redis, and push for offline. Separate 1:1 from group fan-out strategy. Discuss idempotency and delivery ACKs — that is what separates a diagram from a production design.