DDSA Solutions

Design a Chat / Messaging System (WhatsApp / Slack DM)

System design for real-time chat: WebSockets vs polling, message storage, delivery guarantees, online presence, and group chat scaling.

Chat systems combine a classic CRUD problem (store messages) with a real-time delivery problem (get them to the right device now). Interviewers want to see you separate one-to-one chat from group chat, and to discuss what "delivered" and "read" actually mean. Start with the interview framework — clarify requirements before drawing WebSocket boxes everywhere.

Requirements

  • One-to-one and group conversations.
  • Send text messages; media as optional v2.
  • Delivery states: sent, delivered, read (read receipts).
  • Show online / last-seen presence.
  • Message history when user opens app (paginated).
  • Push notification when recipient is offline.

Clarify scale

Slack-style workplace chat (thousands of users per org) differs from WhatsApp-scale chat (billions of users). Ask about daily active users, messages per second, and maximum group size. A 500-person group changes fan-out completely.

Capacity estimation

Assume 500M DAU, each sending 40 messages/day → 20B messages/day ≈ 230,000 writes/sec average, ~1M/sec peak. Storage: 200 bytes of metadata per message ≈ 4TB/day raw before replication and compaction. Peak concurrent connections: if 20% of DAU are online at once, that is 100M WebSockets, so plan for roughly 2,000 gateway nodes at ~50K connections each. State these numbers before drawing boxes.
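The estimate above can be reproduced in a few lines of Python. This is only a back-of-envelope sketch: every input is an assumption taken from the paragraph, and the variable names are illustrative.

```python
# Back-of-envelope capacity estimate; all inputs are assumptions from the text.
DAU = 500_000_000
MSGS_PER_USER_PER_DAY = 40
BYTES_PER_MESSAGE = 200        # metadata only, before replication and compaction
ONLINE_PERCENT = 20            # share of DAU connected at peak
CONNS_PER_GATEWAY = 50_000
SECONDS_PER_DAY = 86_400

messages_per_day = DAU * MSGS_PER_USER_PER_DAY
avg_writes_per_sec = messages_per_day / SECONDS_PER_DAY
storage_tb_per_day = messages_per_day * BYTES_PER_MESSAGE / 1e12
concurrent_sockets = DAU * ONLINE_PERCENT // 100
gateway_nodes = -(-concurrent_sockets // CONNS_PER_GATEWAY)  # ceiling division

print(f"messages/day:       {messages_per_day:,}")
print(f"avg writes/sec:     {avg_writes_per_sec:,.0f}")
print(f"storage TB/day:     {storage_tb_per_day:.1f}")
print(f"concurrent sockets: {concurrent_sockets:,}")
print(f"gateway nodes:      {gateway_nodes:,}")
```

Changing a single input (for example, 10K connections per gateway instead of 50K) shows quickly which assumption dominates the design.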

High-level architecture

  1. Mobile/web clients connect via WebSocket (or long polling fallback) to Chat Gateway.
  2. Gateway routes to the correct Chat Server instance based on user_id (sticky sessions).
  3. Message Service persists to Message DB and publishes to internal queue.
  4. Delivery Service pushes to recipient's WebSocket if online; else triggers push notification.
  5. Presence Service tracks online status in Redis with heartbeat TTL.
  6. Media Service handles uploads to object storage (out of scope for first 20 minutes).

Component | Technology | Why
--- | --- | ---
Chat Gateway | WebSocket load balancer | Persistent bidirectional connection
Message store | Cassandra or partitioned SQL | High write volume, time-ordered reads
Presence | Redis keys with TTL | Fast online checks; heartbeat refreshes expiry
Push | FCM / APNs | Offline users
Queue | Kafka | Decouple write from delivery fan-out

One-to-one message flow

  1. Alice sends message to Bob via WebSocket: { conv_id, text, client_msg_id }.
  2. Server validates membership, assigns server_msg_id, writes to messages table.
  3. Server ACKs Alice with server_msg_id (idempotent on client_msg_id retry).
  4. Delivery Service looks up Bob's connection on Chat Server #7.
  5. If online: push message over WebSocket; send delivered ACK to Alice when Bob's client ACKs.
  6. If offline: enqueue push notification; mark pending delivery.
  7. Bob opens app later: sync API returns messages since last cursor.

Idempotency

Clients retry on flaky networks. Store client_msg_id unique per sender and return the same server_msg_id on duplicate — same pattern as payment APIs.
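A minimal sketch of that dedupe rule, using an in-memory dictionary as a stand-in for a unique index on (sender_id, client_msg_id). The class and method names are hypothetical; a real service would enforce the constraint in the message database.

```python
import itertools


class MessageStore:
    """In-memory stand-in for the messages table (illustration only)."""

    def __init__(self):
        self._next_id = itertools.count(1)   # stand-in for a server-side ID generator
        self._by_client_key = {}             # (sender_id, client_msg_id) -> server_msg_id
        self.messages = []

    def send(self, sender_id, conv_id, client_msg_id, text):
        key = (sender_id, client_msg_id)
        if key in self._by_client_key:
            # Duplicate retry: return the original ID, store nothing new.
            return self._by_client_key[key]
        server_msg_id = next(self._next_id)
        self.messages.append({
            "server_msg_id": server_msg_id,
            "conv_id": conv_id,
            "sender_id": sender_id,
            "text": text,
        })
        self._by_client_key[key] = server_msg_id
        return server_msg_id


store = MessageStore()
first = store.send("alice", "conv-1", "client-abc", "hello")
retry = store.send("alice", "conv-1", "client-abc", "hello")  # network retry
print(first == retry, len(store.messages))
```

The key is scoped per sender, so two different users can reuse the same client_msg_id without colliding.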

Group chat

For a group of N members, each message creates N-1 deliveries. At 500 members and 10 msg/sec, that is 5,000 deliveries/sec per active group. Options:

  • Store message once per conversation_id; each user tracks read_cursor per conversation.
  • Fan-out on read: members pull new messages since their last sync token.
  • Fan-out on write for small groups (< 100); pull model for large channels.
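The fan-out arithmetic and the size-based choice between push and pull can be sketched as below. The 100-member threshold comes from the list above; the function names and the exact cut-off are assumptions that would be tuned in practice.

```python
FANOUT_ON_WRITE_MAX_MEMBERS = 100  # threshold from the text; a tunable assumption


def deliveries_per_second(members: int, msgs_per_sec: int) -> int:
    """Each message goes to everyone except the sender: N - 1 deliveries."""
    return (members - 1) * msgs_per_sec


def choose_fanout(members: int) -> str:
    """Push to every member for small groups; let members pull in large channels."""
    return "write" if members < FANOUT_ON_WRITE_MAX_MEMBERS else "read"


for size in (5, 99, 500):
    print(size, choose_fanout(size), deliveries_per_second(size, 10))
```

For the 500-member example this gives 4,990 deliveries/sec, which the text rounds to 5,000.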

Data model

  • conversations: conv_id, type (1:1 or group), created_at
  • conversation_members: conv_id, user_id, joined_at, last_read_msg_id
  • messages: msg_id, conv_id, sender_id, body, created_at (partition by conv_id + time)
  • presence: Redis key online:{user_id} with 30s TTL, refreshed by heartbeat
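One way to make the relational part of this model concrete is a small SQLite schema, shown here through Python's built-in sqlite3 module. The column types, the index, and the unread_count helper are assumptions for illustration; a production store (Cassandra or partitioned SQL) would look different, and presence stays in Redis rather than in these tables.

```python
import sqlite3

SCHEMA = """
CREATE TABLE conversations (
    conv_id    INTEGER PRIMARY KEY,
    type       TEXT NOT NULL CHECK (type IN ('1:1', 'group')),
    created_at INTEGER NOT NULL
);
CREATE TABLE conversation_members (
    conv_id          INTEGER NOT NULL REFERENCES conversations(conv_id),
    user_id          INTEGER NOT NULL,
    joined_at        INTEGER NOT NULL,
    last_read_msg_id INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (conv_id, user_id)
);
CREATE TABLE messages (
    msg_id     INTEGER PRIMARY KEY,
    conv_id    INTEGER NOT NULL REFERENCES conversations(conv_id),
    sender_id  INTEGER NOT NULL,
    body       TEXT NOT NULL,
    created_at INTEGER NOT NULL
);
CREATE INDEX messages_by_conv ON messages (conv_id, msg_id);
"""


def unread_count(db, conv_id, user_id):
    """Count messages newer than the member's read cursor (own messages included)."""
    row = db.execute(
        """
        SELECT COUNT(*)
        FROM messages m
        JOIN conversation_members cm ON cm.conv_id = m.conv_id
        WHERE m.conv_id = ? AND cm.user_id = ? AND m.msg_id > cm.last_read_msg_id
        """,
        (conv_id, user_id),
    ).fetchone()
    return row[0]


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.execute("INSERT INTO conversations VALUES (1, 'group', 0)")
# (user_id, last_read_msg_id): user 10 has read nothing, user 11 has read up to msg 2.
db.executemany("INSERT INTO conversation_members VALUES (1, ?, 0, ?)", [(10, 0), (11, 2)])
db.executemany("INSERT INTO messages VALUES (?, 1, 10, ?, 0)", [(1, "a"), (2, "b"), (3, "c")])
print(unread_count(db, 1, 10), unread_count(db, 1, 11))
```

Because each member row carries only a cursor, a message is stored once per conversation no matter how many members read it.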

WebSocket vs polling

WebSockets give true push latency (milliseconds). Long polling works for MVP but wastes connections. In interviews, default to WebSocket + fallback polling for corporate firewalls. Mention connection scaling: millions of concurrent sockets need many gateway nodes and a pub/sub backbone (Redis Pub/Sub or dedicated message bus) so any server can reach any user. Chat gateways use load balancing with sticky sessions or a shared connection registry.

Connection registry

When Alice's message must reach Bob, the delivery service needs to know which chat server holds Bob's socket. Store a mapping in Redis: user_id → { server_id, connection_id } with a TTL refreshed by heartbeat. On disconnect, delete the entry. This decouples delivery from sticky routing: any server can look up where to push.

Event | Registry action
--- | ---
User connects WebSocket | SET user:{id} → server_id, refresh TTL
Heartbeat every 15s | EXPIRE user:{id} 30s
User disconnects | DEL user:{id}
Message for offline user | Registry miss → push notification queue
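The registry lifecycle in the table can be sketched with an in-memory class that mimics key expiry. This is a stand-in for Redis, not a Redis client: the class, its methods, and the route helper are hypothetical, and the injectable clock exists only to make expiry easy to demonstrate.

```python
import time


class ConnectionRegistry:
    """In-memory stand-in for the Redis mapping user_id -> server_id with a TTL."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # user_id -> (server_id, expires_at)

    def connect(self, user_id, server_id):
        self._entries[user_id] = (server_id, self._clock() + self._ttl)

    def heartbeat(self, user_id):
        """Refresh the TTL; returns False if the entry is missing or already expired."""
        server_id = self.lookup(user_id)
        if server_id is None:
            return False
        self._entries[user_id] = (server_id, self._clock() + self._ttl)
        return True

    def disconnect(self, user_id):
        self._entries.pop(user_id, None)

    def lookup(self, user_id):
        entry = self._entries.get(user_id)
        if entry is None or entry[1] <= self._clock():
            self._entries.pop(user_id, None)  # lazily drop stale entries
            return None
        return entry[0]


def route(registry, user_id):
    """Registry hit: push over WebSocket. Registry miss: push notification queue."""
    server_id = registry.lookup(user_id)
    if server_id is not None:
        return ("websocket", server_id)
    return ("push_queue", None)


now = [0.0]  # fake clock so the example does not need to sleep
registry = ConnectionRegistry(ttl_seconds=30.0, clock=lambda: now[0])
registry.connect("bob", "chat-server-7")
print(route(registry, "bob"))
now[0] = 31.0  # no heartbeat arrived within the TTL
print(route(registry, "bob"))
```

A heartbeat interval of half the TTL (15s against 30s) means one lost heartbeat does not immediately mark the user offline.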

Multi-device and message ordering

Bob may be on phone and laptop simultaneously. Register multiple connections per user_id and deliver to every active device. Ordering within a conversation relies on a monotonic server_msg_id (a DB sequence per conv_id, or a Snowflake-style ID, which is only roughly time-ordered across generator nodes). Clients discard duplicates and sort by server_msg_id. Cross-device sync uses the same paginated history API: GET /v1/conversations/{id}/messages?after=cursor.
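On the client, "discard duplicates and sort by server_msg_id" can be as small as the sketch below. The message shape (a dict with a server_msg_id key) and the function name are assumptions for illustration.

```python
def merge_messages(local, incoming):
    """Merge a sync or push batch into the local list: dedupe, then order by ID."""
    by_id = {m["server_msg_id"]: m for m in local}
    for message in incoming:
        # Keep the copy we already have if the same ID arrives twice.
        by_id.setdefault(message["server_msg_id"], message)
    return [by_id[msg_id] for msg_id in sorted(by_id)]


local = [{"server_msg_id": 1, "text": "hi"}, {"server_msg_id": 3, "text": "lunch?"}]
incoming = [{"server_msg_id": 3, "text": "lunch?"}, {"server_msg_id": 2, "text": "hey"}]
print([m["server_msg_id"] for m in merge_messages(local, incoming)])
```

The same function works whether a message arrived over the WebSocket or through the history API, which is what makes push plus sync safe to combine.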

Storage and partitioning

Partition messages by conversation_id so all messages in one chat live on the same shard — range queries stay local. Cassandra uses conv_id as partition key; PostgreSQL can use hash partitioning on conv_id. See SQL vs NoSQL for why append-heavy chat logs favor wide-column stores at billion-message scale.
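A minimal sketch of hash partitioning on conv_id, assuming a fixed shard count. The function name and the choice of SHA-256 are illustrative; real systems often use consistent hashing or the database's own partitioner instead, because plain modulo remaps most keys when the shard count changes.

```python
import hashlib


def shard_for(conv_id: str, num_shards: int) -> int:
    """Map a conversation to a shard with a stable hash.

    Python's built-in hash() is randomized per process for strings, so a
    cryptographic digest is used to get the same answer on every server.
    """
    digest = hashlib.sha256(conv_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


for conv in ("conv-1", "conv-2", "conv-3"):
    print(conv, "-> shard", shard_for(conv, 16))
```

Every message in a conversation maps to the same shard, so a history query for one chat touches a single partition.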

Delivery state machine

Clients show checkmarks based on server-confirmed states — define them precisely:

State | Meaning | Trigger
--- | --- | ---
Sent | Server stored message | ACK to sender with server_msg_id
Delivered | Recipient device received payload | Client ACK over WebSocket or sync pull
Read | User opened conversation | POST /v1/conversations/{id}/read with last_read_msg_id

Do not mark delivered until the recipient client confirms — server push alone is not enough on flaky mobile networks.
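The states above form a forward-only progression, which can be sketched with an ordered enum. The SENDING value is a client-side state before the server ACK, and the advance helper is a hypothetical name; treating a read receipt as implying delivery is an assumption of this sketch.

```python
from enum import IntEnum


class DeliveryState(IntEnum):
    SENDING = 0    # client-side only: no server ACK yet
    SENT = 1       # server stored the message
    DELIVERED = 2  # recipient client confirmed receipt
    READ = 3       # recipient opened the conversation


def advance(current: DeliveryState, event: DeliveryState) -> DeliveryState:
    """Apply an ACK. States only move forward, so late or duplicate ACKs are ignored."""
    return max(current, event)


state = DeliveryState.SENDING
for ack in (DeliveryState.SENT, DeliveryState.READ, DeliveryState.DELIVERED):
    state = advance(state, ack)  # the DELIVERED ACK arrives late and changes nothing
print(state.name)
```

Out-of-order ACKs are common on mobile networks, so a rule that never moves backward keeps the checkmarks stable.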

Push notification path

  1. Delivery service finds no WebSocket registry entry for Bob.
  2. Enqueue push job: { user_id, conv_id, preview_text, badge_count }.
  3. Push worker calls FCM (Android) or APNs (iOS) with device token from user_devices table.
  4. Bob taps notification → app cold-starts → WebSocket connect → sync API fetches missed messages.
  5. Collapse multiple notifications per conversation to avoid notification spam.
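Step 5 can be sketched as a reduction over queued push jobs, keeping one notification per (user_id, conv_id). The job fields follow step 2; the message_count field and the function name are additions for illustration. FCM and APNs also offer provider-side collapsing (collapse_key and apns-collapse-id respectively), which this sketch does not call.

```python
def collapse_push_jobs(jobs):
    """Keep one push job per (user_id, conv_id), showing the latest preview."""
    collapsed = {}
    for job in jobs:  # jobs are assumed to be in arrival order
        key = (job["user_id"], job["conv_id"])
        entry = collapsed.get(key)
        if entry is None:
            collapsed[key] = {**job, "message_count": 1}
        else:
            entry["preview_text"] = job["preview_text"]
            entry["badge_count"] = max(entry["badge_count"], job["badge_count"])
            entry["message_count"] += 1
    return list(collapsed.values())


jobs = [
    {"user_id": "bob", "conv_id": "c1", "preview_text": "hi", "badge_count": 1},
    {"user_id": "bob", "conv_id": "c1", "preview_text": "you there?", "badge_count": 2},
    {"user_id": "bob", "conv_id": "c2", "preview_text": "lunch?", "badge_count": 3},
]
for job in collapse_push_jobs(jobs):
    print(job["conv_id"], job["message_count"], job["preview_text"])
```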

API summary

Endpoint | Purpose | Response / notes
--- | --- | ---
POST /v1/conversations | Create 1:1 or group | 201 { conv_id }
POST /v1/conversations/{id}/messages | Send message | 201 + idempotent client_msg_id
GET /v1/conversations/{id}/messages?after=cursor | History sync | Cursor pagination
POST /v1/conversations/{id}/read | Read receipt | 204
GET /v1/presence?user_ids=... | Batch online status | 200 { statuses }

Full REST conventions apply — version prefix, consistent errors, 429 on abuse.
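The cursor pagination behind the history endpoint can be sketched as a pure function over an ordered message list. The response fields (next_cursor, has_more) and the default page size are assumptions, since the article does not specify the response body.

```python
def fetch_page(messages, after=0, limit=50):
    """Return messages with server_msg_id greater than the cursor, oldest first.

    `messages` is assumed to be sorted ascending by server_msg_id.
    """
    newer = [m for m in messages if m["server_msg_id"] > after]
    page = newer[:limit]
    next_cursor = page[-1]["server_msg_id"] if page else after
    return {
        "messages": page,
        "next_cursor": next_cursor,
        "has_more": len(newer) > limit,
    }


history = [{"server_msg_id": i, "text": f"msg {i}"} for i in range(1, 6)]
cursor = 0
while True:
    result = fetch_page(history, after=cursor, limit=2)
    print([m["server_msg_id"] for m in result["messages"]])
    cursor = result["next_cursor"]
    if not result["has_more"]:
        break
```

A cursor based on the last seen ID stays correct when new messages arrive between requests, which offset-based paging does not guarantee.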

Failure modes

Failure | User impact | Mitigation
--- | --- | ---
Chat server crash | Brief disconnect; client reconnects | Exponential backoff reconnect; registry TTL expires stale entries
Message DB slow | Send latency spikes | Queue accepts write; async persist with client "sending" state
Push provider outage | Offline users miss instant alert | Retry queue; sync on next app open
Duplicate client_msg_id retry | Must not show two messages | Unique constraint per sender; return original server_msg_id
Group fan-out overload | Large channel lag | Pull model for members; rate limit posts per minute
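The "exponential backoff reconnect" mitigation is sketched below with full jitter, so that clients dropped by the same crashed server do not all reconnect at the same instant. The base delay, cap, and function name are assumptions; jitter is an addition that the table does not mention.

```python
import random


def reconnect_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Delay in seconds before each reconnect attempt.

    The ceiling doubles per attempt up to `cap`; the actual delay is a random
    fraction of that ceiling (full jitter).
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng() * ceiling)
    return delays


# With rng fixed to 1.0 the delays equal the ceilings, which shows the doubling.
print(reconnect_delays(8, rng=lambda: 1.0))
```

In a client, each delay would be followed by a WebSocket connect attempt and then a call to the sync API to fetch anything missed while offline.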

Latency budget for send message

Step | Target
--- | ---
WebSocket receive + validate | < 5ms
Persist to DB (async option) | 5–20ms sync, or ACK after queue
Registry lookup + push to recipient | < 10ms if online
Delivered ACK back to sender | < 30ms end-to-end p99

WhatsApp-feel latency requires online delivery over WebSocket, not polling. Offline path optimises for reliability over speed.

Optional v2 features

  • Typing indicators: ephemeral events over WebSocket, not persisted.
  • End-to-end encryption: keys on device; server stores ciphertext only — major scope addition.
  • Message search: Elasticsearch index async from the message queue.
  • Media messages: pre-signed S3 upload URL, then message body references object key.

What to say in the last five minutes

Summarise: "WebSocket gateway with connection registry in Redis, durable message store partitioned by conversation, async delivery with push fallback, idempotent client_msg_id on send. Small groups fan-out on write; large channels pull on read. Delivery and read states are explicit ACKs, not guesses." That is a complete interview answer without over-building.

Mock interview checklist

  1. Clarified 1:1 vs group and max group size.
  2. Separated write path (persist) from delivery path (push).
  3. Explained WebSocket scaling (registry or sticky sessions).
  4. Defined sent / delivered / read semantics.
  5. Mentioned offline push and history sync API.
  6. Discussed idempotency on flaky mobile networks.

How this connects to DSA

Message ordering within a conversation is a total order problem. Group delivery is BFS fan-out at scale. Presence TTL is sliding-window expiry — same intuition as rate limiting.

Closing summary

Lead with WebSocket gateway, persistent message store, async delivery, presence in Redis, and push for offline. Separate 1:1 from group fan-out strategy. Discuss idempotency and delivery ACKs — that is what separates a diagram from a production design.
