Design a Notification System (Push, Email, SMS, In-App)
System design for notifications at scale: multi-channel delivery, templates, user preferences, queues, retries, and idempotency for interview prep.
Almost every product sends notifications — order shipped, friend request, password reset. Interviewers use this prompt to test queues, fan-out, third-party providers, and failure handling without building a full chat system. Start with the interview framework: clarify channels, volume, and whether delivery must be exactly-once or at-least-once.
Requirements
Functional
- Send notifications via push (mobile), email, SMS, and in-app inbox.
- Support templates with variables: "Hi {{name}}, your order {{id}} shipped."
- Users set per-channel preferences (marketing off, security alerts on).
- Track delivery status: queued, sent, failed, opened (optional).
- Schedule notifications for future delivery.
Non-functional
- Handle 1M notifications per day; burst to 10K per minute.
- Push and email latency under 30 seconds p99 for transactional alerts.
- At-least-once delivery with idempotency, so retries do not produce duplicate messages, duplicate per-message SMS charges, or spam.
- Third-party providers (SendGrid, FCM, Twilio) can fail or rate-limit.
Clarify priority
Password reset is high priority; marketing digest is low. Ask if you need priority queues. Most interviews accept two tiers: transactional (immediate) and bulk (batched).
High-level architecture
| Component | Role |
|---|---|
| Notification API | Accept send requests from other services |
| Template service | Store and render templates |
| Preference service | User opt-in per channel and category |
| Message queue (Kafka/SQS) | Decouple producers from delivery workers |
| Channel workers | Push worker, email worker, SMS worker |
| Provider adapters | FCM, APNs, SendGrid, Twilio |
| Status store | PostgreSQL or DynamoDB for delivery logs |
| In-app inbox | PostgreSQL + optional Redis cache for unread count |
Send flow step by step
- Order service POST /v1/notifications { user_id, template_id, channel, payload, idempotency_key }.
- API validates idempotency_key — return 200 with same notification_id if duplicate.
- Load user preferences; skip channel if opted out (return accepted but not queued).
- Render template with payload variables.
- Publish event to Kafka topic notifications.{channel} with priority header.
- Worker consumes, calls provider API (FCM/SendGrid/etc.).
- On success: update status = sent. On failure: retry with exponential backoff; dead-letter after N tries.
- In-app channel writes row to inbox table; push/email/SMS skip inbox or mirror summary.
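The API side of the flow above (idempotency check, preference check, render, publish) can be sketched as one handler. This is a minimal in-memory sketch, not a production design: `handle_send`, `render_simple`, `SEEN`, `PREFS`, `TEMPLATES`, and `QUEUE` are hypothetical names standing in for the idempotency store, preference service, template service, and Kafka, and treating a missing preference as opted in is an assumption.

```python
import uuid

# Hypothetical in-memory stand-ins for the real services.
SEEN = {}    # idempotency_key -> notification_id
PREFS = {}   # (user_id, channel) -> enabled; a missing entry means opted in (assumption)
TEMPLATES = {"order_shipped": "Hi {{name}}, your order {{id}} shipped."}
QUEUE = {}   # topic name -> list of published events


def render_simple(template_id, payload):
    """Naive placeholder substitution, enough for this sketch."""
    body = TEMPLATES[template_id]
    for key, value in payload.items():
        body = body.replace("{{" + key + "}}", str(value))
    return body


def handle_send(user_id, template_id, channel, payload, idempotency_key):
    """Return (http_status, body) for POST /v1/notifications."""
    # Duplicate request: return the notification_id created the first time.
    if idempotency_key in SEEN:
        return 200, {"notification_id": SEEN[idempotency_key]}

    notification_id = str(uuid.uuid4())
    SEEN[idempotency_key] = notification_id

    # Opted-out users are accepted but nothing is queued.
    if not PREFS.get((user_id, channel), True):
        return 202, {"notification_id": notification_id, "queued": False}

    # Render before enqueue so workers receive the final text.
    body = render_simple(template_id, payload)

    # Publish to the per-channel topic.
    QUEUE.setdefault(f"notifications.{channel}", []).append(
        {"notification_id": notification_id, "user_id": user_id, "body": body}
    )
    return 202, {"notification_id": notification_id, "queued": True}
```

Calling `handle_send` twice with the same `idempotency_key` leaves one event on `notifications.email` and returns the same `notification_id` both times.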
Idempotency and deduplication
Producers retry on network failure. Store idempotency_key → notification_id in Redis with a 24h TTL, the same pattern used for idempotent payment APIs. Write the key with an atomic set-if-absent so two concurrent retries cannot both create a notification. Workers can also dedupe by (user_id, template_id, event_id) within a time window to prevent double password-reset emails from duplicate upstream events.
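To make the TTL behavior concrete, here is a minimal in-memory sketch of the key store. `IdempotencyStore` and `get_or_create` are hypothetical names, and the dictionary stands in for Redis; with Redis the equivalent would be roughly a `SET` with the `NX` and `EX` options. The injectable `clock` exists only so the expiry can be exercised without waiting.

```python
import time
import uuid


class IdempotencyStore:
    """In-memory stand-in for a keyed store with a TTL (hypothetical class)."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # idempotency_key -> (notification_id, expires_at)

    def get_or_create(self, idempotency_key):
        """Return (notification_id, created); created is False for a replay."""
        now = self._clock()
        entry = self._entries.get(idempotency_key)
        if entry is not None and entry[1] > now:
            return entry[0], False  # key still live: same id as before
        notification_id = str(uuid.uuid4())
        self._entries[idempotency_key] = (notification_id, now + self._ttl)
        return notification_id, True
```

A replay inside the TTL returns the original id with `created=False`; once the TTL has passed, the same key is treated as a new request.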
Templates and localization
Templates live in DB: template_id, channel, locale, subject/body with {{placeholders}}. Render server-side before enqueue — never trust client HTML for email (XSS). For 10 locales, store 10 rows per template or use a CMS. Version templates so old queued jobs reference template_version at enqueue time.
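A small sketch of server-side rendering for the `{{placeholder}}` syntax shown earlier. The function name `render`, the `escape_html` flag, and `TemplateRenderError` are assumptions made for illustration; a real system would more likely use an established template engine. The sketch fails fast on a missing variable and escapes values for HTML email bodies.

```python
import html
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


class TemplateRenderError(Exception):
    """Raised when a template references a variable the payload lacks."""


def render(template_body, payload, escape_html=False):
    """Substitute {{name}} placeholders; raise instead of sending a blank field."""
    names = set(PLACEHOLDER.findall(template_body))
    missing = sorted(name for name in names if name not in payload)
    if missing:
        raise TemplateRenderError(f"missing variables: {missing}")

    def substitute(match):
        value = str(payload[match.group(1)])
        # Escape user-supplied values before they land in an HTML email body.
        return html.escape(value) if escape_html else value

    return PLACEHOLDER.sub(substitute, template_body)
```

For example, `render("Hi {{name}}", {"name": "<b>x</b>"}, escape_html=True)` yields `Hi &lt;b&gt;x&lt;/b&gt;`, while a payload without `name` raises `TemplateRenderError`.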
Scaling workers
Each channel scales independently behind its own consumer group. If the email provider caps you at 100 requests/sec, adding workers does not raise throughput past that cap, so enforce the limit with a token bucket in the worker (split the budget across workers or share one limiter). Push scales higher. SMS is expensive, so batch where possible. Use load balancing for stateless API and worker fleets.
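A token bucket of the kind mentioned above can be sketched in a few lines. `TokenBucket` and `try_acquire` are hypothetical names, the 100/sec figure comes from the text, and the injectable `clock` is there only to make the behavior testable. This version is per-process; several workers sharing one provider quota would need a shared limiter or a per-worker share of the rate.

```python
import time


class TokenBucket:
    """Allow up to `capacity` sends in a burst, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self, n=1):
        """Take n tokens if available; return False so the caller can wait or requeue."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False
```

A worker would call `try_acquire()` before each provider request and sleep briefly or defer the message when it returns `False`.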
Data model sketch
- notifications: id, user_id, channel, template_id, status, created_at, sent_at
- user_preferences: user_id, channel, category, enabled
- templates: id, channel, locale, body, version
- in_app_inbox: id, user_id, title, body, read, created_at
Failure modes
| Failure | Mitigation |
|---|---|
| Provider 429 | Backoff + reduce worker concurrency |
| Invalid device token | Mark token dead; stop retrying push to that device |
| Queue backlog | Scale consumers; shed low-priority marketing first |
| Template render error | Fail fast; alert ops; do not send blank email |
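The retry and dead-letter behavior can be sketched as follows. The exception classes, `backoff_delay`, and `deliver_with_retry` are hypothetical names; the sketch uses exponential backoff with full jitter, which is one common variant and an assumption here, and a plain list stands in for the dead-letter queue.

```python
import random
import time


class TransientError(Exception):
    """Retryable failure, for example a provider 429 or 5xx."""


class PermanentError(Exception):
    """Non-retryable failure, for example an invalid device token."""


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full-jitter backoff: a random delay up to min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * 2 ** attempt)


def deliver_with_retry(send, message, dead_letters, max_attempts=5, sleep=time.sleep):
    """Return 'sent' or 'failed'; failed messages are appended to dead_letters."""
    for attempt in range(max_attempts):
        try:
            send(message)
            return "sent"
        except PermanentError:
            break  # retrying cannot help, go straight to the dead-letter list
        except TransientError:
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    dead_letters.append(message)
    return "failed"
```

Separating transient from permanent errors keeps a dead token or a broken template from consuming the whole retry budget.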
Capacity estimation
1M notifications/day ≈ 12/sec average; the stated burst of 10K per minute is ≈ 170/sec peak. Push payload ~500 bytes → ~85 KB/sec peak egress to FCM, which is trivial. Email HTML ~50 KB × 200K emails/day ≈ 10 GB/day, so the cost is storage for templates and logs, not bandwidth. Worker pool: if each worker sends 50/sec, the ~170/sec peak needs about 4 workers per channel; sizing for a 5K/sec spike would take ~100 workers per channel before headroom. Metadata DB: 1M rows/day × 365 ≈ 365M rows/year, so partition by created_at or archive to cold storage.
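The arithmetic can be rechecked directly from the stated requirements (1M per day, bursts of 10K per minute). The variable names and the `workers_needed` helper are illustrative, and the 50 sends/sec per worker is the assumption taken from the text.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

daily_volume = 1_000_000
burst_per_minute = 10_000
per_worker_rate = 50            # sends/sec per worker, as assumed in the text


def workers_needed(peak_per_sec, rate_per_worker):
    """Workers required to cover a peak, before adding headroom."""
    return math.ceil(peak_per_sec / rate_per_worker)


avg_per_sec = daily_volume / SECONDS_PER_DAY   # about 11.6
burst_per_sec = burst_per_minute / 60          # about 166.7
push_egress_bytes_per_sec = burst_per_sec * 500
email_bytes_per_day = 200_000 * 50_000         # 200K emails at ~50 KB each
rows_per_year = daily_volume * 365

print(f"average: {avg_per_sec:.1f}/sec, burst: {burst_per_sec:.0f}/sec")
print(f"workers for burst: {workers_needed(burst_per_sec, per_worker_rate)}")
print(f"workers for a 5K/sec spike: {workers_needed(5_000, per_worker_rate)}")
print(f"email volume: {email_bytes_per_day / 1e9:.0f} GB/day, rows/year: {rows_per_year:,}")
```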
Priority queues
| Tier | Examples | Handling |
|---|---|---|
| P0 transactional | OTP, password reset, payment failed | Dedicated topic; max workers; no batching |
| P1 product | Order shipped, friend request | Standard queue; retry 3× |
| P2 marketing | Weekly digest, promotions | Low-priority topic; rate-limited; drop under load |
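The tier handling in the table can be illustrated with a small in-memory model. `TieredQueue` is a hypothetical class in which one deque stands in for each tier's topic, and the backlog threshold for shedding P2 is an arbitrary assumption. With Kafka you would more likely run separate topics and consumer groups per tier than a single strict-priority consumer.

```python
from collections import deque

TIERS = ("P0", "P1", "P2")  # highest priority first


class TieredQueue:
    """In-memory stand-in for one topic per tier (hypothetical class)."""

    def __init__(self, shed_threshold=1000):
        self._queues = {tier: deque() for tier in TIERS}
        self._shed_threshold = shed_threshold
        self.dropped = 0

    def backlog(self):
        return sum(len(q) for q in self._queues.values())

    def publish(self, tier, message):
        """Enqueue; under load, drop P2 marketing rather than delay P0/P1."""
        if tier == "P2" and self.backlog() >= self._shed_threshold:
            self.dropped += 1
            return False
        self._queues[tier].append(message)
        return True

    def next_message(self):
        """Return the oldest message from the highest non-empty tier, or None."""
        for tier in TIERS:
            if self._queues[tier]:
                return self._queues[tier].popleft()
        return None
```

Strict priority like this can starve P2 indefinitely, which is acceptable for marketing traffic but worth saying out loud in an interview.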
Latency budget
| Step | Target |
|---|---|
| API accept + idempotency check | < 20ms |
| Preference + template render | < 30ms |
| Enqueue to Kafka | < 10ms |
| Worker → provider (push) | < 5s p99 end-to-end |
User-facing API returns 202 Accepted quickly; delivery is async. Do not block HTTP on SendGrid response.
Provider abstraction
Wrap FCM, APNs, and SendGrid behind a NotificationProvider interface so you can swap vendors without changing workers. Store provider_message_id on success for support lookups. Add a circuit breaker that trips when a provider's error rate spikes: pause marketing traffic and route transactional traffic to a backup provider if one is configured.
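One way to sketch the interface and the breaker together is shown below. Only the name `NotificationProvider` comes from the text; the `send` signature, `CircuitBreaker`, and `send_with_failover` are assumptions, and the breaker counts consecutive failures rather than a true error rate to keep the example short.

```python
import time
from typing import Protocol


class NotificationProvider(Protocol):
    def send(self, recipient: str, body: str) -> str:
        """Deliver the message and return the provider_message_id."""
        ...


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after reset_after seconds."""

    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self):
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def send_with_failover(primary, backup, breaker, recipient, body):
    """Try the primary unless its breaker is open, then fall back to the backup."""
    if breaker.allow():
        try:
            message_id = primary.send(recipient, body)
            breaker.record_success()
            return message_id
        except Exception:
            breaker.record_failure()
    if backup is None:
        raise RuntimeError("primary unavailable and no backup configured")
    return backup.send(recipient, body)
```

Workers depend only on `send`, so an adapter for a different vendor can be dropped in without touching the retry or queue logic.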
Sample API contract
| Endpoint | Response |
|---|---|
| POST /v1/notifications | 202 { notification_id } |
| GET /v1/notifications/{id} | 200 { status, channel, sent_at } |
| GET /v1/users/{id}/preferences | 200 { channels: [...] } |
| PATCH /v1/users/{id}/preferences | 204 |
| GET /v1/inbox?cursor= | 200 paginated in-app messages |
Scheduled and digest notifications
Schedule: write row with send_at; cron scanner publishes to queue when due — same worker path. Daily digest: batch per user at 8am local time — shard users by timezone, enqueue one job per user with aggregated content. Avoid sending 1M jobs at midnight UTC; spread over the hour.
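Two pieces of this are easy to sketch: the scanner's "what is due" query and a stable per-user offset that spreads digest jobs across the hour. `due_rows`, `digest_offset_seconds`, and the row fields `send_at` and `published` are hypothetical; a real scanner would run the filter as an indexed database query, and timezone sharding is left out here.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def due_rows(rows, now):
    """Return scheduled rows whose send_at has passed and that are not yet published."""
    return [r for r in rows if r["send_at"] <= now and not r["published"]]


def digest_offset_seconds(user_id, window_seconds=3600):
    """Stable per-user offset in [0, window_seconds) to spread jobs over the window."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


# Usage: schedule a user's digest at the window start plus their offset.
window_start = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
send_at = window_start + timedelta(seconds=digest_offset_seconds("user-123"))
```

Hashing the user id keeps each user's slot the same from day to day, so digests arrive at a consistent time while the overall load is spread across the hour.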
Push channel in depth
Mobile push requires device tokens per app install. Store user_devices: user_id, platform (iOS/Android), token, last_seen. On send, the worker loads active tokens for user_id and calls FCM (Android) or APNs (iOS). Invalid token response → mark the device dead. A user with three devices gets three push attempts unless you collapse to one notification per logical event. Payloads are limited to roughly 4 KB, so send a short message and a deep link, not the full email body.
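The per-device fan-out can be sketched as below. `Device`, `InvalidTokenError`, and `push_to_user` are hypothetical, the `senders` callables are placeholders for real APNs and FCM adapters, and the 4096-byte limit is the approximate figure from the text rather than a value read from either provider.

```python
import json
from dataclasses import dataclass

MAX_PAYLOAD_BYTES = 4096  # approximate limit taken from the text


class InvalidTokenError(Exception):
    """Hypothetical error a provider adapter raises for a dead token."""


@dataclass
class Device:
    user_id: str
    platform: str  # "ios" or "android"
    token: str
    active: bool = True


def push_to_user(user_id, devices, senders, payload):
    """Send payload to every active device of user_id; return the number delivered.

    senders maps platform -> callable(token, body_bytes).
    """
    body = json.dumps(payload).encode("utf-8")
    if len(body) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload too large for push; send a deep link instead")
    delivered = 0
    for device in devices:
        if device.user_id != user_id or not device.active:
            continue
        try:
            senders[device.platform](device.token, body)
            delivered += 1
        except InvalidTokenError:
            device.active = False  # stop retrying push to this install
    return delivered
```

Marking the device inactive inside the loop means the next send for the same user skips the dead token without another provider round trip.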
Email and SMS specifics
| Channel | Gotcha | Mitigation |
|---|---|---|
| Email | Bounces and spam complaints | Webhook from SendGrid; suppress bad addresses |
| Email | HTML rendering across clients | Test templates; inline CSS for v1 |
| SMS | Cost per segment | Reserve for OTP and critical alerts only |
| SMS | Regulatory opt-in (TCPA, etc.) | Double opt-in stored in preferences |
How this differs from chat
Chat is bidirectional real-time with read receipts. Notifications are mostly one-way fire-and-forget (plus optional in-app inbox). Chat needs WebSocket; notifications need durable queues and provider adapters. You can mention that both use message queues, but chat optimizes for millisecond latency while notifications optimize for reliable delivery within seconds.
Transactional outbox (advanced)
If order DB commit and notification enqueue must be atomic: write order row + outbox row in same DB transaction. Separate relay process reads outbox, publishes to Kafka, marks row sent. Prevents "order saved but notification never queued" without distributed transactions. Mention if interviewer pushes on consistency between DB and queue.
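The outbox pattern can be demonstrated with SQLite from the standard library. The table layout, `place_order`, and `relay_once` are hypothetical, and the `publish` callable stands in for a Kafka producer. The point of the sketch is that the order row and the outbox row share one transaction, and the relay marks a row sent only after publishing succeeds.

```python
import json
import sqlite3


def init_schema(conn):
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            sent INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_order(conn, order_id, user_id):
    """Insert the order and its notification event atomically."""
    event = {"user_id": user_id, "template_id": "order_shipped", "order_id": order_id}
    # "with conn" commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute("INSERT INTO orders (id, user_id) VALUES (?, ?)", (order_id, user_id))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_once(conn, publish):
    """Publish unsent outbox rows in order; mark each sent after publish succeeds."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the relay crashes between `publish` and the `UPDATE`, the row is published again on the next pass, so this gives at-least-once delivery and still relies on downstream deduplication.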
Sample opening (first three minutes)
Interviewer: "Design a notification system." You: "Before I draw boxes — which channels matter for v1: push, email, SMS, in-app? Is this transactional only or marketing too? For scale, should I assume millions per day? I will assume at-least-once delivery with idempotency keys, async workers per channel, and preference checks before enqueue." That opening shows product sense and sets scope.
What to say in the last five minutes
Close with: "Async queue per channel, template rendering before enqueue, preference checks at API, idempotency keys for producers, retries with dead-letter queue, separate workers for push/email/SMS." Mention message queues if you discussed Kafka already.
Mock interview checklist
- Listed channels and asked about priority / volume.
- Drew API → queue → workers → providers.
- Explained idempotency and at-least-once semantics.
- Mentioned user preferences and template rendering.
- Discussed retries, DLQ, and provider rate limits.
Closing summary
Notifications are a queue-and-adapter problem: accept fast, deliver async, scale per channel, and never duplicate transactional messages. Tie back to caching for idempotency keys and inbox unread counts.