DDSA Solutions
Case Study · 7 min read

Design a Notification System (Push, Email, SMS, In-App)

System design for notifications at scale: multi-channel delivery, templates, user preferences, queues, retries, and idempotency for interview prep.

Almost every product sends notifications — order shipped, friend request, password reset. Interviewers use this prompt to test queues, fan-out, third-party providers, and failure handling without building a full chat system. Start with the interview framework: clarify channels, volume, and whether delivery must be exactly-once or at-least-once.

Requirements

Functional

  • Send notifications via push (mobile), email, SMS, and in-app inbox.
  • Support templates with variables: "Hi {{name}}, your order {{id}} shipped."
  • Users set per-channel preferences (marketing off, security alerts on).
  • Track delivery status: queued, sent, failed, opened (optional).
  • Schedule notifications for future delivery.

Non-functional

  • Handle 1M notifications per day; burst to 10K per minute.
  • Push and email latency under 30 seconds p99 for transactional alerts.
  • At-least-once delivery with idempotency to avoid duplicate messages or spam.
  • Third-party providers (SendGrid, FCM, Twilio) can fail or rate-limit.

Clarify priority

Password reset is high priority; marketing digest is low. Ask if you need priority queues. Most interviews accept two tiers: transactional (immediate) and bulk (batched).

High-level architecture

| Component | Role |
|---|---|
| Notification API | Accept send requests from other services |
| Template service | Store and render templates |
| Preference service | User opt-in per channel and category |
| Message queue (Kafka/SQS) | Decouple producers from delivery workers |
| Channel workers | Push worker, email worker, SMS worker |
| Provider adapters | FCM, APNs, SendGrid, Twilio |
| Status store | PostgreSQL or DynamoDB for delivery logs |
| In-app inbox | PostgreSQL + optional Redis cache for unread count |

Send flow step by step

  1. Order service calls POST /v1/notifications with { user_id, template_id, channel, payload, idempotency_key }.
  2. API checks idempotency_key; on a duplicate it returns 200 with the same notification_id instead of creating a new one.
  3. Load user preferences; if the user opted out of the channel, accept the request but do not enqueue it.
  4. Render template with payload variables.
  5. Publish event to Kafka topic notifications.{channel} with priority header.
  6. Worker consumes, calls provider API (FCM/SendGrid/etc.).
  7. On success: update status = sent. On failure: retry with exponential backoff; dead-letter after N tries.
  8. In-app channel writes row to inbox table; push/email/SMS skip inbox or mirror summary.
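The API-side half of this flow (steps 1 to 5) can be sketched in a few lines. This is a minimal in-memory illustration, not a reference implementation: `NotificationApi`, its fields, and the list standing in for the Kafka topic are all hypothetical names chosen for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class NotificationApi:
    """In-memory stand-in for the API tier; every name here is hypothetical."""
    opted_out: set = field(default_factory=set)    # (user_id, channel) pairs
    seen_keys: dict = field(default_factory=dict)  # idempotency_key -> notification_id
    queue: list = field(default_factory=list)      # stand-in for the Kafka topic

    def send(self, user_id, template, channel, payload, idempotency_key):
        # Step 2: a repeated key returns the original notification_id.
        if idempotency_key in self.seen_keys:
            return {"notification_id": self.seen_keys[idempotency_key], "queued": False}
        notification_id = str(uuid.uuid4())
        self.seen_keys[idempotency_key] = notification_id

        # Step 3: accept the request but do not enqueue if the user opted out.
        if (user_id, channel) in self.opted_out:
            return {"notification_id": notification_id, "queued": False}

        # Step 4: render the template with the payload variables.
        body = template
        for key, value in payload.items():
            body = body.replace("{{" + key + "}}", str(value))

        # Step 5: publish to the per-channel topic.
        self.queue.append({
            "topic": f"notifications.{channel}",
            "notification_id": notification_id,
            "user_id": user_id,
            "body": body,
        })
        return {"notification_id": notification_id, "queued": True}
```

Calling `send` twice with the same `idempotency_key` leaves exactly one message on the queue, which is the property the interviewer is usually probing for.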

Idempotency and deduplication

Producers retry on network failure, so the API must tolerate repeated requests. Store idempotency_key → notification_id in Redis with a 24h TTL (the same pattern used in payment API design). Workers can also dedupe by (user_id, template_id, event_id) within a time window, which prevents double password-reset emails when upstream emits duplicate events.
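To make the TTL behaviour concrete, here is a small in-memory sketch of the key store. It is an assumption-laden stand-in for Redis: `IdempotencyStore` and `claim` are hypothetical names, and the injectable `clock` exists only so the expiry can be exercised without waiting. In Redis the same "set only if absent, with expiry" semantics are available through `SET key value NX EX 86400`.

```python
import time


class IdempotencyStore:
    """Maps idempotency_key -> notification_id with a time-to-live."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (notification_id, expires_at)

    def claim(self, key, notification_id):
        """Return the id that owns `key`: the live existing one, else the new one."""
        now = self._clock()
        existing = self._entries.get(key)
        if existing is not None and existing[1] > now:
            return existing[0]  # duplicate request inside the TTL window
        self._entries[key] = (notification_id, now + self._ttl)
        return notification_id
```

A retry inside the window gets the original id back; once the entry expires, the same key is treated as a new request, which is why the TTL should comfortably exceed the producer's retry horizon.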

Templates and localization

Templates live in the DB: template_id, channel, locale, subject/body with {{placeholders}}. Render server-side before enqueue, and never trust client-supplied HTML for email (XSS). For 10 locales, store 10 rows per template or use a CMS. Version templates and record template_version at enqueue time, so a job queued earlier can be traced to the exact template it used.
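A server-side renderer for the {{placeholder}} syntax can be very small. The sketch below is one possible approach, assuming variables are plain values and that HTML escaping is wanted by default for email bodies; `render` and `PLACEHOLDER` are illustrative names, and a production system would more likely use an established template engine.

```python
import html
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: dict, escape_html: bool = True) -> str:
    """Substitute {{name}} placeholders; fail fast on a missing variable."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            # Raising here avoids sending a blank or half-rendered message.
            raise KeyError(f"missing template variable: {name}")
        value = str(variables[name])
        # Escape user-controlled values so they cannot inject markup.
        return html.escape(value) if escape_html else value

    return PLACEHOLDER.sub(substitute, template)
```

For example, `render("Hi {{name}}, your order {{id}} shipped.", {"name": "Ada", "id": 42})` produces the shipped-order message from the requirements, while a payload missing `id` raises instead of enqueueing a broken message.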

Scaling workers

Each channel scales independently behind its own consumer group. If the email provider caps you at 100/sec, adding workers does not raise throughput beyond that cap; enforce the limit with a token bucket in the worker. Push scales higher. SMS is expensive, so batch where possible. Use load balancing for the stateless API and worker fleets.
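A token bucket is simple enough to sketch on a whiteboard. The version below is a single-process illustration with hypothetical names (`TokenBucket`, `try_acquire`); with several workers sharing one provider limit, the bucket would need to live in shared storage or the rate would need to be divided among workers.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start full
        self.clock = clock
        self.last = clock()

    def try_acquire(self, n=1):
        """Take n tokens if available; return False so the caller can wait or requeue."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

An email worker would call `try_acquire()` before each provider request and back off briefly when it returns `False`, which keeps the fleet under the provider's limit without dropping messages.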

Data model sketch

  • notifications: id, user_id, channel, template_id, status, created_at, sent_at
  • user_preferences: user_id, channel, category, enabled
  • templates: id, channel, locale, body, version
  • in_app_inbox: id, user_id, title, body, read, created_at

Failure modes

| Failure | Mitigation |
|---|---|
| Provider 429 | Backoff + reduce worker concurrency |
| Invalid device token | Mark token dead; stop retrying push to that device |
| Queue backlog | Scale consumers; shed low-priority marketing first |
| Template render error | Fail fast; alert ops; do not send blank email |
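The retry policy behind these mitigations can be sketched as a single worker function. This is an illustrative outline under assumptions: `TransientError` and `PermanentError` are hypothetical exception types that a provider adapter would raise, and `dead_letters` is a plain list standing in for a dead-letter queue.

```python
import time


class TransientError(Exception):
    """Retryable failure, for example a 429 or a timeout (hypothetical type)."""


class PermanentError(Exception):
    """Non-retryable failure, for example an invalid device token (hypothetical type)."""


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return min(cap, base * 2 ** attempt)


def deliver(message, send, dead_letters, max_attempts=5, sleep=time.sleep):
    """Try to send; retry transient errors, then dead-letter the message."""
    for attempt in range(max_attempts):
        try:
            send(message)
            return "sent"
        except PermanentError:
            break  # retrying cannot help, go straight to the dead-letter queue
        except TransientError:
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    dead_letters.append(message)
    return "failed"
```

In practice many systems add random jitter to each delay so that workers hitting the same outage do not retry in lockstep; that is omitted here to keep the sketch short.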

Capacity estimation

1M notifications/day ≈ 12/sec average; the stated burst of 10K per minute is ≈ 170/sec. Push payload ~500 bytes → roughly 85 KB/sec peak egress to FCM, which is trivial. Email HTML ~50 KB × 200K emails/day ≈ 10 GB/day of rendered content, so the cost to plan for is storage for templates and logs, not bandwidth. Worker pool: if each worker sends 50/sec, the 170/sec burst needs about 4 workers per channel; sizing for a much larger 5K/sec peak would need ~100 workers per channel with headroom. Metadata DB: 1M rows/day × 365 ≈ 365M rows/year, so partition by created_at or archive to cold storage.
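The arithmetic is worth checking once in code. The sketch below uses the figures from the requirements (1M/day, 10K per minute burst) and the 50/sec per-worker rate from this section; the variable and function names are illustrative only.

```python
import math

SECONDS_PER_DAY = 86_400

daily_notifications = 1_000_000
burst_per_minute = 10_000
per_worker_per_sec = 50       # assumed per-worker send rate from this section
push_payload_bytes = 500

avg_per_sec = daily_notifications / SECONDS_PER_DAY   # about 11.6
burst_per_sec = burst_per_minute / 60                  # about 166.7
push_egress_bytes_per_sec = burst_per_sec * push_payload_bytes
rows_per_year = daily_notifications * 365


def workers_needed(peak_per_sec, per_worker=per_worker_per_sec):
    """Round up, since a fractional worker cannot be deployed."""
    return math.ceil(peak_per_sec / per_worker)
```

With these inputs, `workers_needed(burst_per_sec)` is 4, while `workers_needed(5000)` is 100, which is where the larger worker count comes from if the interviewer raises the peak.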

Priority queues

| Tier | Examples | Handling |
|---|---|---|
| P0 transactional | OTP, password reset, payment failed | Dedicated topic; max workers; no batching |
| P1 product | Order shipped, friend request | Standard queue; retry 3× |
| P2 marketing | Weekly digest, promotions | Low-priority topic; rate-limited; drop under load |
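The tiering and load-shedding behaviour can be illustrated with a single in-process priority queue. This is a simplification: the design above uses separate topics per tier, whereas the sketch collapses them into one heap to show the ordering and the "drop P2 under load" rule. `PriorityDispatcher` and `shed_threshold` are hypothetical names.

```python
import heapq
import itertools

P0, P1, P2 = 0, 1, 2  # lower number means higher priority


class PriorityDispatcher:
    def __init__(self, shed_threshold):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a tier
        self._shed_threshold = shed_threshold
        self.dropped = 0

    def publish(self, tier, message):
        """Enqueue a message; shed marketing traffic when the backlog is too deep."""
        if tier == P2 and len(self._heap) >= self._shed_threshold:
            self.dropped += 1
            return False
        heapq.heappush(self._heap, (tier, next(self._seq), message))
        return True

    def next_message(self):
        """Return the highest-priority pending message, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Transactional messages are never shed in this sketch; only the marketing tier is dropped once the backlog reaches the threshold.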

Latency budget

| Step | Target |
|---|---|
| API accept + idempotency check | < 20ms |
| Preference + template render | < 30ms |
| Enqueue to Kafka | < 10ms |
| Worker → provider (push) | < 5s p99 end-to-end |

User-facing API returns 202 Accepted quickly; delivery is async. Do not block HTTP on SendGrid response.

Provider abstraction

Wrap FCM, APNs, SendGrid behind a NotificationProvider interface. Swap vendors without changing workers. Store provider_message_id on success for support lookups. Circuit-breaker when provider error rate spikes — pause marketing, keep transactional on backup provider if configured.
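One way to express the abstraction is a structural interface plus a small failover helper. This is a sketch under assumptions: the `send` signature, `CircuitBreaker`, and `send_with_failover` are hypothetical, and the breaker here is deliberately minimal (it opens after consecutive failures and never probes the primary again, whereas a real breaker would add a cool-down and a half-open state).

```python
from typing import Protocol


class NotificationProvider(Protocol):
    def send(self, recipient: str, body: str) -> str:
        """Deliver the message and return the provider_message_id."""
        ...


class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def open(self):
        return self.consecutive_failures >= self.failure_threshold

    def record(self, ok):
        # Any success resets the count; failures accumulate.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


def send_with_failover(primary, backup, breaker, recipient, body):
    """Use the primary unless the breaker is open; fall back to the backup."""
    if not breaker.open:
        try:
            message_id = primary.send(recipient, body)
            breaker.record(True)
            return message_id
        except Exception:
            breaker.record(False)
    return backup.send(recipient, body)
```

Because workers depend only on `NotificationProvider`, swapping a vendor means writing one new adapter class rather than touching the worker code.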

Sample API contract

| Endpoint | Response |
|---|---|
| POST /v1/notifications | 202 { notification_id } |
| GET /v1/notifications/{id} | 200 { status, channel, sent_at } |
| GET /v1/users/{id}/preferences | 200 { channels: [...] } |
| PATCH /v1/users/{id}/preferences | 204 |
| GET /v1/inbox?cursor= | 200 paginated in-app messages |

Scheduled and digest notifications

Schedule: write row with send_at; cron scanner publishes to queue when due — same worker path. Daily digest: batch per user at 8am local time — shard users by timezone, enqueue one job per user with aggregated content. Avoid sending 1M jobs at midnight UTC; spread over the hour.
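Computing each user's digest send time, with a stable per-user offset to spread load across the hour, might look like the sketch below. It makes two simplifying assumptions that should be stated out loud: timezones are represented as fixed UTC offsets in minutes (so daylight saving time is ignored), and the spread is derived from a CRC32 of the user id. `next_digest_utc` is a hypothetical helper name.

```python
import zlib
from datetime import datetime, timedelta, timezone


def next_digest_utc(now_utc, utc_offset_minutes, user_id, hour=8, window_seconds=3600):
    """Return the next send time in UTC for a digest at `hour` local time."""
    local_tz = timezone(timedelta(minutes=utc_offset_minutes))
    local_now = now_utc.astimezone(local_tz)
    target = local_now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= local_now:
        target += timedelta(days=1)  # today's slot already passed locally
    # Stable per-user offset so jobs spread across the window instead of
    # all firing at the top of the hour.
    jitter = zlib.crc32(user_id.encode("utf-8")) % window_seconds
    return (target + timedelta(seconds=jitter)).astimezone(timezone.utc)
```

The scanner can store the result as send_at, which means digests travel the same path as any other scheduled notification.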

Push channel in depth

Mobile push requires device tokens per app install. Store user_devices: user_id, platform (iOS/Android), token, last_seen. On send, worker loads active tokens for user_id; calls FCM (Android) or APNs (iOS). Invalid token response → mark device dead. Users with three devices get three push attempts unless you collapse to one notification per logical event. Payload size limits (~4KB) — deep links only, not full email body.
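The per-device fan-out and dead-token handling can be sketched as follows. The names are illustrative: `Device` mirrors the user_devices row, `InvalidToken` is a hypothetical exception an adapter would raise when the provider rejects a token, and `senders` maps a platform to a callable wrapping FCM or APNs.

```python
from dataclasses import dataclass


@dataclass
class Device:
    token: str
    platform: str  # "ios" or "android"
    active: bool = True


class InvalidToken(Exception):
    """Raised by an adapter when the provider rejects the token (hypothetical)."""


def push_to_user(devices, payload, senders, max_bytes=4096):
    """Send `payload` to every active device; return how many sends succeeded."""
    if len(payload.encode("utf-8")) > max_bytes:
        raise ValueError("payload exceeds push size limit")
    delivered = 0
    for device in devices:
        if not device.active:
            continue
        try:
            senders[device.platform](device.token, payload)
            delivered += 1
        except InvalidToken:
            device.active = False  # stop retrying push to this install
    return delivered
```

A user with three installs gets three attempts here; collapsing to one notification per logical event would be an additional policy layered on top.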

Email and SMS specifics

| Channel | Gotcha | Mitigation |
|---|---|---|
| Email | Bounces and spam complaints | Webhook from SendGrid; suppress bad addresses |
| Email | HTML rendering across clients | Test templates; inline CSS for v1 |
| SMS | Cost per segment | Reserve for OTP and critical alerts only |
| SMS | Regulatory opt-in (TCPA, etc.) | Double opt-in stored in preferences |

How this differs from chat

Chat is bidirectional real-time with read receipts. Notifications are mostly one-way fire-and-forget (plus optional in-app inbox). Chat needs WebSocket; notifications need durable queues and provider adapters. You can mention both use message queues but chat optimizes latency to milliseconds; notifications optimize reliable delivery over seconds.

Transactional outbox (advanced)

If order DB commit and notification enqueue must be atomic: write order row + outbox row in same DB transaction. Separate relay process reads outbox, publishes to Kafka, marks row sent. Prevents "order saved but notification never queued" without distributed transactions. Mention if interviewer pushes on consistency between DB and queue.
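The outbox pattern is easy to demonstrate with SQLite from the standard library. This is a sketch under assumptions: the table layout, `place_order`, `relay_once`, and the `order_placed` template id are all invented for illustration, and `publish` stands in for the Kafka producer call.

```python
import json
import sqlite3


def init(conn):
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            sent INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_order(conn, order_id, user_id):
    """Write the order and its outbox row in one transaction."""
    event = {"order_id": order_id, "user_id": user_id, "template_id": "order_placed"}
    with conn:  # commits both inserts, or rolls back both on error
        conn.execute("INSERT INTO orders (id, user_id) VALUES (?, ?)", (order_id, user_id))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_once(conn, publish):
    """Publish unsent outbox rows in order, marking each sent afterwards."""
    rows = conn.execute("SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        # A crash between publish and this update causes a re-publish,
        # so the relay is at-least-once and consumers must dedupe.
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

The relay marks a row sent only after publishing, so the failure mode is a duplicate event rather than a lost one, which fits the at-least-once plus idempotency approach used throughout this design.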

Sample opening (first three minutes)

Interviewer: "Design a notification system." You: "Before I draw boxes — which channels matter for v1: push, email, SMS, in-app? Is this transactional only or marketing too? For scale, should I assume millions per day? I will assume at-least-once delivery with idempotency keys, async workers per channel, and preference checks before enqueue." That opening shows product sense and sets scope.

What to say in the last five minutes

Close with: "Async queue per channel, template rendering before enqueue, preference checks at API, idempotency keys for producers, retries with dead-letter queue, separate workers for push/email/SMS." Mention message queues if you discussed Kafka already.

Mock interview checklist

  1. Listed channels and asked about priority / volume.
  2. Drew API → queue → workers → providers.
  3. Explained idempotency and at-least-once semantics.
  4. Mentioned user preferences and template rendering.
  5. Discussed retries, DLQ, and provider rate limits.

Closing summary

Notifications are a queue-and-adapter problem: accept fast, deliver async, scale per channel, and deduplicate transactional messages so at-least-once delivery does not turn into duplicate sends. Tie back to caching: Redis holds the idempotency keys and the inbox unread counts.
