Design a Chat System (WhatsApp-class)
Persistent connections, fan-out per device, store-and-forward for offline users.
Intro
A chat system is event-driven: each message travels from the sender's device through one or more brokers to every recipient device, and delivery and read receipts travel back to the sender. The hard parts are connection management, message ordering, and offline delivery.
Functional
- 1:1 and group messaging.
- Delivery receipts: sent, delivered, read.
- Online presence.
- Offline message storage with TTL.
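The receipt states above form a one-way progression. A minimal sketch of that rule follows, assuming "sent" means accepted by the server, "delivered" means it reached a recipient device, and "read" means the recipient opened it. The names Receipt and advance are hypothetical, chosen for illustration.

```python
from enum import IntEnum


class Receipt(IntEnum):
    # Assumed meanings; the outline only names the three states.
    SENT = 1       # accepted by the server
    DELIVERED = 2  # reached a recipient device
    READ = 3       # opened by the recipient


def advance(current: Receipt, incoming: Receipt) -> Receipt:
    """Receipts only move forward; late or duplicate updates are ignored."""
    return max(current, incoming)
```

Treating the state as monotonic means a "delivered" receipt that arrives after "read" (possible under at-least-once delivery) cannot downgrade what the sender sees.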
Non-functional
- Message delivery p95 < 500 ms.
- 1 B users, 200 M concurrent connections.
- At-least-once delivery; clients dedupe by message id.
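At-least-once delivery means a client can receive the same message more than once, for example after a retry or a reconnect. One possible client-side dedupe is sketched below, assuming every message carries a unique message id. The Deduper class and its capacity default are hypothetical, not part of any specific client.

```python
from collections import OrderedDict


class Deduper:
    """Remembers the most recent message ids and rejects repeats."""

    def __init__(self, capacity: int = 10_000) -> None:
        self._seen = OrderedDict()  # message_id -> None, oldest first
        self._capacity = capacity

    def accept(self, message_id: str) -> bool:
        """Return True the first time an id is seen, False for redeliveries."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)  # keep recently seen ids longest
            return False
        self._seen[message_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest id
        return True
```

The bounded window trades memory for a small risk: a redelivery that arrives after its id has been evicted would be shown again, so the capacity should comfortably exceed the expected retry horizon.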
Components
Edge gateway
Holds one persistent WebSocket connection (over TLS) per device.
Session registry
Maps user_id → connected gateway node, one entry per connected device (Redis / memdb).
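The registry can be pictured as a two-level map. The sketch below is an in-memory stand-in for the Redis-style store, assuming one entry per connected device. SessionRegistry, device_id, and the method names are hypothetical.

```python
from collections import defaultdict


class SessionRegistry:
    """In-memory stand-in: user_id -> {device_id: gateway_node}."""

    def __init__(self) -> None:
        self._sessions = defaultdict(dict)

    def connect(self, user_id: str, device_id: str, gateway: str) -> None:
        # A reconnect to a different gateway simply overwrites the old route.
        self._sessions[user_id][device_id] = gateway

    def disconnect(self, user_id: str, device_id: str) -> None:
        devices = self._sessions.get(user_id)
        if devices is None:
            return
        devices.pop(device_id, None)
        if not devices:
            del self._sessions[user_id]  # no devices left: user is offline

    def routes(self, user_id: str) -> dict:
        """Return a copy of {device_id: gateway_node}; empty if offline."""
        return dict(self._sessions.get(user_id, {}))
```

An empty result from routes doubles as the "recipient is offline" signal for the store-and-forward path.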
Message bus
Kafka or similar pub-sub, with a topic per recipient or per shard.
Message store
Cassandra: per-user inbox, with rows keyed by timestamp + message id.
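To make the inbox layout concrete, here is a small in-memory model of one user's inbox, keyed by (timestamp, message id). It is a sketch of the access pattern only, not Cassandra code. The Inbox class, the millisecond timestamps, and the cursor convention are assumptions.

```python
class Inbox:
    """One user's inbox: (timestamp_ms, message_id) -> body."""

    def __init__(self) -> None:
        self._rows = {}

    def put(self, timestamp_ms: int, message_id: str, body: str) -> None:
        # Writing the same key twice is a no-op, so retried writes are safe.
        self._rows[(timestamp_ms, message_id)] = body

    def after(self, cursor: tuple, limit: int = 100) -> list:
        """Return up to `limit` rows strictly after cursor, oldest first."""
        keys = sorted(k for k in self._rows if k > cursor)
        return [(ts, mid, self._rows[(ts, mid)]) for ts, mid in keys[:limit]]

    def expire(self, now_ms: int, ttl_ms: int) -> int:
        """Drop rows older than the TTL; return how many were removed."""
        cutoff = now_ms - ttl_ms
        stale = [k for k in self._rows if k[0] < cutoff]
        for k in stale:
            del self._rows[k]
        return len(stale)
```

A reconnecting client would pass the key of the last message it saw as the cursor, which is how backfill can be served from the store rather than from the broker.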
Push notification service
FCM / APNs for offline recipients.
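Taken together, the components suggest a simple per-recipient delivery decision: persist, then forward to each connected device, or fall back to a push notification. The sketch below shows one way to express that. The plan_delivery function, the action tuples, and the "store first" ordering are assumptions for illustration.

```python
def plan_delivery(recipient: str, online_routes: dict) -> list:
    """Decide what to do with one message for one recipient.

    online_routes maps user_id -> {device_id: gateway_node}, as a
    session registry lookup might return it.
    """
    # Persist first so the message survives a gateway crash mid-delivery.
    actions = [("store", recipient)]
    devices = online_routes.get(recipient, {})
    if devices:
        # One forward per connected device, via the gateway holding it.
        for device_id, gateway in sorted(devices.items()):
            actions.append(("forward", gateway, device_id))
    else:
        # Nobody connected: wake the device through FCM / APNs.
        actions.append(("push", recipient))
    return actions
```

For example, a recipient with a phone on gw-1 and a laptop on gw-2 yields one store action and two forward actions, while an unknown or disconnected recipient yields store plus push.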
Trade-offs
WebSocket vs. long-poll
Pros
- WebSocket is bidirectional, with low per-message overhead once the connection is open.
Cons
- WebSocket connections are long-lived, so each one is pinned to a gateway (sticky load balancing) and the edge nodes must hold per-connection state.
Per-recipient topic vs. per-conversation topic
Pros
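- Per-recipient → each device reads one ordered stream; fan-out is paid at write time, one write per recipient.
- Per-conversation → one write per message; consumers are limited to the conversation's members.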
Cons
- Per-recipient write volume explodes for huge groups: one message becomes one write per member.
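The trade-off can be put in numbers. This is a rough sketch, assuming a per-recipient layout appends one copy of the message per recipient and a per-conversation layout appends once. The function name and layout labels are hypothetical.

```python
def broker_writes(recipients: int, layout: str) -> int:
    """Broker appends needed to send one message to `recipients` users."""
    if layout == "per-recipient":
        return recipients  # one copy lands in each recipient's topic
    if layout == "per-conversation":
        return 1           # single append; members read the shared topic
    raise ValueError(f"unknown layout: {layout}")


# A 1:1 chat costs the same either way; a large group does not.
one_to_one = broker_writes(1, "per-recipient"), broker_writes(1, "per-conversation")
big_group = broker_writes(5000, "per-recipient"), broker_writes(5000, "per-conversation")
```

Under these assumptions a 5,000-member group turns every message into 5,000 appends in the per-recipient layout, which is why that layout is usually reserved for 1:1 and small groups.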
Scale concerns
- Connection density per gateway (~250k–1M connections per node with tuned epoll).
- Group messages: fan out server-side at the gateway tier, not from the sender's device.
- Backfill on reconnect: stream missed messages from the message store, not from the broker.
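The connection-density range translates directly into a gateway fleet size for the 200 M concurrent connections in the requirements. A back-of-the-envelope sketch, ignoring headroom for failover and deploys (the helper name is hypothetical):

```python
def gateways_needed(connections: int, per_node: int) -> int:
    """Ceiling division: nodes required to hold `connections` sockets."""
    return -(-connections // per_node)


CONCURRENT = 200_000_000

low_density = gateways_needed(CONCURRENT, 250_000)     # 800 nodes
high_density = gateways_needed(CONCURRENT, 1_000_000)  # 200 nodes
```

So the fleet lands somewhere between roughly 200 and 800 gateway nodes before any spare capacity is added.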