
Design a Chat System (WhatsApp-class)

Persistent connections, fanout per device, store-and-forward for offline users.


Intro

A chat system is event-driven: a message moves from the sender's device through one or more brokers to every recipient device, with reliable delivery and read receipts. The hard parts are connection management, ordering, and offline delivery.

Functional

  • 1:1 and group messaging.
  • Delivery receipts: sent, delivered, read.
  • Online presence.
  • Offline message storage with TTL.
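
Because updates can arrive duplicated or out of order, receipt state is easiest to reason about if it only ever moves forward. A minimal sketch, assuming the three states listed above map to the meanings given in the comments (those meanings, and the names `Receipt` and `advance`, are illustrative assumptions, not part of the original design):

```python
from enum import IntEnum


class Receipt(IntEnum):
    SENT = 1       # assumed: the server accepted the message
    DELIVERED = 2  # assumed: a recipient device acknowledged it
    READ = 3       # assumed: the recipient opened the conversation


def advance(current: Receipt, incoming: Receipt) -> Receipt:
    """Return the new receipt state, ignoring stale or duplicate updates."""
    return max(current, incoming)
```

With this rule, a late "delivered" update arriving after "read" leaves the message marked as read.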

Non-functional

  • Message delivery p95 < 500 ms.
  • 1 B users, 200 M concurrent connections.
  • At-least-once delivery; client must dedupe.
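
At-least-once delivery means the same message can reach a device more than once, so the client needs an idempotency check keyed on the message id. One possible sketch, assuming ids are strings and that remembering a bounded window of recent ids is acceptable (the `Deduper` class and its capacity are hypothetical):

```python
from collections import OrderedDict


class Deduper:
    """Remembers recently seen message ids, bounded in size."""

    def __init__(self, capacity: int = 10_000) -> None:
        self._seen = OrderedDict()
        self._capacity = capacity

    def first_time(self, message_id: str) -> bool:
        """Return True only the first time a message id is observed."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)  # keep hot ids from being evicted
            return False
        self._seen[message_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest id
        return True
```

A client would call `first_time(msg_id)` before rendering a message and drop it when the call returns `False`. The bounded window trades memory for the risk of re-showing a very old duplicate.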

Components

  • Edge gateway

    Holds the persistent WebSocket / TLS connection per device.

  • Session registry

    Maps user_id → the gateway node holding each of that user's device connections (Redis / memdb).

  • Message bus

    Kafka / pub-sub. Topic per recipient or shard.

  • Message store

    Cassandra: per-user inbox, with rows ordered by timestamp + message id.

  • Push notification service

    FCM / APNS for offline recipients.
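
How the components cooperate on one inbound message can be sketched with in-memory stand-ins. Everything here is an assumption for illustration: the `Router` class, its field names, and the choice to persist before forwarding are not taken from the original, and the lists simply record what a real system would send over the network.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    # Stand-in for the session registry: user_id -> {device_id: gateway node}.
    sessions: dict = field(default_factory=dict)
    # Stand-in for the message store: user_id -> list of (message_id, body).
    inbox: dict = field(default_factory=dict)
    # Records of what this sketch "sent" to gateways and to the push service.
    forwarded: list = field(default_factory=list)
    pushed: list = field(default_factory=list)

    def route(self, recipient: str, message_id: str, body: str) -> None:
        # Persist first so a reconnecting device can backfill from the store.
        self.inbox.setdefault(recipient, []).append((message_id, body))
        devices = self.sessions.get(recipient, {})
        for device_id, gateway in devices.items():
            # One forward per connected device (fanout per device).
            self.forwarded.append((gateway, device_id, message_id))
        if not devices:
            # No live connection: hand off to the push notification service.
            self.pushed.append((recipient, message_id))
```

For example, with `Router(sessions={"alice": {"phone": "gw-1"}})`, routing a message to `alice` records a forward to `gw-1`, while routing to an unknown user records a push instead.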

Trade-offs

WebSocket vs. long-poll

Pros

  • WebSocket is full-duplex, with low per-message overhead once the connection is established.

Cons

  • WebSocket needs sticky load balancing and connection state on edges.

Per-recipient topic vs. per-conversation topic

Pros

  • Per-recipient → a 1:1 message is one write, and each device reads only its own topic.
  • Per-conversation → small consumer count.

Cons

  • Per-recipient explodes for huge groups: one message to an N-member group becomes N-1 writes.
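
The write-amplification difference between the two schemes can be made concrete with a simplified cost model. This sketch counts only bus writes per published message and assumes the sender does not receive a copy of its own message; the function name and scheme labels are hypothetical.

```python
def bus_writes(topic_scheme: str, group_size: int) -> int:
    """Bus writes needed to publish one message to a group of `group_size` members."""
    if topic_scheme == "per_recipient":
        return group_size - 1  # one copy per member other than the sender
    if topic_scheme == "per_conversation":
        return 1  # one copy; every member's consumer reads it
    raise ValueError(f"unknown scheme: {topic_scheme}")
```

Under this model a 1:1 chat costs one write either way, while a 1,000-member group costs 999 writes per message with per-recipient topics and one write with a per-conversation topic.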

Scale concerns

  • Connection density per gateway (~250k–1M with tuned epoll); 200 M concurrent connections therefore needs roughly 200–800 gateway nodes before headroom.
  • Group messages: fan out on the server (at the gateway), not on the sender's device, so the sender uploads each message once.
  • Backfill on reconnect: stream from the message store, not from the broker.
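
Backfill from the message store can be sketched as a cursor read: the device reports the last (timestamp, message id) it saw, and the server streams everything after it. This is an in-memory illustration only; the `backfill` function and the tuple layout are assumptions, with a sorted list standing in for the per-user inbox.

```python
import bisect


def backfill(inbox, last_seen_key):
    """Return messages stored after `last_seen_key`, in inbox order.

    `inbox` is a list of (timestamp, message_id, body) tuples kept sorted by
    (timestamp, message_id), standing in for the per-user inbox.
    """
    keys = [(ts, mid) for ts, mid, _ in inbox]
    # bisect_right skips the entry equal to the cursor, so it is not resent.
    start = bisect.bisect_right(keys, last_seen_key)
    return inbox[start:]
```

Using the message id as a tiebreaker keeps the cursor unambiguous when two messages share a timestamp. Any duplicates that still slip through are handled by client-side dedupe.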
