Design a Distributed Message Queue (Kafka-class)
A partitioned, replicated, append-only log with at-least-once delivery, per-partition ordering, and consumer groups, sized for 1M+ msgs/sec.
Intro
A Kafka-class message queue is a common backbone for event-driven architectures. Three mechanisms dominate the design: partitioning (for scale), replication (for durability), and consumer groups (for ordered, sharded consumption). Candidates often over-think delivery semantics; for most workloads a sound default is at-least-once delivery with idempotent consumers.
Functional
- Producer publishes messages to a topic.
- Consumer reads messages from a chosen offset, which allows replay of retained history.
- Consumer groups for parallel processing with partition rebalancing.
- Topic creation + partition / retention configuration.
Non-functional
- Throughput ≥ 1 M msgs/sec per cluster.
- Producer p99 publish < 10 ms.
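- Durability: replication factor 3; a write is acknowledged only after at least 2 of the 3 replicas have it (in Kafka terms, acks=all with min.insync.replicas=2).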
- Ordered within partition; no global order.
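The throughput target turns into a partition count once a message size and a per-partition write budget are assumed. The sketch below uses illustrative numbers (1 KB average message, 10 MB/s sustained per partition). Both figures are assumptions rather than part of the requirements, and the function name is hypothetical.

```python
import math


def partitions_needed(msgs_per_sec: int, avg_msg_bytes: int,
                      per_partition_mb_s: float) -> int:
    """Minimum partition count so that no partition exceeds its write budget."""
    total_mb_s = msgs_per_sec * avg_msg_bytes / 1_000_000
    return math.ceil(total_mb_s / per_partition_mb_s)


# Assumed inputs: 1 M msgs/sec, 1 KB messages, 10 MB/s per partition.
print(partitions_needed(1_000_000, 1_000, 10.0))  # 100
```

Under these assumptions the cluster carries about 1 GB/s of writes before replication. With replication factor 3, the brokers together write roughly three times that.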
Components
Brokers
Store the partition logs on disk and replicate them to follower brokers.
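A minimal in-memory sketch of one partition's log, assuming a plain Python list in place of on-disk segment files. The class and method names are hypothetical. It shows the two properties the design relies on: offsets are assigned in append order, and replay is just a read from an older offset.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """In-memory stand-in for one partition's append-only log."""
    records: list = field(default_factory=list)

    def append(self, value: bytes) -> int:
        """Append a record and return the offset it was assigned."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 100) -> list:
        """Return up to max_records starting at offset; records are never mutated."""
        return self.records[offset:offset + max_records]


log = PartitionLog()
first = log.append(b"order-created")
second = log.append(b"order-paid")
replayed = log.read(first)  # re-reading from an old offset replays history
```

A real broker would write to segment files and serve reads from the page cache. The list only models the offset semantics.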
Topic metadata
Partitions, leaders, and in-sync replica (ISR) sets, coordinated via ZooKeeper or KRaft.
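The ISR determines which records are safe to expose to consumers. In Kafka the high watermark is the smallest log end offset among the in-sync replicas, and consumers read only below it. The sketch below computes that value. The function name and the broker ids are hypothetical.

```python
def high_watermark(log_end_offsets: dict[str, int], isr: set[str]) -> int:
    """Offset below which every in-sync replica already holds the record."""
    return min(log_end_offsets[replica] for replica in isr)


# Broker b3 has fallen behind and been removed from the ISR,
# so it no longer holds back the high watermark.
offsets = {"b1": 10, "b2": 8, "b3": 3}
hw = high_watermark(offsets, isr={"b1", "b2"})  # 8
```

Shrinking the ISR keeps a slow follower from stalling producers. The cost is that fewer replicas hold the newest records.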
Producer client
Batches messages per partition and sends each batch to that partition's leader.
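A sketch of the producer-side steps: hash the key to pick a partition, then group messages into one batch per partition. CRC32 is used here only as a stand-in hash (Kafka's default partitioner uses murmur2), and both function names are hypothetical.

```python
import zlib
from collections import defaultdict


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; the same key always maps to the same partition."""
    return zlib.crc32(key) % num_partitions


def batch_by_partition(messages, num_partitions: int) -> dict:
    """Group (key, value) pairs into per-partition batches, preserving send order."""
    batches = defaultdict(list)
    for key, value in messages:
        batches[choose_partition(key, num_partitions)].append(value)
    return dict(batches)


msgs = [(b"user-1", "login"), (b"user-2", "login"), (b"user-1", "logout")]
batches = batch_by_partition(msgs, num_partitions=8)
```

Because both "user-1" events hash to the same partition and keep their send order within the batch, per-key ordering follows from per-partition ordering.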
Consumer client
Reads each partition sequentially and tracks a committed offset per partition.
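A sketch of why committing offsets after processing gives at-least-once delivery. It assumes a single partition modelled as a Python list, and the class name is hypothetical. If the handler raises partway through a batch, the committed offset does not move, so the whole batch is delivered again on the next poll.

```python
class OffsetConsumer:
    """Single-partition consumer that commits only after a batch is processed."""

    def __init__(self, log: list, committed: int = 0):
        self.log = log
        self.committed = committed  # next offset to read

    def poll_and_process(self, handler, batch_size: int = 2) -> int:
        """Process one batch; advance the committed offset only on success."""
        batch = self.log[self.committed:self.committed + batch_size]
        for record in batch:
            handler(record)  # may raise; the commit below is then skipped
        self.committed += len(batch)
        return len(batch)
```

Committing before processing would flip the guarantee to at-most-once: a crash after the commit would skip the unprocessed records.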
Consumer group coordinator
Assigns partitions to group members and triggers a rebalance when members join or leave.
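A sketch of one possible assignment strategy, round-robin over sorted members, assuming the coordinator recomputes the full assignment whenever membership changes. The function name and member ids are hypothetical; real coordinators support several strategies.

```python
def assign_partitions(partitions: list[int],
                      members: list[str]) -> dict[str, list[int]]:
    """Give each partition to exactly one member, spreading them evenly."""
    if not members:
        return {}
    ordered = sorted(members)
    assignment = {m: [] for m in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


before = assign_partitions(list(range(6)), ["c1", "c2"])
# {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
after = assign_partitions(list(range(6)), ["c1", "c2", "c3"])
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```

In this example the join of c3 moves partitions 2, 3, and 4 to new owners. That movement is the cost of a naive full reassignment, and it is what makes frequent rebalances expensive.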
Schema registry
Avro / Protobuf schema versioning.
Trade-offs
At-least-once vs exactly-once
Pros
- At-least-once: simple, scales.
- Exactly-once: no duplicate results within the pipeline; built on producer idempotency + transactions.
Cons
- At-least-once: consumers must dedupe.
- Exactly-once: throughput and latency penalty from transaction coordination, plus added complexity.
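A sketch of the consumer-side dedupe that at-least-once delivery requires. It assumes each record carries its partition and offset, and that the handler's state and the "last applied offset" are persisted together. Here both live in memory, and the class name is hypothetical. Because a partition is ordered, one offset per partition is enough to recognise a redelivery.

```python
class IdempotentHandler:
    """Applies each (partition, offset) at most once by tracking the last applied offset."""

    def __init__(self):
        self.last_applied = {}  # partition -> highest offset already applied
        self.total = 0          # example state: a running sum

    def handle(self, partition: int, offset: int, amount: int) -> bool:
        """Return True if the record was applied, False if it was a duplicate."""
        if offset <= self.last_applied.get(partition, -1):
            return False  # redelivered record, skip the side effect
        self.total += amount
        self.last_applied[partition] = offset
        return True


h = IdempotentHandler()
h.handle(0, 0, 5)
h.handle(0, 1, 7)
h.handle(0, 1, 7)  # redelivery after a crash: ignored
```

In a real system the state update and the offset update would need to be atomic, for example in one database transaction. Otherwise a crash between the two reintroduces duplicates.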
Push vs pull consumers
Pros
- Pull: consumer reads at its own pace (self-throttles); broker needs no per-consumer flow control, so it stays simpler.
Cons
- Pull: polling overhead when the log is idle, usually mitigated with long polling.
Scale concerns
- Hot partition (one key dominating).
- Consumer rebalance storms.
- Disk I/O bottleneck: keeping writes sequential (append-only) is critical.
- Cross-region replication.
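A hot partition shows up as skew between per-partition load. The sketch below computes one simple skew measure, the busiest partition's load divided by the mean load. The function name and the message counts are made up for illustration.

```python
def skew_ratio(per_partition_counts: list[int]) -> float:
    """Busiest partition's load relative to the mean; 1.0 means perfectly even."""
    mean = sum(per_partition_counts) / len(per_partition_counts)
    return max(per_partition_counts) / mean


# Illustrative counts: one key dominates and lands on the last partition.
print(skew_ratio([100, 100, 100, 700]))  # 2.8
```

A ratio well above 1.0 means adding partitions will not help, because the dominant key still maps to a single partition.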