Tech · 6 min read
Message Queues Compared: Kafka, RabbitMQ, SQS, and When to Use Each
Choosing a message queue locks in architectural decisions for years. The major options compared by throughput, ordering, durability, and operational complexity.
By Jarviix Engineering · Apr 19, 2026
Message queues are foundational infrastructure for most modern backend systems. They decouple producers from consumers, smooth traffic spikes, enable async processing, and form the backbone of event-driven architectures. But the choice of queue technology has long-lasting implications — it determines throughput limits, operational complexity, ordering guarantees, and the messaging patterns your team can use.
This post compares the major message queue technologies and provides a decision framework for choosing among them.
Why use a message queue
Decoupling
Producers don't need to know about consumers. Add new consumers without changing producer code. Replace consumers without producer downtime.
Async processing
Long-running work (image processing, email sending, report generation) doesn't block user-facing requests. Producer enqueues; consumer processes asynchronously.
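A minimal sketch of this pattern using only Python's standard library; an in-process `queue.Queue` stands in for a real broker, and the "processing" is a placeholder:

```python
import queue
import threading

# The producer enqueues work and returns immediately; a background
# worker drains the queue at its own pace.
jobs = queue.Queue()
results = []

def worker():
    while True:
        job = jobs.get()
        if job is None:            # sentinel: shut down
            jobs.task_done()
            break
        results.append(f"processed:{job}")  # stand-in for real work
        jobs.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()

# The user-facing path: enqueue and return without waiting.
for req in ["resize-image", "send-email"]:
    jobs.put(req)

jobs.join()      # wait for the backlog to drain (for the demo only)
jobs.put(None)   # stop the worker
t.join()
print(results)
```

With a durable broker in place of the in-process queue, the producer and worker would live in separate processes or services, which is where the decoupling benefits kick in.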
Spike absorption
Traffic spikes that would overwhelm synchronous processing are queued and worked off at a sustainable rate.
Reliability
If a consumer crashes mid-processing, the message remains in the queue and is retried. Compare this to direct HTTP calls, where a transient failure can lose the request.
Multi-consumer fan-out
One message to many consumers. Useful for analytics, audit logging, multi-purpose event handling.
Buffering
Queue acts as buffer between mismatched producer and consumer rates.
The major contenders
Apache Kafka
Distributed log-based messaging. Originally built at LinkedIn; now an Apache project; Confluent is the major commercial vendor.
Architecture: Topics partitioned across brokers; consumers read partitions independently; messages persist on disk for configurable retention (days to years).
Throughput: Millions of messages/second per cluster. Designed for high-throughput.
Ordering: Within a partition (not across partitions).
Durability: Excellent — replicated to multiple brokers; survives node failures.
Replay: Built-in — consumers can re-read from any offset.
Operational complexity: High. ZooKeeper (replaced by KRaft in newer versions), partition rebalancing, broker tuning, schema registry, consumer group coordination.
Use cases:
- Event sourcing
- Log aggregation
- Real-time analytics pipelines
- High-throughput message bus
- Event-driven microservice architectures
Don't use for:
- Low-volume task queues (overkill)
- Simple async work distribution
- Teams without Kafka operational experience
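Kafka's per-partition ordering comes from key-based partitioning: every message with the same key hashes to the same partition, so ordering holds per key but not across the topic. A toy illustration of that routing (using SHA-256 for determinism rather than Kafka's actual murmur2 partitioner):

```python
import hashlib

# Messages with the same key always land in the same partition,
# so one consumer sees that key's events in produce order.
NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

partitions = {p: [] for p in range(NUM_PARTITIONS)}
events = [("user-42", "login"), ("user-7", "login"),
          ("user-42", "purchase"), ("user-42", "logout")]

for key, event in events:
    partitions[partition_for(key)].append((key, event))

# All of user-42's events share one partition, in produce order.
print(partitions[partition_for("user-42")])
```

This is also why key choice matters: a hot key concentrates load on one partition, capping parallelism for that key.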
RabbitMQ
Traditional message broker. Implements AMQP protocol; supports MQTT, STOMP. Mature, battle-tested.
Architecture: Queues, exchanges, bindings. Sophisticated routing (topic-based, header-based, fanout).
Throughput: 20-50K msgs/sec per node (much higher with optimizations and lazy queues).
Ordering: FIFO per queue.
Durability: Good — persistent queues, message acknowledgments, quorum queues replicated across nodes (the successor to classic mirrored queues).
Replay: Limited — once consumed, a message is gone (unless you build a replay layer).
Operational complexity: Moderate. Single-node easy; clustering more complex.
Use cases:
- Traditional task queues (background jobs)
- Complex routing (different consumers for different message types)
- RPC-style request/response over messaging
- Workflow orchestration
- Lower-throughput, latency-sensitive workloads
Don't use for:
- Extremely high throughput (>100K/sec sustained)
- Long-term message retention
- Event sourcing patterns
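RabbitMQ's routing flexibility comes from its exchange types. Topic exchanges match dot-separated routing keys against binding patterns where `*` matches exactly one word and `#` matches zero or more. A sketch of that matching rule, mimicking the broker's routing decision rather than any client API:

```python
# AMQP topic-exchange matching as RabbitMQ defines it:
# '*' matches exactly one dot-separated word, '#' matches zero or more.
def topic_matches(pattern: str, routing_key: str) -> bool:
    def match(pat, key):
        if not pat:
            return not key
        if pat[0] == "#":
            # '#' may absorb zero or more words
            return any(match(pat[1:], key[i:]) for i in range(len(key) + 1))
        if not key:
            return False
        if pat[0] == "*" or pat[0] == key[0]:
            return match(pat[1:], key[1:])
        return False
    return match(pattern.split("."), routing_key.split("."))

print(topic_matches("logs.*", "logs.error"))     # one word after 'logs'
print(topic_matches("logs.*", "logs.error.db"))  # two words: no match
print(topic_matches("logs.#", "logs.error.db"))  # '#' absorbs the rest
```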
AWS SQS
Fully managed queue from AWS. Two flavors: Standard (at-least-once delivery, best-effort ordering) and FIFO (exactly-once processing, strict ordering).
Architecture: AWS-managed; no servers to operate.
Throughput: Standard effectively unlimited; FIFO 300 msgs/sec per queue (3,000 with batching; more with high-throughput mode).
Ordering: Standard — best effort. FIFO — strict.
Durability: Excellent — replicated across AZs.
Replay: None — once acknowledged, gone.
Operational complexity: Minimal — AWS manages everything.
Cost: Per-message pricing. Cheap for low-medium volume; can become expensive at very high volumes.
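A back-of-envelope sketch of how per-message pricing scales. The per-million price below is an illustrative assumption (check current AWS pricing), and the two-requests-per-message factor is a simplification; the point is that cost is negligible at low volume and real money at high volume:

```python
PRICE_PER_MILLION = 0.40  # USD, assumed for illustration

def monthly_cost(msgs_per_sec: float) -> float:
    # each message typically costs at least two requests: send + receive
    requests = msgs_per_sec * 2 * 86_400 * 30
    return requests / 1_000_000 * PRICE_PER_MILLION

print(round(monthly_cost(10), 2))      # modest task queue
print(round(monthly_cost(50_000), 2))  # high-throughput event stream
```

At tens of messages per second the bill is pocket change; at tens of thousands per second, self-hosting starts to look attractive.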
Use cases:
- Simple async task queues
- Decoupling AWS Lambda functions
- Microservice async communication on AWS
- Anywhere "managed simplicity" beats "operational sophistication"
Don't use for:
- Multi-cloud architectures (vendor lock-in)
- Very high throughput where per-message cost adds up
- Replay or event sourcing patterns
- Complex routing requirements
Apache Pulsar
Newer than Kafka; designed to address Kafka pain points. Two-tier architecture (compute + storage separation).
Architecture: Brokers + BookKeeper for persistence. Topics, partitions, subscriptions.
Throughput: Comparable to Kafka.
Ordering: Within partition.
Durability: Excellent.
Operational complexity: High — more components than Kafka.
Use cases:
- Multi-tenant scenarios
- Geo-replication requirements
- When Kafka's specific limitations bite
Adoption: Growing but still much smaller than Kafka. Use Kafka unless you have specific Pulsar requirements.
Redis Streams
Built into Redis; lightweight streaming.
Throughput: Very high (Redis is fast).
Durability: Configurable (depends on Redis persistence settings).
Replay: Yes, within retention.
Operational complexity: Low if already running Redis.
Use cases:
- Lightweight event streaming
- When Redis is already in stack
- Real-time analytics with limited retention
Don't use for:
- Production-critical event streams (Kafka is more battle-tested)
- Very large message volumes (Redis is RAM-bound)
Google Pub/Sub
Fully managed pub/sub from GCP. Comparable to SQS in spirit but with stronger fan-out semantics.
Use cases: GCP-native event-driven architectures.
Decision framework
| Requirement | Recommended |
|---|---|
| Simple async tasks, AWS-only | SQS |
| Simple async tasks, multi-cloud | RabbitMQ |
| High-throughput event stream | Kafka |
| Event sourcing, replay needed | Kafka |
| Complex routing, RPC patterns | RabbitMQ |
| Already running Redis, lightweight needs | Redis Streams |
| GCP-native, simple | Pub/Sub |
| Multi-tenant, geo-distributed | Pulsar (or Kafka with care) |
Key concepts you must understand
At-least-once vs at-most-once vs exactly-once
- At-most-once: messages may be lost, never duplicated. Fastest, simplest. Use for non-critical events.
- At-least-once: messages never lost, may be duplicated. Common default. Requires consumer idempotency.
- Exactly-once: messages delivered exactly once. Hard to achieve; comes with throughput cost. Modern Kafka supports this with transactions.
Most production systems aim for at-least-once with idempotent consumers — practical exactly-once.
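The idempotent-consumer half of that combination can be sketched in a few lines. In production the seen-IDs set lives in durable storage (for example a database unique constraint), not in memory; the message IDs and amounts here are illustrative:

```python
# At-least-once delivery means duplicates arrive; the consumer tracks
# processed message IDs and skips repeats, making redelivery harmless.
seen = set()
balance = 0

def handle(msg_id: str, amount: int) -> None:
    global balance
    if msg_id in seen:      # duplicate delivery: ignore
        return
    balance += amount       # the side effect we must not repeat
    seen.add(msg_id)

# at-least-once delivery in action: msg-1 arrives twice
for msg_id, amount in [("msg-1", 100), ("msg-2", 50), ("msg-1", 100)]:
    handle(msg_id, amount)

print(balance)  # 150, not 250: the duplicate was absorbed
```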
Consumer groups
Multiple consumers process from same topic; each message goes to one consumer in the group (load balancing). Different groups all get all messages (fan-out).
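A toy simulation of those two behaviors side by side; the round-robin assignment policy and the group/member names are illustrative, not how any particular broker balances work:

```python
from itertools import cycle

# Within a group each message goes to exactly one member (round-robin
# here for simplicity); every group independently sees every message.
messages = ["m1", "m2", "m3", "m4"]
groups = {
    "billing":   {"b1": [], "b2": []},   # two members share the work
    "analytics": {"a1": []},             # separate group: full copy
}

for name, members in groups.items():
    assignee = cycle(sorted(members))    # round-robin within the group
    for msg in messages:
        members[next(assignee)].append(msg)

print(groups["billing"])    # work split between b1 and b2
print(groups["analytics"])  # a1 received all four messages
```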
Partitioning
Topic split into partitions; each partition processed independently. Provides parallelism. Order is per-partition only.
Acknowledgments
Consumer must acknowledge messages; unacknowledged messages get redelivered. Critical for durability.
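A broker-agnostic sketch of ack-based redelivery: a delivered message sits in an in-flight set until acked, and a nack (or, in real brokers, an expired visibility timeout) returns it to the queue. This is a toy model, not a real client:

```python
from collections import deque

class AckQueue:
    def __init__(self):
        self.ready = deque()
        self.in_flight = {}
        self._next_tag = 0

    def put(self, body):
        self.ready.append(body)

    def get(self):
        tag, body = self._next_tag, self.ready.popleft()
        self._next_tag += 1
        self.in_flight[tag] = body     # delivered but not yet confirmed
        return tag, body

    def ack(self, tag):                # processing succeeded: drop for good
        del self.in_flight[tag]

    def nack(self, tag):               # processing failed: requeue
        self.ready.append(self.in_flight.pop(tag))

q = AckQueue()
q.put("charge-card")
tag, body = q.get()
q.nack(tag)              # consumer crashed mid-processing
tag, body = q.get()      # same message delivered again
q.ack(tag)
print(body, len(q.in_flight))
```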
Dead-letter queues (DLQ)
Messages that fail processing repeatedly go to a DLQ for manual investigation. Essential for production systems.
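A minimal sketch of DLQ routing; the retry limit and the handlers are illustrative:

```python
# After MAX_RETRIES failed attempts a message is parked in the DLQ for
# human inspection instead of blocking the queue or looping forever.
MAX_RETRIES = 3
dlq = []

def process_with_dlq(msg: str, handler) -> bool:
    for attempt in range(MAX_RETRIES):
        try:
            handler(msg)
            return True
        except Exception:
            continue            # transient failure: retry
    dlq.append(msg)             # poison message: park it
    return False

def always_fails(msg: str) -> None:
    raise ValueError("cannot parse")

process_with_dlq("good", lambda m: None)
process_with_dlq("poison", always_fails)
print(dlq)
```

Real brokers implement this natively (RabbitMQ dead-letter exchanges, SQS redrive policies); the retry-then-park logic is the same.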
Operational considerations
Monitoring
- Queue depth
- Consumer lag (messages produced but not yet consumed, and the time delay between produce and consume)
- Failed message rate
- DLQ size
- Throughput (messages/sec)
Backpressure handling
- What happens when queues fill up?
- Producer-side throttling? Reject? Block?
- Define behavior before scale forces the question
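A sketch of the reject option using a bounded queue; a real producer would also need a policy for what happens to rejected items (retry later, shed load, surface an error to the caller):

```python
import queue

# When the buffer fills, the producer must choose: block, drop, or
# reject. Here we reject, so callers see the pressure immediately.
buf = queue.Queue(maxsize=3)

def produce(item, policy="reject"):
    try:
        buf.put(item, block=(policy == "block"))
        return True
    except queue.Full:
        return False            # rejected: caller decides what next

accepted = [produce(i) for i in range(5)]
print(accepted)   # first three accepted, last two rejected
```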
Schema management
- How do you evolve message formats?
- Schema registry (Confluent, Apicurio) for Kafka
- Versioning conventions for everything else
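One common convention is an explicit version field in the message envelope, with consumers upgrading old shapes before handling them. The field names and the v1-to-v2 change here are illustrative:

```python
import json

def upgrade(msg: dict) -> dict:
    if msg["version"] == 1:
        # v1 carried a single "name" field; v2 splits it in two
        first, _, last = msg["name"].partition(" ")
        msg = {"version": 2, "first_name": first, "last_name": last}
    return msg

# An old producer is still emitting v1 messages...
raw = json.dumps({"version": 1, "name": "Ada Lovelace"})
# ...and an upgraded consumer normalizes before processing.
print(upgrade(json.loads(raw)))
```

The key property: producers and consumers can be deployed independently, because the consumer accepts every version still in flight.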
Observability
- Distributed tracing across producer/queue/consumer
- Correlation IDs in messages
- Detailed logging on consumer failures
Common mistakes
- Choosing Kafka because it's "industry standard": operational complexity outweighs benefits for many use cases
- No idempotency in consumers: at-least-once delivery means duplicates; non-idempotent consumers create data corruption
- No DLQ: failed messages disappear or block the queue
- Ignoring consumer lag: queues quietly back up while everyone assumes things are working
- Hardcoded queue names without versioning: schema/structural changes become migration nightmares
- Using queues for synchronous workflows: introducing 500ms latency on a request that should be 50ms
- Massive messages: embedding blobs in messages when they belong in object storage, with the message carrying only a reference
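The fix for the last mistake is the claim-check pattern: upload the blob to object storage and enqueue only a reference. A sketch with a dict standing in for the object store (in practice S3, GCS, or similar):

```python
import uuid

object_store = {}   # stand-in for real object storage

def enqueue_large(payload: bytes) -> dict:
    key = str(uuid.uuid4())
    object_store[key] = payload                     # upload the blob
    return {"blob_key": key, "size": len(payload)}  # small message

def consume(msg: dict) -> bytes:
    return object_store[msg["blob_key"]]            # fetch on demand

msg = enqueue_large(b"x" * 10_000_000)  # ~10 MB stays out of the queue
print(msg["size"], len(consume(msg)))
```

The queue moves a few hundred bytes per message regardless of payload size, and consumers that don't need the blob never pay to fetch it.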
What to read next
- Kafka explained simply — Kafka deep-dive.
- Event-driven architecture — patterns built on queues.
- Eventual consistency — what async messaging forces.
- Distributed locks — coordination beyond queues.
Choosing a message queue is one of the more consequential architectural decisions a backend team makes. The right choice depends on throughput, durability, operational maturity, and team experience — not just feature lists. Start simple with the most boring tool that solves your problem; reach for sophistication only when concrete requirements demand it.
Frequently asked questions
When should I use Kafka vs RabbitMQ?
Kafka for high-throughput, append-only event streams where consumers may be slow or numerous (analytics, event sourcing, log aggregation). RabbitMQ for traditional task queues with complex routing, RPC patterns, or low-throughput message workflows. Throughput-wise: Kafka handles millions of msgs/sec; RabbitMQ tops out around 50K/sec per node. If you don't know, start with RabbitMQ — it's simpler and Kafka's operational complexity is significant.
Is SQS as capable as Kafka or RabbitMQ?
Different category. SQS is a fully managed standard queue (FIFO available) with simple semantics — you pay per message, no operational overhead, scales to massive throughput. It lacks Kafka's replay capability, complex routing, and partitioning model. For straightforward async work distribution, SQS is excellent and removes operational burden. For event sourcing, complex routing, or multi-consumer patterns, you need Kafka or RabbitMQ.
Can I run multiple message queues in one architecture?
Yes, and it's common. Many production systems use Kafka for event streaming + SQS for simple async tasks + RabbitMQ for legacy integrations. The cost: more operational surface area, more knowledge required across the team. The benefit: each tool used for its strengths. Start simple with one queue; add others only when specific limitations of the first justify it.