Tech · 6 min read
Kafka Explained Simply: Topics, Partitions, Consumers, and the Mental Model That Makes It Click
Kafka isn't a queue — it's a distributed log. Once you internalize that one shift, the partitions, consumer groups, offsets, and replay semantics all start to make sense.
By Jarviix Engineering · Apr 19, 2026
Kafka is one of those technologies that confuses developers because it looks like a message queue but behaves like a database log. Once you internalize that one mental-model shift, every other piece — partitions, consumer groups, offsets, replay — falls into place.
This post is a calm walk through Kafka's core concepts and the patterns where it earns its place in production stacks.
The mental model: Kafka is a distributed log
A Kafka topic is, fundamentally, an append-only log file (or many of them — see partitions). Producers append messages to the end. Consumers read from positions in the log called offsets. The log is durable — messages stay around even after they've been read, until retention policy says otherwise.
This is a different shape from a traditional queue:
| | Traditional queue (SQS, RabbitMQ) | Kafka |
|---|---|---|
| Lifetime | Message gone after consumed | Message stays until retention expires |
| Replay | Hard or impossible | Trivial — rewind offset |
| Multiple consumers | Each gets a copy or splits the work | Each consumer group reads independently |
| Ordering | Per-queue, often weak | Strict per-partition |
The "log, not queue" perspective explains almost every Kafka design choice.
Topics and partitions
A topic is a logical stream — orders, user-events, clicks.
A topic is split into partitions for parallelism. Each partition is its own ordered log. Messages within a partition are strictly ordered; messages across partitions are not.
topic: orders
├─ partition 0: [msg, msg, msg, ...]
├─ partition 1: [msg, msg, msg, ...]
└─ partition 2: [msg, msg, msg, ...]
When you produce a message, you (or Kafka) pick which partition it goes to. The standard rule:
- No key: messages are spread across partitions (classically round-robin; newer clients use a "sticky" partitioner that fills one batch at a time).
- With key: hash(key) % partitions. Same key always goes to the same partition.
Consequence: if you key by user_id, all events for user 42 land in the same partition and are processed in order. This is the foundation of "ordering where it matters" in Kafka.
Consumer groups
A consumer group is one or more consumers cooperating to process a topic.
- Each partition is owned by exactly one consumer in a group.
- Each consumer can own multiple partitions.
- Consumers in different groups read independently — each group has its own offsets.
topic: orders (3 partitions)
group "checkout-processor" (2 consumers)
consumer A → owns partitions 0, 1
consumer B → owns partition 2
group "analytics" (1 consumer)
consumer C → owns partitions 0, 1, 2
This gives you two superpowers:
- Horizontal scaling within a group. Add a consumer; partitions get redistributed; throughput goes up.
- Multiple independent consumers of the same topic. Analytics, billing, audit logs all read the same events without affecting each other.
The cap on parallelism within a group is the partition count. 12 partitions = up to 12 consumers in a group; the 13th sits idle.
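The one-owner-per-partition invariant and the idle-13th-consumer effect both fall out of the assignment rule. Here is a sketch of range-style assignment (real Kafka assignors such as range, round-robin, and cooperative-sticky are more involved, but the invariant is the same; `range_assign` is an illustrative name):

```python
def range_assign(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Range-assignment sketch: partitions are split into contiguous
    chunks, with earlier consumers absorbing the remainder. Each
    partition is owned by exactly one consumer in the group."""
    consumers = sorted(consumers)
    per, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for i, c in enumerate(consumers):
        n = per + (1 if i < extra else 0)
        assignment[c] = list(range(start, start + n))
        start += n
    return assignment

print(range_assign(3, ["A", "B"]))
# {'A': [0, 1], 'B': [2]}
```

With 12 partitions and 13 consumers, this scheme hands the 13th consumer an empty list, which is exactly the idle consumer described above.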
Offsets
Each consumer group tracks where it is in each partition with an offset — the index of the next message to read.
partition 0: [m0, m1, m2, m3, m4, m5, ...]
                              ^ group "checkout" offset = 4
                          ^ group "analytics" offset = 3
Offsets are stored in Kafka itself (in a special internal topic __consumer_offsets). When a consumer commits an offset, it's saying "I've durably processed up to here; if I crash, restart me from this point."
Two flavors:
- Auto-commit (default in many clients). Offsets are committed every few seconds. Easy, but you can replay or skip messages on crash.
- Manual commit. Explicitly commit after processing. More control, more code.
For most production work, manual commit after successful processing is the right pattern.
Replication and durability
Each partition has multiple replicas — one leader and several followers. Producers write to the leader; followers replicate.
The producer can ask for various ack semantics:
- acks=0: fire and forget. Fast, lossy.
- acks=1: wait for the leader only. OK, but data is lost if the leader fails before followers replicate.
- acks=all: wait for all in-sync replicas. Durable; pay latency.
For anything you actually care about, acks=all plus min.insync.replicas=2 is the safe default.
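In config terms, that safe default looks roughly like this (property names per Apache Kafka's producer and topic configs; `enable.idempotence` is an extra setting commonly paired with it, not something mandated above):

```properties
# producer
acks=all                  # wait for all in-sync replicas
enable.idempotence=true   # broker dedupes producer retries

# topic (created with replication factor 3)
min.insync.replicas=2     # acks=all writes fail fast if fewer than 2 replicas are in sync
```

With replication factor 3 and `min.insync.replicas=2`, you can lose one broker and keep accepting durable writes.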
Retention
Messages don't disappear after being read. They live until retention says they should go away. Two policies:
Time/size based
log.retention.hours=168 # 7 days
log.retention.bytes=1073741824 # 1 GiB per partition
After the limit, old segments are deleted. Pure event streams (clicks, telemetry) usually use this.
Compacted
For each key, keep only the latest value. The topic becomes a "current state" snapshot — perfect for caching last-known values, replicating reference data, building changelog-driven materialized views.
key=user-42, value=v1 ← will be removed
key=user-43, value=v1
key=user-42, value=v2 ← latest, kept
You can replay a compacted topic from the beginning and get the latest value of every key — the foundation of Kafka Streams, Debezium CDC, and many event-sourcing patterns.
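The end state of compaction is easy to model: keep only the last record per key. (The real log cleaner works segment by segment in the background, and tombstone records eventually delete keys entirely; this sketch only shows the final shape.)

```python
def compact(log: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Compaction sketch: keep only the last record for each key,
    preserving the order in which the survivors appear in the log."""
    latest = {key: i for i, (key, _) in enumerate(log)}  # last index per key
    return [rec for i, rec in enumerate(log) if latest[rec[0]] == i]

log = [("user-42", "v1"), ("user-43", "v1"), ("user-42", "v2")]
print(compact(log))
# [('user-43', 'v1'), ('user-42', 'v2')]
```

Replaying the compacted log from offset 0 hands you the latest value of every key, which is the "current state snapshot" property described above.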
When Kafka is the right tool
Kafka shines in:
- High-throughput event streams. Hundreds of thousands of messages per second sustainably, on commodity hardware.
- Decoupling producers and consumers. Many producers, many consumers, varied speeds, none coordinating.
- Replay. "Reprocess all of last week's data through the new model." Trivial in Kafka, hard in queues.
- Event sourcing / CDC. The log is the source of truth; everything else is derived state.
- Stream processing. Kafka Streams, Flink, Spark Streaming — all built on Kafka topics as the substrate.
When Kafka is overkill:
- Simple background work. Use Redis Streams, RabbitMQ, or SQS. Kafka has operational weight that small workloads don't need.
- Request/response. Kafka is one-way. Don't try to build RPC on top of it.
- Strict global ordering. Per-partition ordering is easy; total ordering across a topic costs you all parallelism.
The patterns that matter
A few recurring patterns:
Idempotent consumers
At-least-once delivery is the default. Same message can be delivered twice (after a consumer crash, after a rebalance). Your consumer must process duplicates safely. (See idempotency in APIs.)
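The simplest idempotency scheme is dedupe-by-event-id. A minimal sketch, assuming each event carries a unique `event_id` (in production the seen-ids set would live in the same datastore, and ideally the same transaction, as the state it protects):

```python
processed_ids: set[str] = set()
balance = {"user-42": 0}

def handle(event: dict) -> None:
    """Idempotent handler sketch: a redelivered event (same event_id)
    is acknowledged but never applied twice."""
    if event["event_id"] in processed_ids:
        return                                   # duplicate after a rebalance: skip
    balance[event["user"]] += event["amount"]    # apply the effect once
    processed_ids.add(event["event_id"])

evt = {"event_id": "evt-001", "user": "user-42", "amount": 10}
handle(evt)
handle(evt)  # redelivery after a crash: no double charge
print(balance)
# {'user-42': 10}
```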
Consumer lag as the SLO
The single most important Kafka metric: consumer lag — the offset gap between latest produced message and last consumed offset, per partition. Growing lag = your consumers can't keep up. Set alerts. Plan capacity around it.
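The arithmetic is just log-end offset minus committed offset, per partition. A sketch (clients expose these numbers through their own APIs; the dict shapes here are placeholders for whatever yours returns):

```python
def consumer_lag(end_offsets: dict[int, int],
                 committed: dict[int, int]) -> dict[int, int]:
    """Lag per partition: latest produced offset minus the group's
    committed offset. A partition with no commit yet counts from 0."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

lag = consumer_lag(end_offsets={0: 1500, 1: 1510, 2: 1490},
                   committed={0: 1500, 1: 1200, 2: 1489})
print(lag)
# {0: 0, 1: 310, 2: 1}
```

Alert on the max (or sum) across partitions growing over time, not on any single snapshot.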
The transactional outbox
Don't do db.write() then kafka.produce() separately — they're not atomic. Use the outbox pattern: write the event to a DB table in the same transaction as the business state, then publish from the outbox. (Covered in detail in event-driven architecture.)
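The core of the outbox pattern fits in a few lines. A sketch using SQLite as the stand-in database (table names and the `produce` callback are illustrative; a real relay would also handle publish retries, which is why the consumer side still needs to be idempotent):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: str) -> None:
    """Business write + event write in ONE transaction, so they cannot diverge."""
    with db:  # commits both inserts atomically; rolls both back on error
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "placed"))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("orders", json.dumps({"order_id": order_id, "status": "placed"})))

def publish_outbox(produce) -> int:
    """Relay sketch: drain unpublished rows to Kafka, then mark them."""
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, topic, payload in rows:
        produce(topic, payload)  # e.g. a real producer.produce(topic, payload)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()
    return len(rows)

sent = []
place_order("o-1")
publish_outbox(lambda topic, payload: sent.append((topic, payload)))
```

If the process dies between the transaction and the relay, the event is still in the outbox and gets published on the next poll: no lost events, at the cost of possible duplicates.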
Schema registry
Treat events as a public contract. Use Avro, Protobuf, or JSON Schema with a registry. Validate on produce; consumers reject invalid messages. Stops "we changed the event shape and broke five downstream services" incidents.
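The "validate on produce" step is just a contract check before the send. A toy stand-in for a real registry lookup (production setups use Avro/Protobuf/JSON Schema with something like Confluent Schema Registry; the schema dict and `validate` here are hand-rolled for illustration):

```python
# Toy "schema": required field name -> required Python type.
ORDER_PLACED_V1 = {"order_id": str, "user_id": str, "amount_cents": int}

def validate(event: dict, schema: dict) -> list[str]:
    """Contract check sketch: reject on missing fields or wrong types
    before producing, so bad events never reach downstream consumers."""
    errors = [f"missing field: {f}" for f in schema if f not in event]
    errors += [f"wrong type for {f}: expected {t.__name__}"
               for f, t in schema.items()
               if f in event and not isinstance(event[f], t)]
    return errors

good = {"order_id": "o-1", "user_id": "u-42", "amount_cents": 999}
bad = {"order_id": "o-1", "amount_cents": "999"}
assert validate(good, ORDER_PLACED_V1) == []
assert validate(bad, ORDER_PLACED_V1)  # missing user_id, amount is a string
```

The value of a real registry is that this check is enforced centrally and versioned, so producers can evolve schemas without silently breaking consumers.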
Three rules
- Pick partition count carefully. Repartitioning a live topic is operationally painful. Over-provision modestly (room to add consumers), don't go crazy.
- Manual commit, after successful processing. Auto-commit is "fire and pray". Manual gives you the right semantics with a few extra lines.
- Monitor consumer lag like uptime. It's the single most actionable Kafka health metric. Page on sustained lag growth.
What to read next
Kafka is the substrate for most modern event-driven systems. Event-driven architecture covers the patterns built on top; microservices observability covers how to debug them in production. The Twitter timeline HLD writeup is the canonical example of using Kafka-style partitioned logs for fanout at scale — a great applied read once the concepts here have clicked.
Frequently asked questions
Is Kafka the right choice for simple background jobs?
Usually no. Kafka shines for high-throughput streams, event sourcing, and replayable logs. For ordinary background work queues, RabbitMQ, SQS, or Redis Streams are simpler.
How many partitions do I need?
At least as many as your maximum desired consumer parallelism. More partitions = more parallelism but also more rebalance overhead and more files on disk. Start with 12-24 for most workloads and adjust based on observed lag.
Can I delete data from Kafka?
Yes — by retention policy (time or size based) or via compaction (keep only the latest value per key). You don't typically delete individual messages. Retention is the design choice you make per topic.