
Tech · 6 min read

Event-Driven Architecture: When Events Pay Off (and When They Don't)

What event-driven really means, the patterns that work — events vs commands, choreography vs orchestration, sagas, outbox — and the failure modes nobody warns you about.

By Jarviix Engineering · Apr 19, 2026


Event-driven architecture is one of the most powerful — and most misused — patterns in modern backend design. Done right, it gives you decoupled services, easy fan-out, and the ability to add consumers without touching producers. Done wrong, it gives you untraceable bugs, "wait, why isn't this happening?" mysteries, and a deployment graph nobody can hold in their head.

This post walks through what events actually are, the patterns that work, and the failure modes nobody warns you about.

Events vs commands

The first distinction that clarifies a lot of confusion:

  • Command: "Please do X." Sender expects something to happen. Receiver owns the success or failure. Names are imperative: PlaceOrder, ChargeCard, SendEmail.
  • Event: "X happened." Sender is reporting a fact. Many or zero receivers may care. Names are past tense: OrderPlaced, CardCharged, EmailSent.

Why this matters: commands have one obvious owner, events don't. Events make it natural to add a new consumer (analytics, notifications, audit log) without coordinating with the producer. Commands force tight coupling between sender and receiver.

A good rule: services emit events about themselves; they receive commands about themselves. "OrderService emits OrderPlaced; OrderService receives PlaceOrder." Other services subscribe to OrderPlaced however they want.
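The distinction is easy to sketch in code. Here is a minimal illustration (the class and field names are hypothetical, not from any particular framework): the command is imperative and handled by exactly one owner; the event is a past-tense, immutable fact that the service publishes about itself.

```python
from dataclasses import dataclass

# Command: imperative, one obvious owner, the sender expects an outcome.
@dataclass
class PlaceOrder:
    order_id: str
    customer_id: str

# Event: past tense, an immutable fact; zero or many subscribers may react.
@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    customer_id: str

class OrderService:
    """Receives commands about itself; emits events about itself."""

    def __init__(self, publish):
        self._publish = publish  # callback into whatever event bus you use

    def handle(self, cmd: PlaceOrder) -> None:
        # ... persist the order, then report the fact:
        self._publish(OrderPlaced(cmd.order_id, cmd.customer_id))
```

Note that `OrderService` never knows who, if anyone, consumes `OrderPlaced` — that's the decoupling.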

When events earn their keep

Events shine in three scenarios:

  1. Fan-out. One thing happened; many parties need to react. OrderPlaced → email service sends confirmation, inventory service decrements stock, analytics service records the sale, recommendation service updates the model. Without events, the order service has to know about all of them.
  2. Asynchronous workflows. "When a user signs up: send an email, create a Stripe customer, provision a workspace, post to Slack." Doing all of these inline makes signup slow and brittle. Emit an event; let each handler do its piece independently.
  3. Cross-team integration. Team A's domain shouldn't know about Team B's. An event published by A is a public contract; B subscribes without A even knowing.

Where events are not the right tool:

  • Synchronous user flows that need an immediate answer. "Did the payment go through?" — that's a request/response, not an event.
  • Internal coordination within a single bounded context. A function call is fine; you don't need an event bus to talk to your own module.
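The fan-out case is worth seeing concretely. Below is a toy in-process bus (real systems would use Kafka, SNS+SQS, etc., and the handler names are invented for illustration): the producer publishes once, and each consumer subscribes independently without the producer knowing any of them.

```python
from collections import defaultdict

class EventBus:
    """Toy in-process bus: one publish, any number of independent subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
log = []

# Three independent consumers; the producer knows nothing about them.
bus.subscribe("OrderPlaced", lambda e: log.append(("email", e["order_id"])))
bus.subscribe("OrderPlaced", lambda e: log.append(("inventory", e["order_id"])))
bus.subscribe("OrderPlaced", lambda e: log.append(("analytics", e["order_id"])))

bus.publish("OrderPlaced", {"order_id": "o42"})
```

Adding a fourth consumer is one more `subscribe` call — no change to the publisher.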

Choreography vs orchestration

Two ways to coordinate work across multiple services.

Choreography

Each service emits events. Other services react. There's no central conductor.

OrderService → OrderPlaced
  → InventoryService reserves stock → InventoryReserved
  → PaymentService charges card → PaymentCharged
  → ShippingService schedules shipment → ShipmentScheduled

Pros. Loose coupling. Adding a new step is as simple as adding a subscriber. Resilient — failures in one service don't block others.

Cons. No single place to see "what is this workflow doing right now". Debugging requires tracing across services. Logic spread across many services can become hard to reason about.
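The order-flow chain above can be sketched as a few subscriptions on a bare-bones bus (service logic elided; the event names follow the example, everything else is invented). Notice there is no conductor anywhere — the "workflow" exists only as the sum of the subscriptions.

```python
# Each service reacts to the previous service's event and emits its own.
handlers = {}

def subscribe(event_type, handler):
    handlers.setdefault(event_type, []).append(handler)

def publish(event_type, data, trace):
    trace.append(event_type)  # record the chain so we can inspect it
    for handler in handlers.get(event_type, []):
        handler(data, trace)

# The choreography: no service knows the full flow, only its trigger.
subscribe("OrderPlaced",       lambda d, t: publish("InventoryReserved", d, t))
subscribe("InventoryReserved", lambda d, t: publish("PaymentCharged", d, t))
subscribe("PaymentCharged",    lambda d, t: publish("ShipmentScheduled", d, t))
```

The downside is visible too: to answer "what happens after OrderPlaced?", you have to read every subscriber.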

Orchestration

A central workflow coordinator (an "orchestrator" or "saga manager") tells each service what to do next.

OrderOrchestrator:
  → call InventoryService.reserve()
  → call PaymentService.charge()
  → call ShippingService.schedule()
  → on any failure, run compensation

Pros. Workflow is visible in one place. Easier to debug, easier to add retries and compensation. Tools like Temporal, AWS Step Functions, Cadence make this very productive.

Cons. The orchestrator becomes a coordination point — and a potential bottleneck or single source of bugs.

In practice: use choreography for loosely-coupled fan-out (notifications, analytics, side effects), and orchestration for business-critical multi-step workflows (order placement, money movement) where you need visibility.

Sagas: distributed transactions without 2PC

When a workflow spans multiple services and you can't (or shouldn't) use distributed transactions, the saga pattern is the standard answer:

  • Each step has a corresponding compensating action that undoes it.
  • If step N fails, run the compensations for steps 1..N-1 in reverse order.
  • Each step and each compensation is idempotent — they may be retried.

For the order workflow:

1. ReserveInventory     compensate: ReleaseInventory
2. ChargePayment        compensate: RefundPayment
3. ScheduleShipment     compensate: CancelShipment

If ChargePayment fails, run ReleaseInventory. If everything succeeds, no compensations needed.

Sagas don't give you ACID — they give you "eventually consistent and reasonable". Customers might briefly see "Order pending, awaiting payment" before either becoming "Order confirmed" or "Order cancelled". That's usually fine; bank transfers work the same way.
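The core saga loop is small enough to show in full. This is a minimal sketch (not a production saga manager — no persistence, no retries): each step is paired with its compensation, and on failure the completed steps are undone in reverse order.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure, compensate in reverse.

    `steps` is a list of (action, compensation) callables. In a real system
    both would be idempotent and the progress would be persisted.
    """
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):
                comp()  # undo completed steps, newest first
            raise

# Demo: payment fails, so the inventory reservation gets released.
trace = []

def charge_payment():
    raise RuntimeError("card declined")

order_saga = [
    (lambda: trace.append("reserve"), lambda: trace.append("release")),
    (charge_payment,                  lambda: trace.append("refund")),
]
```

Running `run_saga(order_saga)` raises, and `trace` ends up as `["reserve", "release"]` — the failed step's own compensation never runs, only those of the steps that completed.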

The transactional outbox pattern

The most important pattern for reliable event publishing.

The naïve approach has a fatal flaw:

def place_order(order):
    save_order_to_db(order)        # writes to DB
    publish_event(OrderPlaced(...)) # writes to Kafka

What if the DB write succeeds but the publish fails? Or vice versa? You now have a state where the order exists but no one downstream knows, or "OrderPlaced" was published for an order that doesn't exist.

The outbox pattern fixes this:

def place_order(order):
    with transaction:
        save_order_to_db(order)
        save_event_to_outbox(OrderPlaced(...))  # same transaction

A separate worker reads the outbox table and publishes to Kafka. If publishing fails, it retries. The DB write and the "intent to publish" are atomic; actual publishing is best-effort with retries.

Every serious event-driven system needs something like this. Without it, you'll have inconsistencies you can't explain.
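Here is the pattern end to end, sketched with SQLite standing in for your database and a plain callback standing in for the Kafka producer (table and column names are made up). The key property: the business row and the outbox row commit in one transaction, and the relay worker retries publishing until it succeeds.

```python
import json
import sqlite3

def place_order(conn, order):
    """Business write and outbox write commit atomically, or not at all."""
    with conn:  # one transaction: both rows or neither
        conn.execute("INSERT INTO orders (id) VALUES (?)", (order["id"],))
        conn.execute(
            "INSERT INTO outbox (event_type, payload, published) VALUES (?, ?, 0)",
            ("OrderPlaced", json.dumps(order)),
        )

def relay(conn, publish):
    """Separate worker: publish pending outbox rows, mark them on success."""
    rows = conn.execute(
        "SELECT rowid, event_type, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for rowid, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # e.g. a Kafka produce call
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE rowid = ?", (rowid,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE outbox (event_type TEXT, payload TEXT, published INTEGER)")
```

Note the delivery semantics: if the worker crashes after publishing but before marking the row, it will publish again on restart. The outbox gives you at-least-once, which is exactly why the next rule — idempotent consumers — is non-negotiable.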

Event schemas and evolution

Events are a public contract. Treat them like API responses.

  • Use a schema registry (Avro, Protobuf, JSON Schema). Publishers register; consumers validate.
  • Backwards-compatible changes only. Add optional fields; don't remove or rename. Breaking changes mean a new event type, not a new version of the old one.
  • Include a schema version in every event. event_type: "OrderPlaced", version: 2.
  • Plan for replay. Old consumers will see new events; new consumers will see old events. Handle both.

A drifted event schema is one of the hardest classes of bug to debug, because it can affect every downstream system silently.
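A consumer that survives evolution normalizes every version into one internal shape. A sketch, assuming a hypothetical v2 that added an optional `currency` field (field names invented):

```python
import json

def parse_order_placed(raw: str) -> dict:
    """Normalize OrderPlaced v1 and v2 into a single internal shape."""
    event = json.loads(raw)
    assert event["event_type"] == "OrderPlaced"
    version = event.get("version", 1)  # events from before versioning count as v1
    payload = event["payload"]
    return {
        "order_id": payload["order_id"],
        "amount": payload["amount"],
        # v1 predates the field; fall back to a default instead of crashing
        "currency": payload.get("currency", "USD") if version >= 2 else "USD",
    }
```

During a replay this handler sees years of v1 events and yesterday's v2 events in one stream, and processes both.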

Failure modes nobody warns you about

The pain you'll learn the hard way:

  • Out-of-order events. Two events arrive in the wrong order, your handler does the wrong thing. Either guarantee ordering (Kafka per partition) or write idempotent handlers that work in any order.
  • Duplicate events. At-least-once delivery is the default. Your handlers must be idempotent or you'll process the same event twice.
  • Slow consumers. A consumer that can't keep up means lag grows, eventually disk fills, eventually back-pressure breaks producers. Monitor consumer lag obsessively.
  • Schema drift. Producer adds a field, consumer crashes on parse. Or consumer expects a field that producer stopped sending. Schema registry + automated checks prevent most of this.
  • Event amplification. Service A's event triggers B, which emits an event triggering C, which emits one triggering A. Loops. Storms. Sleep deprivation.
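The standard defense against duplicates and reordering is deduplication by event id. A minimal sketch (in production the "processed" set would be a database table with a unique constraint, checked in the same transaction as the state change):

```python
class InventoryHandler:
    """Idempotent consumer: deduplicate by event id, so at-least-once is safe."""

    def __init__(self):
        self.stock = {"sku-1": 10}
        self._processed = set()  # in production: a DB table + unique constraint

    def on_order_placed(self, event):
        if event["event_id"] in self._processed:
            return  # duplicate delivery: already applied, do nothing
        self.stock[event["sku"]] -= event["qty"]
        self._processed.add(event["event_id"])
```

Delivering the same event twice now decrements stock exactly once — which is the whole point.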

Three rules

  1. Make handlers idempotent. Always. Without exception. Events will be retried; you must process the same event twice safely.
  2. Use the outbox pattern for any event tied to a database write. Atomicity between business state and event publication is non-negotiable.
  3. Design for replay. Sooner or later you'll need to replay a topic from the start (new consumer, lost data, bug in old handler). If your handlers can't safely process the entire history, you've painted yourself into a corner.

Event-driven systems usually live on top of Kafka or a similar broker. Microservices observability covers how you actually debug them in production. The chat system HLD writeup shows event-driven design under real-time fanout requirements, and eventual consistency is the consistency model you live in once the bus is the spine of your system.

Frequently asked questions

Are events the same as messages?

Events are a kind of message. Specifically, they describe a fact that already happened ('OrderPlaced'). Commands are messages that ask someone to do something ('PlaceOrder'). The distinction matters — it changes who owns the contract.

Should every microservice be event-driven?

No. Use events for fan-out and decoupling; use direct request/response (REST/gRPC) for synchronous user-facing flows. Most production systems mix both.

Is Kafka the only option?

No. RabbitMQ, NATS, AWS SNS+SQS, Google Pub/Sub, Redis Streams all work. Kafka shines for high-throughput, replay-able event streams; the others are often simpler for ordinary fan-out workloads.
