Jarviix

Tech · 6 min read

Circuit Breaker Pattern: How to Prevent Cascading Failures

When downstream services fail, naive retries amplify the problem. Circuit breakers detect failures and protect upstream systems. Implementation details and gotchas.

By Jarviix Engineering · Apr 19, 2026


When a downstream service starts to fail, your first instinct as an engineer is "let's retry." This is exactly the wrong response. Naive retries amplify load on a struggling service, often pushing transient failures into systemic outages. Circuit breakers solve this by detecting failure patterns and stopping requests to a failing dependency until it recovers.

This post covers what circuit breakers do, the standard state machine, configuration parameters, and the implementation gotchas that determine whether your circuit breaker helps or hurts.

The cascading failure problem

In microservice architectures, services depend on services depend on services. A single failure deep in the dependency tree can cascade upward catastrophically:

  1. Service C starts failing or slowing down (DB issues, memory pressure, cold start)
  2. Service B continues calling C; B's threads get stuck waiting for slow C responses
  3. B's thread pool fills up; B can't serve other requests
  4. Service A calls B; A's threads get stuck waiting for B
  5. A's thread pool fills up; users see timeouts everywhere

The original problem (small issue in C) becomes a system-wide outage in minutes. Worse, every service in the chain consumes resources continuing to call broken services, accelerating the collapse.

The retry pattern makes this worse: when C fails, B retries — sending more load to already-struggling C. This is precisely how brief incidents become major outages.

Circuit breakers break this cycle.

How circuit breakers work

A circuit breaker tracks the success/failure rate of calls to a downstream service. When failures exceed a threshold, the breaker "opens" and immediately rejects subsequent calls without attempting them. After a cooldown period, the breaker tentatively allows requests through to test if the service has recovered.

Three states:

Closed (normal)

  • All requests pass through to downstream
  • Track success/failure rate
  • If failure rate exceeds threshold, transition to Open

Open (failing)

  • All requests immediately rejected without calling downstream
  • Saves downstream from further load
  • Saves upstream from blocking on slow calls
  • After cooldown period, transition to Half-Open

Half-Open (testing recovery)

  • Allow a limited number of trial requests through
  • If they succeed, transition to Closed (recovery)
  • If they fail, transition back to Open (still broken)

This state machine is the core of every circuit breaker implementation.
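The state machine above can be sketched in a few dozen lines. This is a minimal illustration, not a production implementation: it uses plain success/failure counters rather than a sliding window, ignores thread safety, and all names are illustrative.

```python
import time


class CircuitBreaker:
    """Minimal three-state circuit breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=0.5, min_calls=10,
                 cooldown_seconds=30.0, half_open_trials=3,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls
        self.cooldown_seconds = cooldown_seconds
        self.half_open_trials = half_open_trials
        self.clock = clock            # injectable for testing
        self.state = "closed"
        self.successes = 0
        self.failures = 0
        self.opened_at = None
        self.trial_results = []

    def allow_request(self):
        if self.state == "open":
            # After the cooldown, tentatively let trial requests through.
            if self.clock() - self.opened_at >= self.cooldown_seconds:
                self.state = "half_open"
                self.trial_results = []
                return True
            return False
        return True

    def record(self, success):
        if self.state == "half_open":
            self.trial_results.append(success)
            if not success:
                self._open()          # still broken: back to open
            elif len(self.trial_results) >= self.half_open_trials:
                self._close()         # recovered: back to normal
            return
        if success:
            self.successes += 1
        else:
            self.failures += 1
        total = self.successes + self.failures
        if total >= self.min_calls and self.failures / total >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()

    def _close(self):
        self.state = "closed"
        self.successes = self.failures = 0
```

Injecting the clock makes the cooldown testable without real sleeps; a real implementation would also need locking and a windowed failure count.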

Configuration parameters

Failure threshold

The percentage of failures that triggers the breaker. A common default: 50% failures measured over a window of at least 20 requests.

Too sensitive: opens on transient noise. Too lax: cascading failure progresses before opening.

Time window

How long to measure failures. Common: 10-60 seconds.

Sliding window vs. tumbling window: sliding is more responsive but more complex.

Minimum throughput

Don't open the breaker until at least N requests have been observed. Prevents opening on a single failure when traffic is low.

Cooldown duration

How long to stay open before testing recovery. Common: 30-60 seconds.

Too short: keeps hammering broken service. Too long: slow recovery after issues resolve.

Half-open trial requests

How many test requests to allow in half-open state. Common: 1-5.

Failure definition

What counts as a failure?

  • HTTP errors (4xx? 5xx? both?)
  • Timeouts (essential)
  • Connection errors
  • Specific exception types

Most systems count timeouts and 5xx errors; treat 4xx as application-level errors that don't indicate downstream health issues.
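That classification can be made explicit in a small predicate. A sketch (the function name and signature are illustrative):

```python
def counts_as_failure(status=None, exc=None):
    """Decide whether a call outcome should count against the breaker.

    Timeouts and connection errors count; 5xx responses count;
    4xx responses are client errors, not downstream health signals.
    """
    if exc is not None:
        return isinstance(exc, (TimeoutError, ConnectionError))
    if status is not None:
        return 500 <= status <= 599
    return False
```

Keeping this logic in one place makes it easy to audit exactly which outcomes can trip the breaker.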

Implementation patterns

Per-dependency breakers

Each downstream service has its own breaker. If service C is failing, breaker for C opens, but breakers for D, E remain closed.

This is the standard pattern. Don't share breakers across dependencies.
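One common way to keep breakers isolated is a small registry keyed by dependency name, so a failing service C only trips C's breaker. A sketch, with a hypothetical factory callable producing whatever breaker type you use:

```python
class BreakerRegistry:
    """One breaker per downstream dependency, created lazily."""

    def __init__(self, factory):
        self._factory = factory       # callable producing a fresh breaker
        self._breakers = {}

    def get(self, dependency):
        # Each dependency name maps to its own independent breaker.
        if dependency not in self._breakers:
            self._breakers[dependency] = self._factory()
        return self._breakers[dependency]
```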

Per-method breakers

Sometimes useful: one breaker per (service, method) pair. The /health endpoint may be working while /process-payment is failing.

Adds complexity; use only when behavioral differences justify it.

Bulkheads

Related pattern: limit concurrent calls to each downstream. Even if breaker is closed, you don't let one slow dependency consume your entire thread pool.

Common implementation: dedicated thread pool per dependency.
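In languages without cheap dedicated thread pools, a semaphore achieves the same cap. A minimal sketch (names illustrative):

```python
import threading


class Bulkhead:
    """Cap concurrent in-flight calls to one dependency."""

    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast instead of queueing when the dependency is saturated.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._sem.release()
```

The non-blocking acquire is the key design choice: a saturated dependency produces an immediate error rather than tying up yet another caller thread.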

Service mesh integration

Modern service meshes (Istio, Linkerd) implement circuit breakers as sidecar proxies — no application code needed. Configuration via mesh policies.

Common implementations

Java

  • Resilience4j (modern, lightweight)
  • Hystrix (Netflix, deprecated but widely used in legacy systems)
  • Spring Cloud Circuit Breaker (abstraction over implementations)

Python

  • PyBreaker
  • Tenacity (with custom logic)

Go

  • Gobreaker (Sony)
  • Hystrix-go (port of Netflix Hystrix)

.NET

  • Polly

Service Mesh

  • Istio (envoy-based)
  • Linkerd
  • Consul Connect

For most teams: pick a library that fits your stack and use defaults to start. Tune later based on observed behavior.

Fallback strategies

When a breaker is open, what should you return?

Cached values

Return last-known-good value. Useful for read-heavy workloads.

Default values

Return a sensible default (e.g., empty list, default config). Caller continues with degraded but functional behavior.

Static fallback

Return a hardcoded "service unavailable" message.

Graceful degradation

Disable the feature that needs the failing dependency; rest of the application works.

Fail fast

Return a 503 immediately. Better than a slow timeout, and it gives the caller control over what to do next.

The right fallback depends on the use case. Read-only display data: cache. Critical writes: fail fast. Optional features: graceful degradation.
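The cached-value strategy can be sketched as a small wrapper. This assumes a hypothetical breaker object exposing allow_request() and record(success); the helper name and cache shape are illustrative:

```python
def call_with_fallback(breaker, fn, cache, key, default=None):
    """Use the live call when the breaker allows it; otherwise fall
    back to the last-known-good cached value, then to a default."""
    if breaker.allow_request():
        try:
            result = fn()
            breaker.record(True)
            cache[key] = result       # remember last-known-good
            return result
        except Exception:
            breaker.record(False)
    return cache.get(key, default)
```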

Monitoring circuit breakers

Critical metrics:

  • Breaker state changes (closed → open transitions)
  • Open duration per breaker
  • Rejection count (calls blocked while open)
  • Half-open trial outcomes
  • Failure rate on each protected dependency

Alert on:

  • Any breaker open for >5 minutes (sustained downstream issue)
  • High rate of breaker state oscillation (configuration may be wrong)
  • Multiple breakers open simultaneously (potential cascading issue)
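Most libraries let you hook state transitions; the hook only needs to do two things, count and log. A sketch with an illustrative metrics dict standing in for a real metrics client:

```python
import logging


def on_state_change(service, old_state, new_state, metrics, logger=None):
    """Record a breaker transition so closed -> open events can be
    counted, graphed, and alerted on."""
    key = f"breaker.{service}.{old_state}_to_{new_state}"
    metrics[key] = metrics.get(key, 0) + 1
    (logger or logging.getLogger("breakers")).warning(
        "breaker %s: %s -> %s", service, old_state, new_state)
```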

When circuit breakers don't help

  • In-process calls: no cascading failure risk; circuit breakers add overhead
  • Asynchronous fire-and-forget: no caller blocking, less benefit
  • Operations that must succeed: critical writes can't gracefully fall back
  • Idempotent retry-safe operations on transient failures: retries with backoff may suffice

Circuit breakers solve a specific problem (cascading failure from sustained downstream issues), not all reliability problems.

Common mistakes

  • No circuit breaker at all: cascading failure on first major incident
  • Single breaker for all dependencies: one bad dep takes down all calls
  • Threshold too sensitive: breaker oscillates open/closed under normal noise
  • Cooldown too short: hammers broken downstream during recovery
  • No fallback strategy: open breaker just returns errors with no degradation
  • No monitoring: breakers open without anyone noticing
  • Counting 4xx as failures: opens breaker on legitimate client errors
  • Not testing breaker behavior: never verifying it actually opens during incidents

Practical example

Resilience4j Java configuration:

import java.time.Duration;
import java.util.function.Supplier;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.vavr.control.Try;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .slidingWindowSize(20)                            // evaluate the last 20 calls
    .failureRateThreshold(50)                         // open at 50% failures
    .minimumNumberOfCalls(10)                         // only after 10 calls observed
    .waitDurationInOpenState(Duration.ofSeconds(30))  // cooldown before half-open
    .permittedNumberOfCallsInHalfOpenState(3)         // trial calls while half-open
    .build();

CircuitBreaker paymentBreaker = CircuitBreaker.of("payment-service", config);

Supplier<PaymentResponse> guarded = CircuitBreaker
    .decorateSupplier(paymentBreaker, () -> paymentClient.charge(request));

// Try comes from the Vavr library, which Resilience4j pairs with.
PaymentResponse response = Try.ofSupplier(guarded)
    .recover(throwable -> getFallbackResponse())
    .get();

This wraps payment calls: once at least 10 calls have been observed, if 50% of the last 20 calls fail, the breaker opens for 30 seconds, then allows 3 trial calls before fully closing again.

Circuit breakers are one of the highest-leverage resilience patterns in distributed systems. The cost of implementation is low — usually just a library and configuration. The cost of NOT having them is occasionally catastrophic — full system outages from issues that should have been contained. If you operate a non-trivial microservice architecture, every external call should go through a circuit breaker.

Frequently asked questions

When should I use a circuit breaker?

Any time you call a remote service that could fail or slow down. The textbook case is microservice-to-microservice calls, but it applies equally to database connections, third-party APIs (payment gateways, SMS providers), cache servers, and external HTTP services. The general rule: if a slow or failing dependency could cascade through your system, wrap calls to it in a circuit breaker. Skip it only for in-process calls or operations that can't fail meaningfully.

What's the difference between a circuit breaker and a retry?

Different problems. Retries help with transient failures (network blips, brief overloads). Circuit breakers help with systemic failures (downstream is fundamentally broken). They work together: retry transient errors, but stop retrying when the breaker opens. Without a breaker, retries amplify the load on a struggling service, often making things worse. Best practice: retry with exponential backoff, wrapped inside a circuit breaker.
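That composition, exponential backoff wrapped inside a breaker, can be sketched as follows. The breaker is a hypothetical object with allow_request() and record(success); the sleep function is injectable so the backoff is testable:

```python
import time


def call_with_retry(breaker, fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry transient failures with exponential backoff, but stop
    immediately when the breaker is open."""
    last_exc = None
    for attempt in range(attempts):
        if not breaker.allow_request():
            raise RuntimeError("circuit open: not calling downstream")
        try:
            result = fn()
            breaker.record(True)
            return result
        except Exception as exc:
            breaker.record(False)
            last_exc = exc
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
    raise last_exc
```

Checking the breaker before every attempt is the point: a systemic outage is detected mid-retry-loop and the loop stops amplifying load.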

Should every microservice call have a circuit breaker?

Yes, in production microservice architectures. Service meshes like Istio implement circuit breakers automatically per service, requiring no application code changes. If you don't have a service mesh, libraries like Resilience4j (Java), Polly (.NET), or PyBreaker (Python) provide the pattern. The marginal cost of adding one is low; the cost of cascading failure during production incidents is enormous.
