Jarviix

Tech · 6 min read

Circuit Breaker Pattern: How to Prevent Cascading Failures

When downstream services fail, naive retries amplify the problem. Circuit breakers detect failures and protect upstream systems. Implementation details and gotchas.

By Jarviix Engineering · Apr 19, 2026


When a downstream service starts to fail, your first instinct as an engineer is "let's retry." This is exactly the wrong response. Naive retries amplify load on a struggling service, often pushing transient failures into systemic outages. Circuit breakers solve this by detecting failure patterns and stopping requests to a failing dependency until it recovers.

This post covers what circuit breakers do, the standard state machine, configuration parameters, and the implementation gotchas that determine whether your circuit breaker helps or hurts.

The cascading failure problem

In microservice architectures, services depend on services depend on services. A single failure deep in the dependency tree can cascade upward catastrophically:

  1. Service C starts failing or slowing down (DB issues, memory pressure, cold start)
  2. Service B continues calling C; B's threads get stuck waiting for slow C responses
  3. B's thread pool fills up; B can't serve other requests
  4. Service A calls B; A's threads get stuck waiting for B
  5. A's thread pool fills up; users see timeouts everywhere

The original problem (small issue in C) becomes a system-wide outage in minutes. Worse, every service in the chain consumes resources continuing to call broken services, accelerating the collapse.

The retry pattern makes this worse: when C fails, B retries — sending more load to already-struggling C. This is precisely how brief incidents become major outages.

Circuit breakers break this cycle.

How circuit breakers work

A circuit breaker tracks the success/failure rate of calls to a downstream service. When failures exceed a threshold, the breaker "opens" and immediately rejects subsequent calls without attempting them. After a cooldown period, the breaker tentatively allows requests through to test if the service has recovered.

Three states:

Closed (normal)

  • All requests pass through to downstream
  • Track success/failure rate
  • If failure rate exceeds threshold, transition to Open

Open (failing)

  • All requests immediately rejected without calling downstream
  • Saves downstream from further load
  • Saves upstream from blocking on slow calls
  • After cooldown period, transition to Half-Open

Half-Open (testing recovery)

  • Allow a limited number of trial requests through
  • If they succeed, transition to Closed (recovery)
  • If they fail, transition back to Open (still broken)

This state machine is the core of every circuit breaker implementation.
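The state machine above can be sketched in a few dozen lines. This is a minimal illustration, not a production implementation: it uses plain success/failure counters rather than a sliding window, ignores thread safety, and all names are illustrative.

```python
import time


class CircuitBreaker:
    """Minimal three-state circuit breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=0.5, min_calls=10,
                 cooldown_seconds=30.0, half_open_trials=3,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls
        self.cooldown_seconds = cooldown_seconds
        self.half_open_trials = half_open_trials
        self.clock = clock            # injectable for testing
        self.state = "closed"
        self.successes = 0
        self.failures = 0
        self.opened_at = None
        self.trial_results = []

    def allow_request(self):
        if self.state == "open":
            # After the cooldown, tentatively let trial requests through.
            if self.clock() - self.opened_at >= self.cooldown_seconds:
                self.state = "half_open"
                self.trial_results = []
                return True
            return False
        return True

    def record(self, success):
        if self.state == "half_open":
            self.trial_results.append(success)
            if not success:
                self._open()          # still broken: back to open
            elif len(self.trial_results) >= self.half_open_trials:
                self._close()         # recovered: back to normal
            return
        if success:
            self.successes += 1
        else:
            self.failures += 1
        total = self.successes + self.failures
        if total >= self.min_calls and self.failures / total >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()

    def _close(self):
        self.state = "closed"
        self.successes = self.failures = 0
```

Injecting the clock makes the cooldown testable without real sleeps; a real implementation would also need locking and a windowed failure count.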

Configuration parameters

Failure threshold

The percentage of failures that triggers the breaker. A common default: 50% failures measured over a window of at least 20 requests.

Too sensitive: opens on transient noise. Too lax: cascading failure progresses before opening.

Time window

How long to measure failures. Common: 10-60 seconds.

Sliding window vs. tumbling window: sliding is more responsive but more complex.

Minimum throughput

Don't open the breaker until at least N requests have been observed. Prevents opening on a single failure when traffic is low.

Cooldown duration

How long to stay open before testing recovery. Common: 30-60 seconds.

Too short: keeps hammering broken service. Too long: slow recovery after issues resolve.

Half-open trial requests

How many test requests to allow in half-open state. Common: 1-5.

Failure definition

What counts as a failure?

  • HTTP errors (4xx? 5xx? both?)
  • Timeouts (essential)
  • Connection errors
  • Specific exception types

Most systems count timeouts and 5xx errors; treat 4xx as application-level errors that don't indicate downstream health issues.
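That classification can be made explicit in a small predicate. A sketch (the function name and signature are illustrative):

```python
def counts_as_failure(status=None, exc=None):
    """Decide whether a call outcome should count against the breaker.

    Timeouts and connection errors count; 5xx responses count;
    4xx responses are client errors, not downstream health signals.
    """
    if exc is not None:
        return isinstance(exc, (TimeoutError, ConnectionError))
    if status is not None:
        return 500 <= status <= 599
    return False
```

Keeping this logic in one place makes it easy to audit exactly which outcomes can trip the breaker.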

Implementation patterns

Per-dependency breakers

Each downstream service has its own breaker. If service C is failing, breaker for C opens, but breakers for D, E remain closed.

This is the standard pattern. Don't share breakers across dependencies.
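One common way to keep breakers isolated is a small registry keyed by dependency name, so a failing service C only trips C's breaker. A sketch, with a hypothetical factory callable producing whatever breaker type you use:

```python
class BreakerRegistry:
    """One breaker per downstream dependency, created lazily."""

    def __init__(self, factory):
        self._factory = factory       # callable producing a fresh breaker
        self._breakers = {}

    def get(self, dependency):
        # Each dependency name maps to its own independent breaker.
        if dependency not in self._breakers:
            self._breakers[dependency] = self._factory()
        return self._breakers[dependency]
```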

Per-method breakers

Sometimes useful: one breaker per (service, method) pair. The /health endpoint may be working while /process-payment is failing.

Adds complexity; use only when behavioral differences justify it.

Bulkheads

Related pattern: limit concurrent calls to each downstream. Even if breaker is closed, you don't let one slow dependency consume your entire thread pool.

Common implementation: dedicated thread pool per dependency.
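In languages without cheap dedicated thread pools, a semaphore achieves the same cap. A minimal sketch (names illustrative):

```python
import threading


class Bulkhead:
    """Cap concurrent in-flight calls to one dependency."""

    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast instead of queueing when the dependency is saturated.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._sem.release()
```

The non-blocking acquire is the key design choice: a saturated dependency produces an immediate error rather than tying up yet another caller thread.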

Service mesh integration

Modern service meshes (Istio, Linkerd) implement circuit breakers as sidecar proxies — no application code needed. Configuration via mesh policies.

Common implementations

Java

  • Resilience4j (modern, lightweight)
  • Hystrix (Netflix, deprecated but widely used in legacy systems)
  • Spring Cloud Circuit Breaker (abstraction over implementations)

Python

  • PyBreaker
  • Tenacity (with custom logic)

Go

  • Gobreaker (Sony)
  • Hystrix-go (port of Netflix Hystrix)

.NET

  • Polly

Service Mesh

  • Istio (envoy-based)
  • Linkerd
  • Consul Connect

For most teams: pick a library that fits your stack and use defaults to start. Tune later based on observed behavior.

Fallback strategies

When a breaker is open, what should you return?

Cached values

Return last-known-good value. Useful for read-heavy workloads.

Default values

Return a sensible default (e.g., empty list, default config). Caller continues with degraded but functional behavior.

Static fallback

Return a hardcoded "service unavailable" message.

Graceful degradation

Disable the feature that needs the failing dependency; rest of the application works.

Fail fast

Return a 503 immediately. Better than a slow timeout, and it gives the caller control over what to do next.

The right fallback depends on the use case. Read-only display data: cache. Critical writes: fail fast. Optional features: graceful degradation.
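The cached-value strategy can be sketched as a small wrapper. This assumes a hypothetical breaker object exposing allow_request() and record(success); the helper name and cache shape are illustrative:

```python
def call_with_fallback(breaker, fn, cache, key, default=None):
    """Use the live call when the breaker allows it; otherwise fall
    back to the last-known-good cached value, then to a default."""
    if breaker.allow_request():
        try:
            result = fn()
            breaker.record(True)
            cache[key] = result       # remember last-known-good
            return result
        except Exception:
            breaker.record(False)
    return cache.get(key, default)
```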

Monitoring circuit breakers

Critical metrics:

  • Breaker state changes (closed → open transitions)
  • Open duration per breaker
  • Rejection count (calls blocked while open)
  • Half-open trial outcomes
  • Failure rate on each protected dependency

Alert on:

  • Any breaker open for >5 minutes (sustained downstream issue)
  • High rate of breaker state oscillation (configuration may be wrong)
  • Multiple breakers open simultaneously (potential cascading issue)
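Most libraries let you hook state transitions; the hook only needs to do two things, count and log. A sketch with an illustrative metrics dict standing in for a real metrics client:

```python
import logging


def on_state_change(service, old_state, new_state, metrics, logger=None):
    """Record a breaker transition so closed -> open events can be
    counted, graphed, and alerted on."""
    key = f"breaker.{service}.{old_state}_to_{new_state}"
    metrics[key] = metrics.get(key, 0) + 1
    (logger or logging.getLogger("breakers")).warning(
        "breaker %s: %s -> %s", service, old_state, new_state)
```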

When circuit breakers don't help

  • In-process calls: no cascading failure risk; circuit breakers add overhead
  • Asynchronous fire-and-forget: no caller blocking, less benefit
  • Operations that must succeed: critical writes can't gracefully fall back
  • Idempotent retry-safe operations on transient failures: retries with backoff may suffice

Circuit breakers solve a specific problem (cascading failure from sustained downstream issues), not all reliability problems.

Common mistakes

  • No circuit breaker at all: cascading failure on first major incident
  • Single breaker for all dependencies: one bad dep takes down all calls
  • Threshold too sensitive: breaker oscillates open/closed under normal noise
  • Cooldown too short: hammers broken downstream during recovery
  • No fallback strategy: open breaker just returns errors with no degradation
  • No monitoring: breakers open without anyone noticing
  • Counting 4xx as failures: opens breaker on legitimate client errors
  • Not testing breaker behavior: never verifying it actually opens during incidents

Practical example

Resilience4j Java configuration:

import java.time.Duration;
import java.util.function.Supplier;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.vavr.control.Try;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .slidingWindowSize(20)                            // evaluate the last 20 calls
    .failureRateThreshold(50)                         // open at 50% failures
    .minimumNumberOfCalls(10)                         // only after 10 calls observed
    .waitDurationInOpenState(Duration.ofSeconds(30))  // cooldown before half-open
    .permittedNumberOfCallsInHalfOpenState(3)         // trial calls while half-open
    .build();

CircuitBreaker paymentBreaker = CircuitBreaker.of("payment-service", config);

Supplier<PaymentResponse> guarded = CircuitBreaker
    .decorateSupplier(paymentBreaker, () -> paymentClient.charge(request));

// Try comes from the Vavr library, which Resilience4j pairs with.
PaymentResponse response = Try.ofSupplier(guarded)
    .recover(throwable -> getFallbackResponse())
    .get();

This wraps payment calls: once at least 10 calls have been observed, if 50% of the last 20 calls fail, the breaker opens for 30 seconds, then allows 3 trial calls before fully closing again.

Circuit breakers are one of the highest-leverage resilience patterns in distributed systems. The cost of implementation is low — usually just a library and configuration. The cost of NOT having them is occasionally catastrophic — full system outages from issues that should have been contained. If you operate a non-trivial microservice architecture, every external call should go through a circuit breaker.

Frequently asked questions

When should I use a circuit breaker?

Any time you call a remote service that could fail or slow down. The textbook case is microservice-to-microservice calls, but it applies equally to database connections, third-party APIs (payment gateways, SMS providers), cache servers, and external HTTP services. The general rule: if a slow or failing dependency could cascade through your system, wrap calls to it in a circuit breaker. Skip it only for in-process calls or operations that can't fail meaningfully.

What's the difference between a circuit breaker and a retry?

Different problems. Retries help with transient failures (network blips, brief overloads). Circuit breakers help with systemic failures (downstream is fundamentally broken). They work together: retry transient errors, but stop retrying when the breaker opens. Without a breaker, retries amplify the load on a struggling service, often making things worse. Best practice: retry with exponential backoff, wrapped inside a circuit breaker.
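That composition, exponential backoff wrapped inside a breaker, can be sketched as follows. The breaker is a hypothetical object with allow_request() and record(success); the sleep function is injectable so the backoff is testable:

```python
import time


def call_with_retry(breaker, fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry transient failures with exponential backoff, but stop
    immediately when the breaker is open."""
    last_exc = None
    for attempt in range(attempts):
        if not breaker.allow_request():
            raise RuntimeError("circuit open: not calling downstream")
        try:
            result = fn()
            breaker.record(True)
            return result
        except Exception as exc:
            breaker.record(False)
            last_exc = exc
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
    raise last_exc
```

Checking the breaker before every attempt is the point: a systemic outage is detected mid-retry-loop and the loop stops amplifying load.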

Should every microservice call have a circuit breaker?

Yes, in production microservice architectures. Service meshes like Istio implement circuit breakers automatically per service, requiring no application code changes. If you don't have a service mesh, libraries like Resilience4j (Java), Polly (.NET), or PyBreaker (Python) provide the pattern. The marginal cost of adding one is low; the cost of cascading failure during production incidents is enormous.
