Tech · 6 min read
Circuit Breaker Pattern: How to Prevent Cascading Failures
When downstream services fail, naive retries amplify the problem. Circuit breakers detect failures and protect upstream systems. Implementation details and gotchas.
By Jarviix Engineering · Apr 19, 2026
When a downstream service starts to fail, your first instinct as an engineer is "let's retry." This is exactly the wrong response. Naive retries amplify load on a struggling service, often pushing transient failures into systemic outages. Circuit breakers solve this by detecting failure patterns and stopping requests to a failing dependency until it recovers.
This post covers what circuit breakers do, the standard state machine, configuration parameters, and the implementation gotchas that determine whether your circuit breaker helps or hurts.
The cascading failure problem
In microservice architectures, services depend on services that depend on still more services. A single failure deep in the dependency tree can cascade upward catastrophically:
- Service C starts failing or slowing down (DB issues, memory pressure, cold start)
- Service B continues calling C; B's threads get stuck waiting for slow C responses
- B's thread pool fills up; B can't serve other requests
- Service A calls B; A's threads get stuck waiting for B
- A's thread pool fills up; users see timeouts everywhere
The original problem (small issue in C) becomes a system-wide outage in minutes. Worse, every service in the chain consumes resources continuing to call broken services, accelerating the collapse.
The retry pattern makes this worse: when C fails, B retries — sending more load to already-struggling C. This is precisely how brief incidents become major outages.
Circuit breakers break this cycle.
How circuit breakers work
A circuit breaker tracks the success/failure rate of calls to a downstream service. When failures exceed a threshold, the breaker "opens" and immediately rejects subsequent calls without attempting them. After a cooldown period, the breaker tentatively allows requests through to test if the service has recovered.
Three states:
Closed (normal)
- All requests pass through to downstream
- Track success/failure rate
- If failure rate exceeds threshold, transition to Open
Open (failing)
- All requests immediately rejected without calling downstream
- Saves downstream from further load
- Saves upstream from blocking on slow calls
- After cooldown period, transition to Half-Open
Half-Open (testing recovery)
- Allow a limited number of trial requests through
- If they succeed, transition to Closed (recovery)
- If they fail, transition back to Open (still broken)
This state machine is the core of every circuit breaker implementation.
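To make the state machine concrete, here is a minimal, illustrative Java sketch (the class and method names are mine, not from any library). For brevity it opens on consecutive failures and allows a single half-open trial call; a production breaker would use a sliding failure-rate window and be thread-safe:

```java
import java.util.concurrent.Callable;

// Minimal circuit breaker: consecutive-failure threshold, fixed cooldown.
// Not thread-safe; production code needs atomic state and a sliding window.
public class CircuitBreakerSketch {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before opening
    private final long cooldownMillis;    // how long to stay open
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreakerSketch(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    public <T> T call(Callable<T> task) throws Exception {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= cooldownMillis) {
                state = State.HALF_OPEN;   // cooldown elapsed: allow a trial call
            } else {
                throw new IllegalStateException("circuit open, call rejected");
            }
        }
        try {
            T result = task.call();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            throw e;
        }
    }

    private void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;              // trial succeeded, or normal call
    }

    private void onFailure() {
        if (state == State.HALF_OPEN) {
            open();                        // trial failed: back to OPEN
        } else if (++consecutiveFailures >= failureThreshold) {
            open();
        }
    }

    private void open() {
        state = State.OPEN;
        openedAt = System.currentTimeMillis();
        consecutiveFailures = 0;
    }

    public State state() { return state; }
}
```

Note the ordering: the open-state check happens before the downstream call is even attempted, which is what shields both sides from further load.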
Configuration parameters
Failure threshold
The percentage of failures that triggers the breaker. Common: open at 50% failures, measured over a window of at least 20 requests.
Too sensitive: opens on transient noise. Too lax: cascading failure progresses before opening.
Time window
How long to measure failures. Common: 10-60 seconds.
Sliding window vs. tumbling window: sliding is more responsive but more complex.
Minimum throughput
Don't open the breaker until at least N requests have been observed. Prevents opening on a single failure when traffic is low.
Cooldown duration
How long to stay open before testing recovery. Common: 30-60 seconds.
Too short: keeps hammering broken service. Too long: slow recovery after issues resolve.
Half-open trial requests
How many test requests to allow in half-open state. Common: 1-5.
Failure definition
What counts as a failure?
- HTTP errors (4xx? 5xx? both?)
- Timeouts (essential)
- Connection errors
- Specific exception types
Most systems count timeouts and 5xx errors as failures and treat 4xx responses as application-level errors that don't indicate downstream health issues.
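That convention can be expressed as a small classifier. A sketch (the class name and the exact exception types are illustrative choices, not a standard):

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;

// Decides whether a call outcome counts against the circuit breaker.
// 5xx and transport errors signal downstream health problems; 4xx means
// the downstream is healthy but rejected this particular request.
public final class FailureClassifier {

    public static boolean isFailure(int httpStatus) {
        return httpStatus >= 500;   // 5xx: downstream is unhealthy
        // 4xx deliberately NOT counted: bad request, auth failure, etc.
    }

    public static boolean isFailure(Throwable t) {
        return t instanceof SocketTimeoutException   // timeouts are essential to count
            || t instanceof ConnectException;        // connection refused / unreachable
    }
}
```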
Implementation patterns
Per-dependency breakers
Each downstream service has its own breaker. If service C is failing, the breaker for C opens while the breakers for D and E remain closed.
This is the standard pattern. Don't share breakers across dependencies.
Per-method breakers
Sometimes useful: one breaker per (service, method) pair. The /health endpoint may be working while /process-payment is failing.
Adds complexity; use only when behavioral differences justify it.
Bulkheads
Related pattern: limit concurrent calls to each downstream. Even if the breaker is closed, one slow dependency shouldn't be allowed to consume your entire thread pool.
Common implementation: dedicated thread pool per dependency.
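A lighter-weight alternative to a dedicated thread pool is a semaphore bulkhead, which caps in-flight calls without extra threads. A minimal sketch (class and method names are mine):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Semaphore bulkhead: at most maxConcurrent in-flight calls per dependency.
// Excess calls fail fast instead of queuing and tying up caller threads.
public class Bulkhead {
    private final Semaphore permits;

    public Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public <T> T call(Callable<T> task) throws Exception {
        if (!permits.tryAcquire()) {   // no waiting: reject immediately
            throw new IllegalStateException("bulkhead full, call rejected");
        }
        try {
            return task.call();
        } finally {
            permits.release();
        }
    }
}
```

One `Bulkhead` instance per dependency, so a slow service C can only ever occupy its own permit budget, never the threads that serve calls to D and E.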
Service mesh integration
Modern service meshes (Istio, Linkerd) implement circuit breakers as sidecar proxies — no application code needed. Configuration via mesh policies.
Common implementations
Java
- Resilience4j (modern, lightweight)
- Hystrix (Netflix; no longer actively developed, but still common in legacy systems)
- Spring Cloud Circuit Breaker (abstraction over implementations)
Python
- PyBreaker
- Tenacity (with custom logic)
Go
- Gobreaker (Sony)
- Hystrix-go (port of Netflix Hystrix)
.NET
- Polly
Service Mesh
- Istio (Envoy-based)
- Linkerd
- Consul Connect
For most teams: pick a library that fits your stack and use defaults to start. Tune later based on observed behavior.
Fallback strategies
When a breaker is open, what should you return?
Cached values
Return last-known-good value. Useful for read-heavy workloads.
Default values
Return a sensible default (e.g., empty list, default config). Caller continues with degraded but functional behavior.
Static fallback
Return a hardcoded "service unavailable" message.
Graceful degradation
Disable the feature that needs the failing dependency; rest of the application works.
Fail fast
Return a 503 immediately. Better than a slow timeout; gives the caller control.
The right fallback depends on the use case. Read-only display data: cache. Critical writes: fail fast. Optional features: graceful degradation.
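The last-known-good cache fallback can be sketched as a small wrapper (a hypothetical helper of mine, not a library API): successful reads refresh the cache, and when the protected call fails or is rejected by an open breaker, the stale value is served instead:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Fallback-to-cache wrapper: serve the last successful result when the
// protected call is rejected or fails. Suited to read-heavy display data.
public class CachedFallback<K, V> {
    private final ConcurrentMap<K, V> lastGood = new ConcurrentHashMap<>();

    public V fetch(K key, Callable<V> remoteCall) throws Exception {
        try {
            V fresh = remoteCall.call();   // normal path (breaker closed)
            lastGood.put(key, fresh);      // refresh last-known-good
            return fresh;
        } catch (Exception e) {
            V stale = lastGood.get(key);   // breaker open or call failed
            if (stale != null) {
                return stale;              // degraded but functional
            }
            throw e;                       // no cached value: surface the error
        }
    }
}
```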
Monitoring circuit breakers
Critical metrics:
- Breaker state changes (closed → open transitions)
- Open duration per breaker
- Rejection count (calls blocked while open)
- Half-open trial outcomes
- Failure rate on each protected dependency
Alert on:
- Any breaker open for >5 minutes (sustained downstream issue)
- High rate of breaker state oscillation (configuration may be wrong)
- Multiple breakers open simultaneously (potential cascading issue)
When circuit breakers don't help
- In-process calls: no cascading failure risk; circuit breakers add overhead
- Asynchronous fire-and-forget: no caller blocking, less benefit
- Operations that must succeed: critical writes can't gracefully fall back
- Idempotent retry-safe operations on transient failures: retries with backoff may suffice
Circuit breakers solve a specific problem (cascading failure from sustained downstream issues), not all reliability problems.
Common mistakes
- No circuit breaker at all: cascading failure on first major incident
- Single breaker for all dependencies: one bad dep takes down all calls
- Threshold too sensitive: breaker oscillates open/closed under normal noise
- Cooldown too short: hammers broken downstream during recovery
- No fallback strategy: open breaker just returns errors with no degradation
- No monitoring: breakers open without anyone noticing
- Counting 4xx as failures: opens breaker on legitimate client errors
- Not testing breaker behavior: never verifying it actually opens during incidents
Practical example
Resilience4j Java configuration:
import java.time.Duration;
import java.util.function.Supplier;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.vavr.control.Try;   // Try comes from the Vavr library

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .slidingWindowSize(20)                            // measure the last 20 calls
    .failureRateThreshold(50)                         // open at >= 50% failures
    .minimumNumberOfCalls(10)                         // don't judge before 10 calls
    .waitDurationInOpenState(Duration.ofSeconds(30))  // cooldown before half-open
    .permittedNumberOfCallsInHalfOpenState(3)         // trial calls while half-open
    .build();

CircuitBreaker paymentBreaker = CircuitBreaker.of("payment-service", config);

Supplier<PaymentResponse> guarded = CircuitBreaker
    .decorateSupplier(paymentBreaker, () -> paymentClient.charge(request));

PaymentResponse response = Try.ofSupplier(guarded)
    .recover(throwable -> getFallbackResponse())
    .get();
This wraps payment calls: once at least 10 calls have been recorded and 50% of the sliding window of 20 fail, the breaker opens for 30 seconds, then allows 3 trial calls before resuming normal traffic.
What to read next
- Microservices observability — instrument everything.
- Idempotency in APIs — safe retries.
- Distributed locks — coordination patterns.
- Microservices vs monolith — when this pattern matters.
Circuit breakers are one of the highest-leverage resilience patterns in distributed systems. The cost of implementation is low — usually just a library and configuration. The cost of NOT having them is occasionally catastrophic — full system outages from issues that should have been contained. If you operate a non-trivial microservice architecture, every external call should go through a circuit breaker.
Frequently asked questions
When should I use a circuit breaker?
Any time you call a remote service that could fail or slow down. The textbook case is microservice-to-microservice calls, but it applies equally to database connections, third-party APIs (payment gateways, SMS providers), cache servers, and external HTTP services. The general rule: if a slow or failing dependency could cascade through your system, wrap calls to it in a circuit breaker. Skip it only for in-process calls or operations that can't fail meaningfully.
What's the difference between a circuit breaker and a retry?
Different problems. Retries help with transient failures (network blips, brief overloads). Circuit breakers help with systemic failures (downstream is fundamentally broken). They work together: retry transient errors, but stop retrying when the breaker opens. Without a breaker, retries amplify the load on a struggling service, often making things worse. Best practice: retry with exponential backoff, wrapped inside a circuit breaker.
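That composition can be sketched as a retry loop that consults the breaker before every attempt (the `breakerAllows` check stands in for whichever breaker you use; the class and parameter names are mine):

```java
import java.util.concurrent.Callable;
import java.util.function.BooleanSupplier;

// Retry with exponential backoff, wrapped inside a circuit breaker check:
// stop retrying the moment the breaker opens, instead of adding load.
public final class RetryInsideBreaker {

    public static <T> T call(BooleanSupplier breakerAllows, Callable<T> task,
                             int maxAttempts, long baseDelayMillis) throws Exception {
        Exception last = null;
        long delay = baseDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!breakerAllows.getAsBoolean()) {
                throw new IllegalStateException("circuit open, retries abandoned");
            }
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);   // back off before the next attempt
                    delay *= 2;            // exponential backoff
                }
            }
        }
        throw last;
    }
}
```

The key property: when the breaker is open, zero attempts reach the downstream, so the retry logic can be generous about transient errors without any risk of amplifying a systemic failure.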
Should every microservice call have a circuit breaker?
Yes, in production microservice architectures. Service meshes like Istio implement circuit breakers automatically per service, requiring no application code changes. If you don't have a service mesh, libraries like Resilience4j (Java), Polly (.NET), or PyBreaker (Python) provide the pattern. The marginal cost of adding one is low; the cost of cascading failure during production incidents is enormous.