Observability for Microservices: Logs, Metrics, and Traces That Actually Help
How to instrument a service so that when something breaks at 3am, you can find the cause in minutes — not hours. Logs, metrics, traces, and the patterns that tie them together.
By Jarviix Engineering · Apr 19, 2026
When a single user's checkout fails at 3am, observability is the difference between "fixed in 15 minutes" and "still investigating at noon". For monoliths it's straightforward — read the logs. For microservices, where one user request fans out across a dozen services, you need something better.
This post covers what to instrument, how the three pillars (logs, metrics, traces) fit together, and the patterns that actually help on-call engineers.
The three pillars, briefly
The standard framing:
- Metrics — numerical aggregates over time. "Requests per second", "p99 latency", "error rate". Cheap to store, fast to query, perfect for dashboards and alerts.
- Logs — discrete events with arbitrary detail. "User 42 hit the /checkout endpoint and we returned 500 because Stripe timed out." Expensive to store, rich in context.
- Traces — a representation of one request's journey across services. "User 42's request: API gateway → cart service → inventory service (took 2s) → payment service (timed out)."
You need all three. Each one answers a different question:
- Metrics: is something wrong? (alerts, dashboards)
- Logs: what exactly happened in this one case? (deep debugging)
- Traces: which service caused the problem? (distributed root cause)
Metrics: the foundation
Three metric types cover most needs:
Counters
Monotonically increasing values. "Total HTTP requests served". Query the rate of change over a window to get "requests per second".
http_requests_total.labels(method="GET", route="/users", status="200").inc()
Gauges
Point-in-time values. "Current open connections", "queue depth", "CPU temperature".
Histograms
Distributions of values. "Request duration in seconds". Lets you compute p50, p95, p99 percentiles.
request_duration_seconds.labels(route="/users").observe(0.234)
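To make the histogram idea concrete, here is a minimal pure-Python sketch of how a Prometheus-style bucketed histogram estimates percentiles — a real metrics client does this for you; the bucket bounds below are illustrative, not a recommendation:

```python
# Illustrative bucket upper bounds, in seconds.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, float("inf")]

class Histogram:
    def __init__(self, buckets=BUCKETS):
        self.buckets = buckets
        self.counts = [0] * len(buckets)  # observations per bucket
        self.total = 0

    def observe(self, value):
        # Count the observation into the first bucket whose bound covers it.
        for i, bound in enumerate(self.buckets):
            if value <= bound:
                self.counts[i] += 1
                break
        self.total += 1

    def percentile(self, q):
        # Walk buckets until the cumulative count crosses the target rank;
        # report that bucket's upper bound as the percentile estimate.
        target = q * self.total
        cumulative = 0
        for bound, count in zip(self.buckets, self.counts):
            cumulative += count
            if cumulative >= target:
                return bound
        return float("inf")

h = Histogram()
for duration in [0.02, 0.03, 0.04, 0.2, 0.9]:
    h.observe(duration)
print(h.percentile(0.5))  # median falls in the 0.05s bucket
```

The trade-off is visible here: a histogram stores a handful of counters instead of every observation, so percentiles are cheap but only as precise as the bucket bounds.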
The four golden signals
The minimum every service should expose, popularized by Google's SRE book:
- Latency. How long requests take, broken down by success/failure.
- Traffic. Requests per second.
- Errors. Rate of failed requests.
- Saturation. How "full" the service is — CPU, memory, queue depth.
Alert on these. Build dashboards on these. If a service has nothing else, it should at least have these four.
RED and USE methods
Two complementary sets:
- RED (for request-driven services): Rate, Errors, Duration.
- USE (for resources): Utilization, Saturation, Errors.
Combine them — RED for endpoints, USE for the underlying CPU/disk/network/queue.
Logs: rich detail, used sparingly
Logs are where you record the things metrics can't capture: error stack traces, request bodies (sanitized), business-context information.
Key practices:
Structured logs
Plain text logs are useless at scale. Use JSON.
{
"timestamp": "2026-04-19T03:14:22Z",
"level": "error",
"service": "checkout",
"trace_id": "5b7c9...",
"span_id": "a3f1...",
"user_id": "42",
"order_id": "ord_abc",
"message": "stripe charge failed",
"error": "TimeoutError: connection reset"
}
Now you can query: "show me all errors for user 42", "show me everything in trace 5b7c9", "show me all stripe timeouts in the last hour".
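You don't need a framework to get this shape — a sketch of a JSON formatter using only the standard library (structlog and python-json-logger are the usual off-the-shelf options; the service name and `ctx` field here are assumptions to match the example above):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "checkout",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Merge structured context passed via extra={"ctx": {...}}.
        entry.update(getattr(record, "ctx", {}))
        return json.dumps(entry)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("stripe charge failed",
             extra={"ctx": {"user_id": "42", "order_id": "ord_abc"}})
```

In a real service the trace_id and span_id would be added by the tracing SDK's logging integration rather than by hand.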
Log levels and discipline
- DEBUG: turned off in production by default. Local development noise.
- INFO: notable business events. "Order placed", "User signed up".
- WARN: something is unusual but not broken. Retries, fallbacks, deprecated API hits.
- ERROR: something broke. Always include a stack trace.
A common antipattern: logging every successful request at INFO. You drown in noise and your bill explodes. Successful traffic should show up in metrics, not logs.
Don't log secrets
Sanitize. Card numbers, passwords, tokens, PII. Use a structured logger that knows which fields to redact.
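A sketch of field-level redaction — the denylist below is illustrative; in practice this lives in a logger processor so nobody can forget to call it:

```python
SENSITIVE_FIELDS = {"password", "card_number", "token", "ssn"}  # illustrative

def redact(event: dict) -> dict:
    """Return a copy of a log event with sensitive fields masked.

    Recurses into nested dicts so fields like request.card_number
    are caught too.
    """
    out = {}
    for key, value in event.items():
        if key in SENSITIVE_FIELDS:
            out[key] = "[REDACTED]"
        elif isinstance(value, dict):
            out[key] = redact(value)
        else:
            out[key] = value
    return out

print(redact({"user_id": "42", "password": "hunter2",
              "request": {"card_number": "4242424242424242"}}))
```

Denylists miss renamed fields, so teams handling card data often invert this and allowlist known-safe fields instead.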
Traces: the microservices superpower
A trace represents one logical request as it moves through the system. Each step is a "span" — a piece of work with a start time, end time, and metadata.
Trace: checkout request from user 42
├─ api-gateway (5ms)
│ └─ auth-service (3ms)
├─ cart-service (15ms)
│ ├─ db-query users (4ms)
│ └─ db-query items (8ms)
├─ inventory-service (50ms)
│ └─ db-query stock (45ms) ← here's your latency
└─ payment-service (timed out at 5s)
In one view you see: which service was slow, which DB query was slow, and where it failed. No log-grepping across five services required.
How to instrument
Use OpenTelemetry. It's vendor-neutral, works in nearly every language, and exports to almost any backend (Jaeger, Tempo, Honeycomb, Datadog, New Relic).
from fastapi import FastAPI
from opentelemetry import trace

app = FastAPI()
tracer = trace.get_tracer(__name__)

@app.get("/checkout")
def checkout(user_id: int):
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user_id", user_id)
        cart = get_cart(user_id)  # creates a child span
        charge(cart)              # creates a child span
Auto-instrumentation libraries handle most of this for popular frameworks (FastAPI, Spring, Express). Hand-instrument the business operations that matter.
Sampling
Traces are expensive. Most production systems sample 1-10% of traces. Smart sampling keeps slow or errored traces with higher probability than fast successful ones — those are the ones you actually need to debug.
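The policy itself is simple enough to sketch — in practice the decision is made by the tracing SDK (OpenTelemetry ships ratio-based head samplers, and error/latency-aware keeping is usually done by tail sampling in a collector); the thresholds below are illustrative:

```python
import random

def keep_trace(duration_s: float, is_error: bool,
               base_rate: float = 0.05, slow_threshold_s: float = 1.0) -> bool:
    """Decide whether to keep a finished trace.

    Always keep errors and slow traces -- those are the ones you debug.
    Sample the boring fast successes at base_rate (5% here, illustrative).
    """
    if is_error or duration_s >= slow_threshold_s:
        return True
    return random.random() < base_rate
```

Note this sketch decides after the trace finishes (tail sampling); head sampling decides up front from the trace ID alone, which is cheaper but can't see errors or latency.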
Tying them together: correlation IDs
A trace ID is a single ID that follows a request across every service it touches. Inject it into:
- Every log line that processes the request.
- Every metric where it makes sense (usually as exemplars, not labels — high cardinality kills metrics stores).
- Every outbound HTTP/gRPC call so the next service receives it.
Now: see a slow trace → click into it → see logs from every service for that one request. This is the workflow that makes microservices observable.
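The standard wire format for step three is the W3C Trace Context `traceparent` header, which OpenTelemetry injects into outbound calls for you. A sketch of what that header carries, built by hand (the IDs below are example values):

```python
def make_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags.

    trace_id is 32 hex chars, span_id (the calling span) is 16 hex chars.
    """
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

# The receiving service parses this header and continues the same trace,
# so its spans and logs all carry the same trace_id.
headers = {"traceparent": make_traceparent(
    "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", sampled=True)}
print(headers["traceparent"])
# → 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

In real code you'd call `opentelemetry.propagate.inject(headers)` rather than formatting this yourself.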
SLIs, SLOs, and error budgets
A maturity step beyond raw metrics:
- SLI (Service Level Indicator): a metric you actually care about. "Percentage of requests served in under 300ms."
- SLO (Service Level Objective): the target. "99.5% of requests served in under 300ms over a 30-day window."
- Error budget: the complement of the SLO. You're allowed to fail 0.5% of requests in 30 days. If you've burned half of it by day 10, slow down on risky deploys.
This framing turns reliability into a budget you spend, not an aspirational "be fast and reliable" goal.
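The budget arithmetic is worth making explicit — a sketch using the 99.5% example, with hypothetical traffic numbers:

```python
def error_budget(slo: float, total_requests: int) -> int:
    """How many requests may fail in the window without breaking the SLO."""
    return int(total_requests * (1 - slo))

def budget_burned(failed: int, slo: float, total_requests: int) -> float:
    """Fraction of the error budget already spent (can exceed 1.0)."""
    return failed / error_budget(slo, total_requests)

# Hypothetical: 10M requests expected over the 30-day window.
budget = error_budget(0.995, 10_000_000)
print(budget)                                     # 50,000 allowed failures
print(budget_burned(25_000, 0.995, 10_000_000))   # 0.5 -- half the budget spent
```

Burn rate (budget spent per unit time, relative to the window) is what most SLO alerting is built on: burning 50% of a 30-day budget in 10 days is a warning; burning it in an hour is a page.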
Three rules for observability that pays off
- Instrument before you need it. The slow afternoon to add tracing to a healthy system is much cheaper than the 3am incident with no observability.
- Page on user pain, not infrastructure noise. "Error rate above 1%" is actionable; "CPU above 80%" is often not. Tie alerts to SLOs.
- Sample, aggregate, redact. The default settings of every observability tool will quadruple your bill. Be deliberate about what you store and at what fidelity.
What to read next
Observability is one piece of the operational puzzle. Microservices vs monolith covers when you should be running this many services in the first place. Event-driven architecture and distributed locks are two of the most common sources of "where did this go wrong" mysteries that tracing solves. The Uber HLD writeup is a useful applied example — at that scale of services, observability isn't optional infrastructure, it's the spine of the on-call experience.
Frequently asked questions
Do I really need traces for a small system?
If you have one service, no. If you have three or more services calling each other, yes — debugging without distributed tracing is much harder than instrumenting in the first place.
What about cost?
Observability bills are notoriously high. The trick is sampling traces (1-10% is usually enough), aggregating metrics rather than storing every event, and being deliberate about log verbosity. Log everything always = bankrupt.
Should I use OpenTelemetry?
Yes. It's vendor-neutral, well-supported, and lets you switch backends (Datadog, Honeycomb, Tempo, Jaeger) without rewriting instrumentation. Default for new systems.