Observability for Microservices: Logs, Metrics, and Traces That Actually Help
How to instrument a service so that when something breaks at 3am, you can find the cause in minutes — not hours. Logs, metrics, traces, and the patterns that tie them together.
By Jarviix Engineering · Apr 19, 2026
When a single user's checkout fails at 3am, observability is the difference between "fixed in 15 minutes" and "still investigating at noon". For monoliths it's straightforward — read the logs. For microservices, where one user request fans out across a dozen services, you need something better.
This post covers what to instrument, how the three pillars (logs, metrics, traces) fit together, and the patterns that actually help on-call engineers.
The three pillars, briefly
The standard framing:
- Metrics — numerical aggregates over time. "Requests per second", "p99 latency", "error rate". Cheap to store, fast to query, perfect for dashboards and alerts.
- Logs — discrete events with arbitrary detail. "User 42 hit the /checkout endpoint and we returned 500 because Stripe timed out." Expensive to store, rich in context.
- Traces — a representation of one request's journey across services. "User 42's request: API gateway → cart service → inventory service (took 2s) → payment service (timed out)."
You need all three. Each one answers a different question:
- Metrics: is something wrong? (alerts, dashboards)
- Logs: what exactly happened in this one case? (deep debugging)
- Traces: which service caused the problem? (distributed root cause)
Metrics: the foundation
Three metric types cover most needs:
Counters
Monotonically increasing values. "Total HTTP requests served". Query the rate of change over a window to get "requests per second".
http_requests_total.labels(method="GET", route="/users", status="200").inc()
Gauges
Point-in-time values. "Current open connections", "queue depth", "CPU temperature".
Histograms
Distributions of values. "Request duration in seconds". Lets you compute p50, p95, p99 percentiles.
request_duration_seconds.labels(route="/users").observe(0.234)
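To make the histogram idea concrete, here is a minimal pure-Python sketch of how a Prometheus-style bucketed histogram estimates percentiles — a real metrics client does this for you; the bucket bounds below are illustrative, not a recommendation:

```python
# Illustrative bucket upper bounds, in seconds.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, float("inf")]

class Histogram:
    def __init__(self, buckets=BUCKETS):
        self.buckets = buckets
        self.counts = [0] * len(buckets)  # observations per bucket
        self.total = 0

    def observe(self, value):
        # Count the observation into the first bucket whose bound covers it.
        for i, bound in enumerate(self.buckets):
            if value <= bound:
                self.counts[i] += 1
                break
        self.total += 1

    def percentile(self, q):
        # Walk buckets until the cumulative count crosses the target rank;
        # report that bucket's upper bound as the percentile estimate.
        target = q * self.total
        cumulative = 0
        for bound, count in zip(self.buckets, self.counts):
            cumulative += count
            if cumulative >= target:
                return bound
        return float("inf")

h = Histogram()
for duration in [0.02, 0.03, 0.04, 0.2, 0.9]:
    h.observe(duration)
print(h.percentile(0.5))  # median falls in the 0.05s bucket
```

The trade-off is visible here: a histogram stores a handful of counters instead of every observation, so percentiles are cheap but only as precise as the bucket bounds.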
The four golden signals
The minimum every service should expose, popularized by Google's SRE book:
- Latency. How long requests take, broken down by success/failure.
- Traffic. Requests per second.
- Errors. Rate of failed requests.
- Saturation. How "full" the service is — CPU, memory, queue depth.
Alert on these. Build dashboards on these. If a service has nothing else, it should at least have these four.
RED and USE methods
Two complementary sets:
- RED (for request-driven services): Rate, Errors, Duration.
- USE (for resources): Utilization, Saturation, Errors.
Combine them — RED for endpoints, USE for the underlying CPU/disk/network/queue.
Logs: rich detail, used sparingly
Logs are where you record the things metrics can't capture: error stack traces, request bodies (sanitized), business-context information.
Key practices:
Structured logs
Plain text logs are useless at scale. Use JSON.
{
"timestamp": "2026-04-19T03:14:22Z",
"level": "error",
"service": "checkout",
"trace_id": "5b7c9...",
"span_id": "a3f1...",
"user_id": "42",
"order_id": "ord_abc",
"message": "stripe charge failed",
"error": "TimeoutError: connection reset"
}
Now you can query: "show me all errors for user 42", "show me everything in trace 5b7c9", "show me all stripe timeouts in the last hour".
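You don't need a framework to get this shape — a sketch of a JSON formatter using only the standard library (structlog and python-json-logger are the usual off-the-shelf options; the service name and `ctx` field here are assumptions to match the example above):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "checkout",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Merge structured context passed via extra={"ctx": {...}}.
        entry.update(getattr(record, "ctx", {}))
        return json.dumps(entry)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("stripe charge failed",
             extra={"ctx": {"user_id": "42", "order_id": "ord_abc"}})
```

In a real service the trace_id and span_id would be added by the tracing SDK's logging integration rather than by hand.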
Log levels and discipline
- DEBUG: turned off in production by default. Local development noise.
- INFO: notable business events. "Order placed", "User signed up".
- WARN: something is unusual but not broken. Retries, fallbacks, deprecated API hits.
- ERROR: something broke. Always include a stack trace.
A common antipattern: logging every successful request at INFO. You drown in noise and your bill explodes. Successful traffic should show up in metrics, not logs.
Don't log secrets
Sanitize. Card numbers, passwords, tokens, PII. Use a structured logger that knows which fields to redact.
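A sketch of field-level redaction — the denylist below is illustrative; in practice this lives in a logger processor so nobody can forget to call it:

```python
SENSITIVE_FIELDS = {"password", "card_number", "token", "ssn"}  # illustrative

def redact(event: dict) -> dict:
    """Return a copy of a log event with sensitive fields masked.

    Recurses into nested dicts so fields like request.card_number
    are caught too.
    """
    out = {}
    for key, value in event.items():
        if key in SENSITIVE_FIELDS:
            out[key] = "[REDACTED]"
        elif isinstance(value, dict):
            out[key] = redact(value)
        else:
            out[key] = value
    return out

print(redact({"user_id": "42", "password": "hunter2",
              "request": {"card_number": "4242424242424242"}}))
```

Denylists miss renamed fields, so teams handling card data often invert this and allowlist known-safe fields instead.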
Traces: the microservices superpower
A trace represents one logical request as it moves through the system. Each step is a "span" — a piece of work with a start time, end time, and metadata.
Trace: checkout request from user 42
├─ api-gateway (5ms)
│ └─ auth-service (3ms)
├─ cart-service (15ms)
│ ├─ db-query users (4ms)
│ └─ db-query items (8ms)
├─ inventory-service (50ms)
│ └─ db-query stock (45ms) ← here's your latency
└─ payment-service (timed out at 5s)
In one view you see: which service was slow, which DB query was slow, and where it failed. No log-grepping across five services required.
How to instrument
Use OpenTelemetry. It's vendor-neutral, works in nearly every language, and exports to almost any backend (Jaeger, Tempo, Honeycomb, Datadog, New Relic).
from fastapi import FastAPI
from opentelemetry import trace

app = FastAPI()
tracer = trace.get_tracer(__name__)

@app.get("/checkout")
def checkout(user_id: int):
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user_id", user_id)
        cart = get_cart(user_id)  # creates a child span
        charge(cart)              # creates a child span
Auto-instrumentation libraries handle most of this for popular frameworks (FastAPI, Spring, Express). Hand-instrument the business operations that matter.
Sampling
Traces are expensive. Most production systems sample 1-10% of traces. Smart sampling keeps slow or errored traces with higher probability than fast successful ones — those are the ones you actually need to debug.
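The policy itself is simple enough to sketch — in practice the decision is made by the tracing SDK (OpenTelemetry ships ratio-based head samplers, and error/latency-aware keeping is usually done by tail sampling in a collector); the thresholds below are illustrative:

```python
import random

def keep_trace(duration_s: float, is_error: bool,
               base_rate: float = 0.05, slow_threshold_s: float = 1.0) -> bool:
    """Decide whether to keep a finished trace.

    Always keep errors and slow traces -- those are the ones you debug.
    Sample the boring fast successes at base_rate (5% here, illustrative).
    """
    if is_error or duration_s >= slow_threshold_s:
        return True
    return random.random() < base_rate
```

Note this sketch decides after the trace finishes (tail sampling); head sampling decides up front from the trace ID alone, which is cheaper but can't see errors or latency.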
Tying them together: correlation IDs
A trace ID is a single ID that follows a request across every service it touches. Inject it into:
- Every log line that processes the request.
- Every metric where it makes sense (usually as exemplars, not labels — high cardinality kills metrics stores).
- Every outbound HTTP/gRPC call so the next service receives it.
Now: see a slow trace → click into it → see logs from every service for that one request. This is the workflow that makes microservices observable.
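The standard wire format for step three is the W3C Trace Context `traceparent` header, which OpenTelemetry injects into outbound calls for you. A sketch of what that header carries, built by hand (the IDs below are example values):

```python
def make_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags.

    trace_id is 32 hex chars, span_id (the calling span) is 16 hex chars.
    """
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

# The receiving service parses this header and continues the same trace,
# so its spans and logs all carry the same trace_id.
headers = {"traceparent": make_traceparent(
    "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", sampled=True)}
print(headers["traceparent"])
# → 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

In real code you'd call `opentelemetry.propagate.inject(headers)` rather than formatting this yourself.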
SLIs, SLOs, and error budgets
A maturity step beyond raw metrics:
- SLI (Service Level Indicator): a metric you actually care about. "Percentage of requests served in under 300ms."
- SLO (Service Level Objective): the target. "99.5% of requests served in under 300ms over a 30-day window."
- Error budget: the complement of the SLO. You're allowed to fail 0.5% of requests in 30 days. If you've burned half of it by day 10, slow down on risky deploys.
This framing turns reliability into a budget you spend, not an aspirational "be fast and reliable" goal.
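The budget arithmetic is worth making explicit — a sketch using the 99.5% example, with hypothetical traffic numbers:

```python
def error_budget(slo: float, total_requests: int) -> int:
    """How many requests may fail in the window without breaking the SLO."""
    return int(total_requests * (1 - slo))

def budget_burned(failed: int, slo: float, total_requests: int) -> float:
    """Fraction of the error budget already spent (can exceed 1.0)."""
    return failed / error_budget(slo, total_requests)

# Hypothetical: 10M requests expected over the 30-day window.
budget = error_budget(0.995, 10_000_000)
print(budget)                                     # 50,000 allowed failures
print(budget_burned(25_000, 0.995, 10_000_000))   # 0.5 -- half the budget spent
```

Burn rate (budget spent per unit time, relative to the window) is what most SLO alerting is built on: burning 50% of a 30-day budget in 10 days is a warning; burning it in an hour is a page.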
Three rules for observability that pays off
- Instrument before you need it. The slow afternoon to add tracing to a healthy system is much cheaper than the 3am incident with no observability.
- Page on user pain, not infrastructure noise. "Error rate above 1%" is actionable; "CPU above 80%" is often not. Tie alerts to SLOs.
- Sample, aggregate, redact. The default settings of every observability tool will quadruple your bill. Be deliberate about what you store and at what fidelity.
What to read next
Observability is one piece of the operational puzzle. Microservices vs monolith covers when you should be running this many services in the first place. Event-driven architecture and distributed locks are two of the most common sources of "where did this go wrong" mysteries that tracing solves. The Uber HLD writeup is a useful applied example — at that scale of services, observability isn't optional infrastructure, it's the spine of the on-call experience.
Frequently asked questions
Do I really need traces for a small system?
If you have one service, no. If you have three or more services calling each other, yes — debugging without distributed tracing is much harder than instrumenting in the first place.
What about cost?
Observability bills are notoriously high. The trick is sampling traces (1-10% is usually enough), aggregating metrics rather than storing every event, and being deliberate about log verbosity. Log everything always = bankrupt.
Should I use OpenTelemetry?
Yes. It's vendor-neutral, well-supported, and lets you switch backends (Datadog, Honeycomb, Tempo, Jaeger) without rewriting instrumentation. Default for new systems.