HLD9 min read

Design a Distributed Logging System

Aggregated logging at TB/day: agent → broker → indexed store, with structured search, retention tiers, and rate-limit-aware ingestion.

hldsystem-designobservability

Intro

A distributed logging system aggregates logs from thousands of services and millions of containers. The architectural challenges are: (1) ingest at TB/day without dropping logs even during incidents, (2) structured search across the whole corpus, and (3) retention tiers (hot, warm, cold). Think Datadog / Elastic Stack / Splunk.

Functional

Agents collect logs from services + containers + servers.
Aggregator forwards to backend with retry + buffering.
Index for full-text + structured search.
Retention tiers: hot 7 days, warm 30 days, cold 1 year.

Non-functional

Ingest 1+ TB/day per region.
Search p95 < 1 s on hot index.
End-to-end latency p95 < 30 s (log → searchable).
Zero log loss under normal conditions; bounded loss during overload.

Components

Log agent
Reads files / stdout / journald; tags + ships to broker.
Aggregator
Receives from agents; batches + dedup; forwards to indexer.
Indexer
Inverted index for text + structured fields (Elasticsearch / OpenSearch).
Object store
Long-term archive (S3 + Parquet).
Search API
Query the hot index + cold archive.
Rate limiter
Per-tenant ingestion quotas.

Trade-offs

Push (agent → broker) vs pull (broker scrapes)

Pros

Push: real-time.
Pull: agent simpler.

Cons

Push: agent needs buffering.
Pull: latency.

Index everything vs schema-on-read