Skip to content
Jarviix
HLD9 min read

Design a Distributed Logging System

Aggregated logging at TB/day: agent → broker → indexed store, with structured search, retention tiers, and rate-limit-aware ingestion.

hldsystem-designobservability

Intro

A distributed logging system aggregates logs from thousands of services and millions of containers. The architectural challenges are: (1) ingest at TB/day without dropping logs even during incidents, (2) structured search across the whole corpus, and (3) retention tiers (hot, warm, cold). Think Datadog / Elastic Stack / Splunk.

Functional

  • Agents collect logs from services + containers + servers.
  • Aggregator forwards to backend with retry + buffering.
  • Index for full-text + structured search.
  • Retention tiers: hot 7 days, warm 30 days, cold 1 year.

Non-functional

  • Ingest 1+ TB/day per region.
  • Search p95 < 1 s on hot index.
  • End-to-end latency p95 < 30 s (log → searchable).
  • Zero log loss under normal conditions; bounded loss during overload.

Components

  • Log agent

    Reads files / stdout / journald; tags + ships to broker.

  • Aggregator

    Receives from agents; batches + dedup; forwards to indexer.

  • Indexer

    Inverted index for text + structured fields (Elasticsearch / OpenSearch).

  • Object store

    Long-term archive (S3 + Parquet).

  • Search API

    Query the hot index + cold archive.

  • Rate limiter

    Per-tenant ingestion quotas.

Trade-offs

Push (agent → broker) vs pull (broker scrapes)

Pros

  • Push: real-time.
  • Pull: agent simpler.

Cons

  • Push: agent needs buffering.
  • Pull: latency.

Index everything vs schema-on-read

Pros

  • Index: fast queries.
  • Schema-on-read: cheap storage.

Cons

  • Index: storage explosion.
  • Schema-on-read: slow scans.

Scale concerns

  • Incident burst (10× normal volume).
  • Hot tenant — quota + isolation.
  • Index hot spot.
  • Retention rollover.

Related reads