Full-Stack Observability

Date: 2026-03-13
Status: accepted

Context

LuckyPlans currently has no monitoring infrastructure: no structured logging, no metrics collection, and no distributed tracing. As the platform grows (API gateway, microservices, Redis, PostgreSQL, Keycloak), we need visibility into system health, performance, and errors. Without observability, debugging means SSH access and manual log tailing, and we have no alerting, no request tracing across services, and no capacity planning data.

Decision

Add a comprehensive observability stack covering the three pillars — metrics, logs, and traces — using open-source, self-hosted tools:

| Component               | Role                                                                               |
|-------------------------|------------------------------------------------------------------------------------|
| Prometheus              | Metrics collection and storage (scrapes OTel Collector, Redis Exporter, Keycloak)  |
| Grafana                 | Unified visualization for metrics, logs, and traces                                |
| Loki                    | Log aggregation (lightweight alternative to ELK)                                   |
| Tempo                   | Distributed tracing backend (native Grafana integration)                           |
| OpenTelemetry Collector | Telemetry pipeline: receives OTLP from apps, exports to Prometheus/Loki/Tempo      |
| Promtail                | Log shipper (ships pod logs to Loki)                                               |

NestJS services are instrumented with the OpenTelemetry Node.js SDK, which auto-instruments HTTP, Express, ioredis, and GraphQL. Structured logging uses Pino (via nestjs-pino) with JSON output and trace context correlation.
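
To make "trace context correlation" concrete, here is a minimal, dependency-free sketch of the idea: read the trace ID and span ID from a W3C `traceparent` header and attach them to a JSON log record. In the stack described above the OpenTelemetry SDK and nestjs-pino would handle this; the function names and the `trace_id` / `span_id` field names below are illustrative assumptions, not the API of either library.

```typescript
// Illustrative sketch only: parseTraceparent and logLine are hypothetical
// helpers, not part of the OpenTelemetry SDK or nestjs-pino.

interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// Parse a version-00 W3C Trace Context header of the form
//   version "-" trace-id "-" parent-id "-" trace-flags
// Returns null for anything that does not match that shape.
function parseTraceparent(header: string): TraceContext | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header.trim(),
  );
  if (match === null) return null;
  const [, version, traceId, spanId, flags] = match;
  // Version "ff" and all-zero IDs are invalid per the specification.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(spanId)) {
    return null;
  }
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Build one structured log line; trace fields are added only when a
// trace context is available (field names are an assumption).
function logLine(level: string, msg: string, ctx: TraceContext | null): string {
  const base = { level, msg };
  const record =
    ctx === null ? base : { ...base, trace_id: ctx.traceId, span_id: ctx.spanId };
  return JSON.stringify(record);
}
```

With a record shaped like this, a log backend can index the trace ID, which is what lets a viewer jump from a log line to the matching trace and back.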

The observability stack is deployed as a separate Helm chart (infrastructure/helm/observability/) in a dedicated monitoring namespace, managed by its own ArgoCD Application.

Consequences

Benefits:

  • End-to-end request tracing: web → gateway → Redis → service-core
  • RED metrics (Rate, Errors, Duration) for all services
  • Structured JSON logs with trace IDs for correlation
  • Alerting on service down, high error rate, pod restart loops
  • Trace-to-log and log-to-trace correlation in Grafana
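
As a rough illustration of what the RED metrics above measure, the sketch below computes rate, error ratio, and a duration percentile from an in-memory list of request samples. This is not how Prometheus works (it would derive these from counters and histograms at query time); the types, the 5xx-as-error rule, and the nearest-rank percentile are assumptions chosen to keep the example small.

```typescript
// Illustrative sketch only: all names here are hypothetical.

interface RequestSample {
  durationMs: number;
  statusCode: number;
}

interface RedSummary {
  ratePerSecond: number; // Rate: requests per second over the window
  errorRatio: number; // Errors: fraction of requests that failed
  p95DurationMs: number; // Duration: 95th percentile latency
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedValues: number[], p: number): number {
  if (sortedValues.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Summarize one observation window. Treating only 5xx responses as
// errors is an assumption; a service may define errors differently.
function summarizeRed(samples: RequestSample[], windowSeconds: number): RedSummary {
  const errors = samples.filter((s) => s.statusCode >= 500).length;
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  return {
    ratePerSecond: samples.length / windowSeconds,
    errorRatio: samples.length === 0 ? 0 : errors / samples.length,
    p95DurationMs: percentile(durations, 95),
  };
}
```

An alert such as "high error rate" then amounts to comparing `errorRatio` against a threshold over a sustained window.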

Trade-offs:

  • Additional resource usage (~900Mi memory on k3s single-node)
  • More infrastructure to maintain (7 new components)
  • OTel auto-instrumentation adds a small latency overhead to requests

Alternatives considered:

  • ELK Stack — too heavy for k3s single-node
  • Jaeger — Tempo is lighter, has native Grafana integration, no separate UI
  • Datadog/New Relic — cost, vendor lock-in, project is self-hosted
  • Direct export (no OTel Collector) — Collector provides buffering, retry, and fan-out
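
The last point, that a Collector provides buffering, retry, and fan-out, can be sketched in a few lines. The toy pipeline below buffers incoming items, then on flush sends the same batch to every configured exporter and retries each one a bounded number of times. It is a synchronous simplification with hypothetical names; the real OpenTelemetry Collector is configured declaratively and retries asynchronously with backoff.

```typescript
// Illustrative sketch only: createPipeline and NamedExporter are
// hypothetical and do not mirror the OpenTelemetry Collector's internals.

interface NamedExporter<T> {
  name: string;
  // Returns true when the backend accepted the batch.
  send: (batch: T[]) => boolean;
}

function createPipeline<T>(exporters: NamedExporter<T>[], maxAttempts: number) {
  let buffer: T[] = [];
  return {
    // Buffering: applications hand items over and return immediately.
    receive(item: T): void {
      buffer.push(item);
    },
    // Fan-out and retry: deliver the buffered batch to every exporter,
    // trying each up to maxAttempts times. Returns the names of exporters
    // that never accepted the batch; this sketch drops the batch for
    // those exporters rather than re-queueing it.
    flush(): string[] {
      const batch = buffer;
      buffer = [];
      const failed: string[] = [];
      exporters.forEach((exporter) => {
        let delivered = false;
        for (let attempt = 1; attempt <= maxAttempts && !delivered; attempt++) {
          delivered = exporter.send(batch);
        }
        if (!delivered) failed.push(exporter.name);
      });
      return failed;
    },
  };
}
```

With direct export, each service would need its own version of this logic for every backend; routing through one pipeline keeps it in a single place.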