Full-Stack Observability
Date: 2026-03-13 Status: accepted
Context
LuckyPlans has no monitoring infrastructure: no structured logging, no metrics collection, and no distributed tracing. As the platform grows (API gateway, microservices, Redis, PostgreSQL, Keycloak), we need visibility into system health, performance, and errors. Today, debugging means connecting over SSH and tailing logs by hand. There is no alerting, no way to trace a request across services, and no data for capacity planning.
Decision
Add a comprehensive observability stack covering the three pillars — metrics, logs, and traces — using open-source, self-hosted tools:
| Component | Role |
|---|---|
| Prometheus | Metrics collection and storage (scrapes OTel Collector, Redis Exporter, Keycloak) |
| Grafana | Unified visualization for metrics, logs, and traces |
| Loki | Log aggregation (lightweight alternative to ELK) |
| Tempo | Distributed tracing backend (native Grafana integration) |
| OpenTelemetry Collector | Telemetry pipeline — receives OTLP from apps, exports to Prometheus/Loki/Tempo |
| Promtail | Log shipper — ships pod logs to Loki |
NestJS services are instrumented with the OpenTelemetry Node.js SDK, which auto-instruments HTTP, Express, ioredis, and GraphQL. Structured logging uses Pino (via nestjs-pino) with JSON output and trace context correlation.
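As an illustration of what trace context correlation means for a log line, the sketch below parses a W3C `traceparent` header and attaches its IDs to a JSON record. It is a minimal, dependency-free sketch and not the nestjs-pino or OpenTelemetry API; in the real services the IDs would come from the active OpenTelemetry span. The function names and the `trace_id` / `span_id` field names are assumptions made for this example.

```typescript
// Hypothetical sketch: shows the shape of a trace-correlated JSON log line.
// It does not use nestjs-pino or the OpenTelemetry SDK.

interface TraceContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

// W3C Trace Context header layout: version-traceid-parentid-flags (lowercase hex).
const TRACEPARENT_PATTERN =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed context, or null when the header is malformed.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return null;
  const version = match[1];
  const traceId = match[2];
  const spanId = match[3];
  const flags = match[4];
  // Version "ff" and all-zero IDs are invalid per the W3C specification.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(spanId)) {
    return null;
  }
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Builds one JSON log line; the field names are assumptions for this sketch.
function formatLogLine(
  level: string,
  msg: string,
  timeMs: number,
  ctx: TraceContext | null,
): string {
  const record: Record<string, unknown> = { level, time: timeMs, msg };
  if (ctx !== null) {
    record["trace_id"] = ctx.traceId;
    record["span_id"] = ctx.spanId;
  }
  return JSON.stringify(record);
}
```

With the trace ID present in every log line, a log viewer can link a line to its trace and a trace back to its log lines.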
The observability stack is deployed as a separate Helm chart (infrastructure/helm/observability/) in a dedicated monitoring namespace, managed by its own ArgoCD Application.
Consequences
Benefits:
- End-to-end request tracing: web → gateway → Redis → service-core
- RED metrics (Rate, Errors, Duration) for all services
- Structured JSON logs with trace IDs for correlation
- Alerting on service down, high error rate, pod restart loops
- Trace-to-log and log-to-trace correlation in Grafana
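To make the RED metrics and the high-error-rate alert concrete, here is a minimal sketch that computes rate, error ratio, and a 95th-percentile duration from raw request samples. In the deployed stack Prometheus would derive these from counters and histograms; the types, the function names, the choice of 5xx as the error definition, and any threshold passed in are assumptions made for illustration.

```typescript
// Hypothetical sketch: RED (Rate, Errors, Duration) computed over one time window.

interface RequestSample {
  durationMs: number;
  statusCode: number;
}

interface RedSummary {
  ratePerSecond: number; // Rate: requests per second in the window
  errorRatio: number;    // Errors: share of requests answered with 5xx
  p95DurationMs: number; // Duration: 95th percentile latency
}

function summarizeRed(samples: RequestSample[], windowSeconds: number): RedSummary {
  if (windowSeconds <= 0) {
    throw new RangeError("windowSeconds must be positive");
  }
  if (samples.length === 0) {
    return { ratePerSecond: 0, errorRatio: 0, p95DurationMs: 0 };
  }
  const errorCount = samples.filter((s) => s.statusCode >= 500).length;
  const sorted = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  // Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
  const rank = Math.ceil(0.95 * sorted.length);
  return {
    ratePerSecond: samples.length / windowSeconds,
    errorRatio: errorCount / samples.length,
    p95DurationMs: sorted[rank - 1],
  };
}

// Alert condition for "high error rate"; the threshold is an assumed parameter.
function isHighErrorRate(summary: RedSummary, threshold: number): boolean {
  return summary.errorRatio > threshold;
}
```

For example, ten requests in a five-second window with one 500 response give a rate of 2 requests per second and an error ratio of 0.1.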
Trade-offs:
- Additional resource usage (roughly 900Mi of memory on the single-node k3s cluster)
- More infrastructure to maintain (7 new components)
- OTel auto-instrumentation adds a small latency overhead to each request
Alternatives considered:
- ELK Stack: too heavy for a single-node k3s cluster
- Jaeger: Tempo is lighter, integrates natively with Grafana, and needs no separate UI
- Datadog/New Relic: rejected for cost and vendor lock-in; the project is self-hosted
- Direct export (no OTel Collector): rejected because the Collector provides buffering, retry, and fan-out
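The buffering, retry, and fan-out that motivate keeping the Collector can be sketched in a few lines. This is a simplified, synchronous model and not the OpenTelemetry Collector's implementation, which is asynchronous, uses backoff, and runs separate pipelines per signal type. `FanOutPipeline`, `NamedExporter`, and their parameters are hypothetical names introduced for this example.

```typescript
// Hypothetical sketch of buffering, retry, and fan-out in a telemetry pipeline.

interface NamedExporter<T> {
  name: string;
  // Returns true when the batch was accepted by the backend.
  send: (batch: readonly T[]) => boolean;
}

class FanOutPipeline<T> {
  private buffer: T[] = [];

  constructor(
    private readonly exporters: NamedExporter<T>[],
    private readonly batchSize: number,
    private readonly maxAttempts: number,
  ) {}

  // Buffering: items accumulate until a full batch is available.
  // Returns the names of exporters that failed, or [] if nothing was flushed.
  receive(item: T): string[] {
    this.buffer.push(item);
    return this.buffer.length >= this.batchSize ? this.flush() : [];
  }

  // Fan-out: the same batch goes to every exporter.
  // Retry: each exporter gets up to maxAttempts tries before it is reported as failed.
  flush(): string[] {
    const batch = this.buffer;
    this.buffer = [];
    const failed: string[] = [];
    if (batch.length === 0) return failed;
    for (const exporter of this.exporters) {
      let delivered = false;
      for (let attempt = 1; attempt <= this.maxAttempts && !delivered; attempt++) {
        delivered = exporter.send(batch);
      }
      if (!delivered) failed.push(exporter.name);
    }
    return failed;
  }
}
```

With direct export, each service would need its own version of this logic for every backend; with the Collector, services send OTLP once and the pipeline handles delivery.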