Paperclip CTO 3d541d818a feat(TRA-249): M5 observability, SLOs, backup, and release readiness
- Add prometheus-client to base requirements; sentry-sdk to prod
- api/metrics.py: define HTTP latency histogram, request/error counters, in-flight gauge
- api/middleware.py: extend SecurityAuditMiddleware to update all four Prometheus collectors on each request; derive a low-cardinality path_template label via the URL resolver
- api/views.py: /metrics/ endpoint (gated by METRICS_ENABLED setting)
- api/urls.py: wire /metrics/ route
- config/settings/prod.py: METRICS_ENABLED flag; optional Sentry SDK init via SENTRY_DSN env var
- ops/prometheus/alerts.yml: Prometheus alert rules for p95 latency SLO (≤500 ms), error rate SLO (<1%), availability, and saturation
- ops/prometheus/prometheus.yml: scrape config for app + blackbox healthcheck probe
- ops/scripts/backup.sh: pg_dump → S3 STANDARD_IA with retention metadata
- ops/scripts/restore.sh: pg_restore from S3 or local file with interactive confirmation guard
- ops/scripts/synthetic-check.sh: post-deploy smoke test (healthz, metrics gate, schema, 404 shape)
- docs/TRA-249-observability-slos.md: SLO table, PromQL reference queries, alert routing
- docs/TRA-249-backup-restore.md: RPO/RTO targets, drill procedure, restore validation steps
- docs/TRA-249-release-checklist.md: pre/post-deploy checklist
- docs/TRA-249-rollback-runbook.md: decision matrix, app rollback, migration revert, DB restore path
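
For reference, the two numeric SLOs listed above (p95 latency ≤500 ms, error rate <1%) can be sanity-checked offline against a sample of request data. The sketch below is only an illustration: the function names `p95` and `check_slos` are hypothetical, and it uses an exact nearest-rank percentile over raw samples, whereas the alert rules in ops/prometheus/alerts.yml presumably work from histogram buckets and will give an estimate instead.

```python
# Thresholds taken from the SLOs named in the commit message.
P95_LATENCY_SLO_MS = 500.0  # p95 latency must be <= 500 ms
ERROR_RATE_SLO = 0.01       # error rate must be < 1%


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    ordered = sorted(latencies_ms)
    # 1-based nearest-rank index: ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_slos(latencies_ms, error_count):
    """Return a dict mapping SLO name to True (met) or False (violated).

    Hypothetical helper: one latency sample per request, plus the number
    of those requests that ended in an error.
    """
    total = len(latencies_ms)
    return {
        "latency_p95": p95(latencies_ms) <= P95_LATENCY_SLO_MS,
        "error_rate": (error_count / total) < ERROR_RATE_SLO,
    }


if __name__ == "__main__":
    sample = [120.0] * 95 + [900.0] * 5  # 5% of requests are slow
    print(check_slos(sample, error_count=0))
```

Note that the error-rate comparison is strict, so a window with exactly 1% errors counts as a violation, while a p95 of exactly 500 ms still meets the latency SLO.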

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-05-07 09:14:18 +02:00