
Production Readiness Checklist

Use this checklist before going live. Items are grouped by category; each links to the relevant documentation.


Infrastructure

  • ClickHouse is deployed with persistent volumes (not emptyDir)
  • ClickHouse data directory is backed up on a schedule (daily minimum)
  • Kafka is deployed with at least 3 brokers and replication factor ≥ 2
  • All InfraSage pods have resource requests and limits set
  • PodDisruptionBudgets are configured for ClickHouse and Kafka
  • Node affinity or taints prevent InfraSage pods from competing with your workloads for resources
  • Persistent volume claims use a storage class with WaitForFirstConsumer binding
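The Kafka replication item above can be spot-checked from the command line. The sketch below assumes the tab-separated summary lines that recent Kafka releases print for `kafka-topics.sh --describe` (`Topic: <name> ... ReplicationFactor: <n>`); older releases format these lines differently. The helper name `low_rf_topics` is made up for illustration.

```shell
# Reads `kafka-topics.sh --describe` output on stdin and prints
# "<topic> <replication-factor>" for every topic below the minimum.
# Assumes summary lines of the form:
#   Topic: events  TopicId: ...  PartitionCount: 3  ReplicationFactor: 2  Configs: ...
low_rf_topics() {
  min_rf="${1:-2}"
  awk -v min="$min_rf" '
    /ReplicationFactor:/ {
      topic = ""; rf = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:") topic = $(i + 1)
        if ($i == "ReplicationFactor:") rf = $(i + 1)
      }
      if (topic != "" && rf != "" && rf + 0 < min + 0) print topic, rf
    }
  '
}

# Usage (namespace and deployment name taken from the Go-Live Signoff section):
#   kubectl exec -n infrasage deploy/kafka -- \
#     kafka-topics.sh --bootstrap-server localhost:9092 --describe | low_rf_topics 2
```

Empty output means every listed topic meets the minimum.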

Secrets & Credentials

  • JWT_SECRET is a randomly generated 256-bit value (not a default or example value)
  • ClickHouse credentials are not the defaults (clickhouse / clickhouse)
  • Any test API keys used during onboarding have been revoked and replaced with production keys
  • Secrets are managed via Kubernetes Secrets (not hardcoded in ConfigMaps or Helm values files committed to git)
  • ANTHROPIC_API_KEY is stored as a Secret and referenced from the Deployment, not written as a literal environment variable value in the Deployment spec
  • Secret rotation schedule is documented
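As one way to satisfy the JWT_SECRET item, the sketch below generates a 256-bit random value and stores it in a Kubernetes Secret. The Secret name `infrasage-auth` and the function names are hypothetical; check which Secret name and key your InfraSage deployment actually reads.

```shell
# Prints 32 random bytes (256 bits) as 64 lowercase hex characters.
# `openssl rand -hex 32` is an equivalent alternative where OpenSSL is installed.
gen_secret_hex() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Stores a fresh value in a Kubernetes Secret rather than a ConfigMap or values file.
# "infrasage-auth" is a hypothetical Secret name; use the one your deployment expects.
create_jwt_secret() {
  kubectl create secret generic infrasage-auth \
    --namespace infrasage \
    --from-literal=JWT_SECRET="$(gen_secret_hex)"
}

# Usage: create_jwt_secret
```

Note that `--from-literal` places the value on the kubectl command line, so run it from a trusted shell session.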

Network & TLS

  • mTLS is enabled between InfraSage services (MTLS_ENABLED=true)
  • Ingestion Gateway is behind a load balancer or ingress with TLS termination
  • Admin UI is not exposed to the public internet (VPN or internal ingress only)
  • NetworkPolicies restrict inter-namespace traffic
  • Outbound egress is restricted to required endpoints only (see Data Residency)

Authentication & Authorization

  • Default tenant is not used in production (create named tenants)
  • Every service ingesting telemetry uses a dedicated API key with ingestion scope only
  • Human users are assigned the minimum required RBAC role (Viewer / Operator, not Admin)
  • Super-Admin role is assigned to no more than 2 accounts
  • API key expiration is configured for all non-system keys
  • Audit logging is enabled and audit log retention is ≥ 90 days

Scaling & Capacity

  • Scale profile (small / medium / large) matches expected event volume (see Scale Profiles)
  • ClickHouse storage is provisioned at ≥ 2× current data volume (for headroom)
  • Kafka retention is set to at least 24 hours (covers outage + replay)
  • Ingestion Gateway HPA is configured with a CPU or requests-per-second (RPS) target
  • AIops Engine replica count is ≥ 2 for high availability
  • ClickHouse memory limit is at least 4× the largest expected query result set

Retention & Compliance

  • Retention policy is configured per telemetry type (see Retention Policy)
  • Retention matches your compliance obligation (e.g., 1 year for financial services audit trails)
  • Log field exclusions are configured to strip PII before storage (LOG_FIELD_EXCLUSIONS)
  • Data residency requirements are satisfied — all ClickHouse volumes are in the correct region

Monitoring InfraSage Itself

  • Prometheus is scraping InfraSage's /metrics endpoints
  • Grafana dashboards are imported (System Health, Ingestion, Anomaly Detection)
  • Alertmanager rules are configured for:
    • Kafka consumer lag > threshold
    • Ingestion Gateway error rate > 1%
    • ClickHouse disk usage > 75%
    • AIops Engine unhealthy
  • On-call rotation includes InfraSage health as a monitored signal

Alerting & Incident Response

  • At least one notification channel is configured (Slack, PagerDuty, or webhook)
  • Anomaly detection sensitivity is tuned for your services (not left at default 3.0 for all)
  • Runbooks are tested in dry-run mode before enabling auto-execution
  • Runbook approval flow is configured for destructive actions (pod restarts, scaling)
  • Incident feedback loop is enabled so alert quality improves over time

Disaster Recovery

  • ClickHouse backup restoration has been tested (not just scheduled)
  • Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are documented
  • Procedure for restarting the full InfraSage stack from scratch is documented and tested
  • Kafka topic recreation steps are documented

Load Testing

  • Ingestion Gateway has been load tested at ≥ 2× expected peak throughput
  • Anomaly detection has been validated against known synthetic anomalies
  • RCA has been triggered at least once end-to-end in a staging environment
  • A runbook has been executed in staging (dry-run and live)

Go-Live Signoff

Run the built-in health check before going live:

curl http://infrasage-gateway.internal/health
curl http://infrasage-aiops.internal/health
curl http://infrasage-ml.internal/health

Expected response from each:

{ "status": "ok", "version": "x.y.z" }
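The three checks above can be wrapped in a small script that fails if any endpoint is unreachable or reports a status other than "ok". This is a minimal sketch: the function names are illustrative, the 5-second timeout is an arbitrary choice, and the hostnames are the ones used in this checklist.

```shell
# Health endpoints from this checklist; adjust hostnames for your cluster.
ENDPOINTS="http://infrasage-gateway.internal/health
http://infrasage-aiops.internal/health
http://infrasage-ml.internal/health"

# Succeeds only if the JSON body passed as $1 contains "status": "ok".
is_healthy() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Queries every endpoint; returns non-zero if any is unreachable or not ok.
check_all() {
  failed=0
  for url in $ENDPOINTS; do
    if body=$(curl -fsS --max-time 5 "$url") && is_healthy "$body"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}

# Usage: check_all || echo "health check failed; do not go live yet"
```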

Check Kafka consumer lag:

kubectl exec -n infrasage deploy/kafka -- \
  kafka-consumer-groups.sh \
    --bootstrap-server localhost:9092 \
    --describe \
    --group telemetry-operator

At steady state, the LAG column should be 0 or close to 0 for every partition.
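To reduce the per-partition output to a single number, the LAG column can be summed. The sketch below locates the column by its header name rather than by position, since the column layout differs between Kafka versions; partitions that show "-" (no committed offset) are skipped. The function name `total_lag` is illustrative.

```shell
# Reads `kafka-consumer-groups.sh --describe` output on stdin and prints
# the sum of the LAG column across all partitions.
total_lag() {
  awk '
    # Until the header row is seen, look for the field named LAG.
    lag_col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i; next }
    # Add numeric lag values only; "-" and repeated headers are ignored.
    $lag_col ~ /^[0-9]+$/ { sum += $lag_col }
    END { print sum + 0 }
  '
}

# Usage:
#   kubectl exec -n infrasage deploy/kafka -- \
#     kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#       --describe --group telemetry-operator | total_lag
```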


See also: Troubleshooting · Scale Profiles · Security