Production Readiness Checklist
Use this checklist before going live. Items are grouped by category; each links to the relevant documentation.
Infrastructure
- ClickHouse is deployed with persistent volumes (not emptyDir)
- ClickHouse data directory is backed up on a schedule (daily minimum)
- Kafka is deployed with at least 3 brokers and replication factor ≥ 2
- All InfraSage pods have resource requests and limits set
- PodDisruptionBudgets are configured for ClickHouse and Kafka
- Node affinity or taints prevent InfraSage pods from competing with your workloads for resources
- Persistent volume claims use a storage class with WaitForFirstConsumer binding
Secrets & Credentials
- JWT_SECRET is a randomly generated 256-bit value (not a default or example value)
- ClickHouse credentials are not the defaults (clickhouse/clickhouse)
- API keys are rotated from any test keys used during onboarding
- Secrets are managed via Kubernetes Secrets (not hardcoded in ConfigMaps or Helm values files committed to git)
- ANTHROPIC_API_KEY is stored as a Secret, not as a literal environment variable value in the Deployment spec
- Secret rotation schedule is documented
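One way to produce a random 256-bit value for JWT_SECRET is to read 32 bytes from the system random source and hex-encode them. The sketch below assumes a Unix-like host with /dev/urandom; the helper name gen_secret_hex and the Secret name infrasage-jwt in the commented command are illustrative, not names defined by InfraSage.

```shell
# Hypothetical helper: print 32 random bytes (256 bits) as 64 lowercase hex characters.
gen_secret_hex() {
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
}

# Example (assumption: the Secret is named "infrasage-jwt" in the "infrasage" namespace):
# kubectl create secret generic infrasage-jwt -n infrasage \
#   --from-literal=JWT_SECRET="$(gen_secret_hex)"
```

Passing the value straight into kubectl create secret keeps it out of Helm values files and shell history files that might be committed or shared.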
Network & TLS
- mTLS is enabled between InfraSage services (MTLS_ENABLED=true)
- Ingestion Gateway is behind a load balancer or ingress with TLS termination
- Admin UI is not exposed to the public internet (VPN or internal ingress only)
- NetworkPolicies restrict inter-namespace traffic
- Outbound egress is restricted to required endpoints only (see Data Residency)
Authentication & Authorization
- Default tenant is not used in production (create named tenants)
- Every service ingesting telemetry uses a dedicated API key with ingestion scope only
- Human users are assigned the minimum required RBAC role (Viewer / Operator, not Admin)
- Super-Admin role is assigned to no more than 2 accounts
- API key expiration is configured for all non-system keys
- Audit logging is enabled and audit log retention is ≥ 90 days
Scaling & Capacity
- Scale profile matches expected event volume (small/medium/large); see Scale Profiles
- ClickHouse storage is provisioned at ≥ 2× current data volume (for headroom)
- Kafka retention is set to at least 24 hours (covers outage + replay)
- Ingestion Gateway HPA is configured with a CPU/RPS target
- AIops Engine replica count ≥ 2 for HA
- ClickHouse memory limit is at least 4× the largest expected query result set
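For the 24-hour Kafka retention item above, note that Kafka expresses topic retention in milliseconds through the retention.ms topic setting. The following is a minimal sketch of the conversion; the helper name hours_to_ms is illustrative, the topic name is left as a placeholder, and the commented kafka-configs.sh call assumes a Kafka version that accepts --bootstrap-server for topic configuration.

```shell
# Hypothetical helper: convert a retention period in hours to milliseconds.
hours_to_ms() {
  echo $(( $1 * 60 * 60 * 1000 ))
}

hours_to_ms 24   # prints 86400000

# Example (assumption: replace <topic> with your telemetry topic name):
# kubectl exec -n infrasage deploy/kafka -- \
#   kafka-configs.sh --bootstrap-server localhost:9092 \
#   --entity-type topics --entity-name <topic> \
#   --alter --add-config retention.ms="$(hours_to_ms 24)"
```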
Retention & Compliance
- Retention policy is configured per telemetry type (see Retention Policy)
- Retention matches your compliance obligation (e.g., 1 year for financial services audit trails)
- Log field exclusions are configured to strip PII before storage (LOG_FIELD_EXCLUSIONS)
- Data residency requirements are satisfied: all ClickHouse volumes are in the correct region
Monitoring InfraSage Itself
- Prometheus is scraping InfraSage's /metrics endpoints
- Grafana dashboards are imported (System Health, Ingestion, Anomaly Detection)
- Alertmanager rules are configured for:
  - Kafka consumer lag > threshold
  - Ingestion Gateway error rate > 1%
  - ClickHouse disk usage > 75%
  - AIops Engine unhealthy
- On-call rotation includes InfraSage health as a monitored signal
Alerting & Incident Response
- At least one notification channel is configured (Slack, PagerDuty, or webhook)
- Anomaly detection sensitivity is tuned for your services (not left at the default 3.0 for all)
- Runbooks are tested in dry-run mode before enabling auto-execution
- Runbook approval flow is configured for destructive actions (pod restarts, scaling)
- Incident feedback loop is enabled so alert quality improves over time
Disaster Recovery
- ClickHouse backup restoration has been tested (not just scheduled)
- Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are documented
- Procedure for restarting the full InfraSage stack from scratch is documented and tested
- Kafka topic recreation steps are documented
Load Testing
- Ingestion Gateway has been load tested at ≥ 2× expected peak throughput
- Anomaly detection has been validated against known synthetic anomalies
- RCA has been triggered at least once end-to-end in a staging environment
- A runbook has been executed in staging (dry-run and live)
Go-Live Signoff
Run the built-in health check before going live:
curl http://infrasage-gateway.internal/health
curl http://infrasage-aiops.internal/health
curl http://infrasage-ml.internal/health
Expected response from each:
{ "status": "ok", "version": "x.y.z" }
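The three health checks above can be wrapped in a small script that fails if any endpoint is unreachable or reports a status other than "ok". This is a sketch only: the function names is_healthy and check_all and the 5-second timeout are assumptions, and the status match uses grep rather than a JSON parser to avoid extra dependencies.

```shell
# Hypothetical helper: succeed only if a /health response body reports status "ok".
is_healthy() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Query each endpoint from the checklist; return non-zero if any check fails.
check_all() {
  failed=0
  for url in \
    http://infrasage-gateway.internal/health \
    http://infrasage-aiops.internal/health \
    http://infrasage-ml.internal/health
  do
    # -f fails on HTTP errors, -sS stays quiet but still reports errors,
    # --max-time 5 is an assumed timeout.
    if body=$(curl -fsS --max-time 5 "$url") && is_healthy "$body"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling check_all as a gate in a deployment pipeline (for example, check_all || exit 1) turns the manual signoff into a repeatable step.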
Check Kafka consumer lag:
kubectl exec -n infrasage deploy/kafka -- \
  kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group telemetry-operator
Lag should be 0 or near-0 in a steady state.
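To turn the lag check into a single number, the LAG column of the describe output can be summed across partitions. The sketch below locates that column by its header name, because its position differs between Kafka versions; the helper name sum_lag is illustrative.

```shell
# Hypothetical helper: read `kafka-consumer-groups.sh --describe` output on stdin
# and print the total lag across all partitions.
sum_lag() {
  awk '
    # Until the header row is seen, look for the LAG column and skip the line.
    col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") col = i; next }
    # Data rows: add numeric lag values; "-" (no committed offset) is ignored.
    $col ~ /^[0-9]+$/ { total += $col }
    END { print total + 0 }
  '
}

# Example:
# kubectl exec -n infrasage deploy/kafka -- \
#   kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#   --describe --group telemetry-operator | sum_lag
```

A total that stays at or near 0 over repeated runs matches the steady-state expectation; a total that keeps growing suggests consumers are falling behind.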
See also: Troubleshooting · Scale Profiles · Security