Production Readiness Checklist

Use this checklist before going live. Items are grouped by category; each links to the relevant documentation.

Infrastructure

ClickHouse is deployed with persistent volumes (not emptyDir)
ClickHouse data directory is backed up on a schedule (daily minimum)
Kafka is deployed with at least 3 brokers and replication factor ≥ 2
All InfraSage pods have resource requests and limits set
PodDisruptionBudgets are configured for ClickHouse and Kafka
Node affinity or taints prevent InfraSage pods from competing with your workloads for resources
Persistent volume claims use a storage class with WaitForFirstConsumer binding

Secrets & Credentials

JWT_SECRET is a randomly generated 256-bit value (not a default or example value)
ClickHouse credentials are not the defaults (clickhouse / clickhouse)
API keys are rotated from any test keys used during onboarding
Secrets are managed via Kubernetes Secrets (not hardcoded in ConfigMaps or Helm values files committed to git)
ANTHROPIC_API_KEY is stored as a Secret and referenced from the Deployment, not set as a plaintext environment variable value in the Deployment spec
Secret rotation schedule is documented
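
One way to produce a 256-bit value for JWT_SECRET is sketched below. It assumes InfraSage accepts a hex-encoded string, which this checklist does not confirm. The function names are illustrative, not part of InfraSage.

```shell
# Sketch: produce a 256-bit (32-byte) random value as 64 lowercase hex characters.
# Reads from /dev/urandom; `openssl rand -hex 32` is an equivalent alternative.
generate_jwt_secret() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}

# Accept only strings that are exactly 64 lowercase hex characters.
# This checks length and character set only, not randomness.
is_256bit_hex() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
  [ "${#1}" -eq 64 ]
}
```

The generated value can then be stored with kubectl create secret generic, for example under a Secret named infrasage-jwt (a hypothetical name) in the infrasage namespace, using --from-literal=JWT_SECRET="$(generate_jwt_secret)".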

Network & TLS

mTLS is enabled between InfraSage services (MTLS_ENABLED=true)
Ingestion Gateway is behind a load balancer or ingress with TLS termination
Admin UI is not exposed to the public internet (VPN or internal ingress only)
NetworkPolicies restrict inter-namespace traffic
Outbound egress is restricted to required endpoints only (see Data Residency)

Authentication & Authorization

Default tenant is not used in production (create named tenants)
Every service ingesting telemetry uses a dedicated API key with ingestion scope only
Human users are assigned the minimum required RBAC role (Viewer / Operator, not Admin)
Super-Admin role is assigned to no more than 2 accounts
API key expiration is configured for all non-system keys
Audit logging is enabled and audit log retention is ≥ 90 days

Scaling & Capacity

Scale profile matches expected event volume (small / medium / large) — see Scale Profiles
ClickHouse storage is provisioned at ≥ 2× current data volume (for headroom)
Kafka retention is set to at least 24 hours (covers outage + replay)
Ingestion Gateway HPA is configured with a CPU or requests-per-second (RPS) target
AIops Engine replica count ≥ 2 for HA
ClickHouse memory limit is at least 4× the largest expected query result set
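
The multipliers above can be turned into concrete minimums with a few lines of arithmetic. This is a minimal sketch: the input values are hypothetical placeholders, and the retention conversion assumes retention is set through Kafka's retention.ms topic setting, which takes milliseconds.

```shell
# Minimum ClickHouse storage to provision: 2x the current data volume (GiB).
min_storage_gib() {
  echo $(( $1 * 2 ))
}

# Minimum ClickHouse memory limit: 4x the largest expected query result set (GiB).
min_memory_gib() {
  echo $(( $1 * 4 ))
}

# Convert a retention period in hours to milliseconds,
# the unit used by Kafka's retention.ms topic setting.
hours_to_ms() {
  echo $(( $1 * 60 * 60 * 1000 ))
}

# Example inputs (hypothetical values; substitute your own measurements).
current_data_gib=500
largest_result_gib=8

echo "storage >= $(min_storage_gib "$current_data_gib") GiB"
echo "memory  >= $(min_memory_gib "$largest_result_gib") GiB"
echo "retention.ms >= $(hours_to_ms 24)"
```

With these sample inputs the sketch reports 1000 GiB of storage, a 32 GiB memory limit, and a retention.ms of 86400000.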

Retention & Compliance

Retention policy is configured per telemetry type (see Retention Policy)
Retention matches your compliance obligation (e.g., 1 year for financial services audit trails)
Log field exclusions are configured to strip PII before storage (LOG_FIELD_EXCLUSIONS)
Data residency requirements are satisfied — all ClickHouse volumes are in the correct region

Monitoring InfraSage Itself

Prometheus is scraping InfraSage's /metrics endpoints
Grafana dashboards are imported (System Health, Ingestion, Anomaly Detection)
Alertmanager rules are configured for:
- Kafka consumer lag > threshold
- Ingestion Gateway error rate > 1%
- ClickHouse disk usage > 75%
- AIops Engine unhealthy
On-call rotation includes InfraSage health as a monitored signal

Alerting & Incident Response

At least one notification channel is configured (Slack, PagerDuty, or webhook)
Anomaly detection sensitivity is tuned per service (not left at the default of 3.0 for every service)
Runbooks are tested in dry-run mode before enabling auto-execution
Runbook approval flow is configured for destructive actions (pod restarts, scaling)
Incident feedback loop is enabled so alert quality improves over time

Disaster Recovery

ClickHouse backup restoration has been tested (not just scheduled)
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are documented
Procedure for restarting the full InfraSage stack from scratch is documented and tested
Kafka topic recreation steps are documented

Load Testing

Ingestion Gateway has been load tested at ≥ 2× expected peak throughput
Anomaly detection has been validated against known synthetic anomalies
RCA has been triggered at least once end-to-end in a staging environment
A runbook has been executed in staging (dry-run and live)

Go-Live Signoff

Run the built-in health check against each service before going live:

curl http://infrasage-gateway.internal/health
curl http://infrasage-aiops.internal/health
curl http://infrasage-ml.internal/health

Expected response from each:

{ "status": "ok", "version": "x.y.z" }
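
To avoid inspecting each response by eye, the three checks can be wrapped in a small script. The following is a sketch only: is_healthy and check_all are illustrative helper names, and the check assumes the response body has the shape shown above.

```shell
# Return 0 only if the JSON body on stdin reports "status": "ok".
# A grep-based check keeps the sketch dependency-free; jq would be more robust.
is_healthy() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Probe every host given as an argument; return non-zero if any check fails.
check_all() {
  failed=0
  for host in "$@"; do
    # -f makes curl fail on HTTP errors, so an error page is never parsed as healthy.
    if curl -fsS --max-time 5 "http://${host}/health" | is_healthy; then
      echo "OK   ${host}"
    else
      echo "FAIL ${host}"
      failed=1
    fi
  done
  return "$failed"
}
```

Usage: check_all infrasage-gateway.internal infrasage-aiops.internal infrasage-ml.internal. A non-zero exit status means at least one service did not report ok.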

Check Kafka consumer lag:

kubectl exec -n infrasage deploy/kafka -- \
  kafka-consumer-groups.sh \
    --bootstrap-server localhost:9092 \
    --describe \
    --group telemetry-operator

Lag should be at or near 0 in steady state.
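
The per-partition output can be reduced to a single number for a signoff script. This sketch assumes the --describe output includes a header row with a LAG column, as current Kafka releases print. total_lag is an illustrative helper name.

```shell
# Sum the LAG column of `kafka-consumer-groups.sh --describe` output read from stdin.
# Exits non-zero if no header row containing LAG is found.
total_lag() {
  awk '
    # Locate the LAG column from the header row; its position varies by Kafka version.
    lag_col == 0 {
      for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i
      next
    }
    # Sum numeric lag values; rows showing "-" (no committed offset) are skipped.
    $lag_col ~ /^[0-9]+$/ { sum += $lag_col }
    END {
      if (!lag_col) exit 1
      print sum + 0
    }
  '
}
```

Usage: pipe the kubectl exec command above into total_lag and compare the printed total against your own threshold.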

See also: Troubleshooting · Scale Profiles · Security
