Common Issues
Solutions to the most common problems when running InfraSage.
Services Won't Start
ClickHouse fails to start
Symptom: The `infrasage-clickhouse` container exits immediately.
Causes and fixes:
- Port 9000 already in use:

  ```bash
  lsof -i :9000
  # Kill the conflicting process or change the port in docker-compose.yml
  ```

- Insufficient disk space:

  ```bash
  df -h /var/lib/docker
  # Free up disk space; ClickHouse needs at least 10 GB
  ```

- SQL init scripts failed:

  ```bash
  docker logs infrasage-clickhouse | grep -i error
  # Check /scripts/sql/ for syntax errors
  ```
Ingestion Gateway can't connect to ClickHouse
Symptom: Gateway logs show `clickhouse: connection refused` or `no such host`.
Fix:
```bash
# Verify ClickHouse is healthy
docker exec infrasage-clickhouse clickhouse-client \
  --user infrasage --password infrasage-dev \
  --query "SELECT 1"

# Check network connectivity from the gateway container
docker exec infrasage-ingestion-gateway \
  wget -qO- http://clickhouse:8123/ping
```
The gateway waits for ClickHouse to be healthy before starting (via depends_on condition). If it still fails, increase the healthcheck timeout.
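If you need to tune that wait, the relevant docker-compose stanza looks roughly like the following. This is a sketch: the service names, healthcheck command, and timings are illustrative and should be matched against your own docker-compose.yml.

```yaml
services:
  clickhouse:
    healthcheck:
      test: ["CMD", "clickhouse-client", "--query", "SELECT 1"]
      interval: 5s
      timeout: 10s    # raise this if ClickHouse is slow to come up
      retries: 12
  ingestion-gateway:
    depends_on:
      clickhouse:
        condition: service_healthy
```

With `condition: service_healthy`, the gateway is not started until the healthcheck passes, so raising `timeout` or `retries` gives ClickHouse more time before Compose gives up.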
Redpanda topic not created
Symptom: Gateway logs show `topic 'raw-telemetry' does not exist`.
Fix:
```bash
# Manually create the topic
docker exec infrasage-redpanda \
  rpk topic create raw-telemetry --brokers localhost:29092 --partitions 3
```
No Data in ClickHouse
Symptom: You're sending telemetry but ClickHouse shows 0 rows.
Debugging steps:
- Verify the gateway is receiving requests:

  ```bash
  curl -s 'http://localhost:9999/api/v1/query?query=infrasage_ingestion_events_total' | jq
  # Should show non-zero counter
  ```

- Check if events are failing validation:

  ```bash
  curl http://localhost:8080/api/v1/debug/dlq-stats
  # If DLQ is growing, your records are failing validation
  ```

- Check Kafka consumer lag:

  ```bash
  docker exec infrasage-redpanda rpk group describe infrasage-ingestion-group
  # If LAG is growing, the Telemetry Operator is not consuming fast enough
  ```

- Check Operator logs:

  ```bash
  docker logs infrasage-telemetry-operator | grep -i error
  ```
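The first check above can be scripted so it is easy to repeat. A minimal sketch, assuming the standard Prometheus instant-query JSON response shape; `counter_value` is an illustrative helper, not part of InfraSage:

```python
import json

def counter_value(body: str) -> float:
    """Extract the first sample's value from a Prometheus
    instant-query JSON response; 0.0 if the result set is empty."""
    result = json.loads(body)["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Example response for infrasage_ingestion_events_total:
sample = ('{"status":"success","data":{"resultType":"vector","result":'
          '[{"metric":{"__name__":"infrasage_ingestion_events_total"},'
          '"value":[1700000000,"42"]}]}}')
print(counter_value(sample))  # 42.0
```

An empty `result` array (counter never incremented) returns 0.0, which is exactly the "gateway is not receiving requests" signal.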
Anomalies Not Being Detected
Symptom: You're ingesting data with extreme values but no anomalies appear.
Causes:
- Not enough historical data: the Watchdog needs at least 30 data points per metric to compute a baseline Z-score. Send data continuously for 5+ minutes.
- Z-score threshold too high: lower `WATCHDOG_Z_SCORE_THRESHOLD` from `3.0` to `2.5`.
- Watchdog interval is slow: the default is 60 seconds. Set `WATCHDOG_INTERVAL_SECONDS=10` for testing.
- RCA cooldown active: if an anomaly was detected for this metric recently, it won't be re-detected until `WATCHDOG_RCA_COOLDOWN_MINUTES` elapses (default: 15 min).
```bash
# Force anomaly for testing
curl -X POST http://localhost:8080/api/v1/telemetry \
  -H "Content-Type: application/json" \
  -d '{
    "service_id": "test-service",
    "metric_name": "cpu_usage_percent",
    "value": 9999.0,
    "timestamp": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
  }'
```
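The baseline logic described above can be sketched as follows. This mirrors the Z-score check conceptually and is not InfraSage's actual implementation; the defaults come from the settings mentioned above.

```python
import statistics

def is_anomalous(history, value, threshold=3.0, min_points=30):
    """Z-score check: flag `value` when it sits more than `threshold`
    standard deviations away from the mean of `history`."""
    if len(history) < min_points:
        return False  # no baseline yet: the Watchdog stays silent
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return False  # flat series: Z-score is undefined
    return abs(value - mean) / stdev > threshold
```

This also shows why extreme values produce no anomalies at first: with fewer than `min_points` samples, nothing is ever flagged, no matter how large the value is.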
RCA Not Triggering
Symptom: Anomalies appear in ClickHouse, but no RCA results.
Causes:
- `ANTHROPIC_API_KEY` not set: RCA requires a valid Anthropic API key.

  ```bash
  echo $ANTHROPIC_API_KEY
  # Should be non-empty and start with sk-ant-
  ```

- Anthropic API error: check the AIops Engine logs:

  ```bash
  docker logs infrasage-aiops-engine | grep -i "anthropic\|llm\|rca"
  ```

- RCA cooldown: verify no other RCA for the same service/metric ran recently.

  ```bash
  docker exec infrasage-clickhouse clickhouse-client \
    --user infrasage --password infrasage-dev \
    --query "SELECT analyzed_at FROM infrasage.infrasage_rca_results
             WHERE service_id = 'test-service' ORDER BY analyzed_at DESC LIMIT 1"
  ```
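The cooldown check amounts to simple timestamp arithmetic; a sketch, assuming UTC timestamps and the default from `WATCHDOG_RCA_COOLDOWN_MINUTES` (the function name is illustrative, not InfraSage's code):

```python
from datetime import datetime, timedelta, timezone

def cooldown_active(last_analyzed_at, cooldown_minutes=15, now=None):
    """True while the previous RCA for a service/metric pair is
    still within the cooldown window, so a new one is suppressed."""
    if last_analyzed_at is None:
        return False  # no prior RCA: nothing to suppress
    now = now or datetime.now(timezone.utc)
    return now - last_analyzed_at < timedelta(minutes=cooldown_minutes)
```

Compare the `analyzed_at` returned by the query above against the current time: if it is within the cooldown window, the missing RCA is expected behavior.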
Rate Limit Errors (HTTP 429)
Symptom: Ingestion returns `429 Too Many Requests`.
Causes:
- Per-key rate limit exceeded: increase `rate_limit_rps` on the API key, or create additional keys with shared rate limits.
- Tenant plan quota exceeded: check monthly event usage:

  ```bash
  curl http://localhost:8080/api/v1/usage -H "Authorization: Bearer $JWT"
  ```

  Upgrade your plan or wait for the next billing period.

- Sending too fast: implement exponential backoff in your ingestion client:

  ```python
  import time
  import requests

  for attempt in range(5):
      resp = requests.post(url, json=payload)
      if resp.status_code != 429:
          break
      time.sleep(2 ** attempt)
  ```
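A common refinement of the loop above (a general client-side pattern, not something InfraSage requires) is capped exponential backoff with full jitter, which keeps many throttled clients from retrying in lockstep:

```python
import random

def backoff_delays(max_attempts=5, base=1.0, cap=30.0):
    """Yield one delay per retry: uniform in [0, min(cap, base * 2**n)],
    so delays grow exponentially but never exceed `cap` seconds."""
    for attempt in range(max_attempts):
        yield random.uniform(0.0, min(cap, base * 2 ** attempt))

# Sleep for each yielded delay between attempts; the randomness
# spreads retries out instead of producing synchronized bursts.
```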
High Kafka Consumer Lag
Symptom: `rpk group describe` shows growing `LAG` for `infrasage-ingestion-group`.
Fixes:
- Increase `OPERATOR_WORKER_COUNT` (e.g., from `2` to `4`)
- Increase the Kafka partition count and scale Telemetry Operator replicas
- Check ClickHouse write performance:

  ```bash
  docker exec infrasage-clickhouse clickhouse-client \
    --query "SELECT * FROM system.metrics WHERE metric LIKE '%Write%'"
  ```
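For interpreting the numbers: lag is just the distance between the log-end offset and the committed offset, summed over partitions. A minimal sketch (the offset maps keyed by partition number are an assumed shape for illustration, not rpk's output format):

```python
def total_lag(log_end_offsets, committed_offsets):
    """Sum of per-partition lag; a partition with no committed
    offset counts as fully unconsumed."""
    return sum(
        max(0, end - committed_offsets.get(partition, 0))
        for partition, end in log_end_offsets.items()
    )
```

If producers advance the log-end offsets faster than the consumer advances its committed offsets, this sum grows without bound, which is exactly the growing `LAG` symptom above.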
Grafana Dashboards Show No Data
Symptom: Grafana dashboards are empty or show "No data".
Fixes:
- Prometheus data source not configured: in Grafana, go to Settings → Data Sources → Add Prometheus → URL: `http://prometheus:9090`
- Prometheus not scraping InfraSage: check `http://localhost:9999/targets`; all targets should be `UP`.
- Dashboards not provisioned: restart Grafana:

  ```bash
  docker-compose restart grafana
  ```
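Instead of clicking through the UI, the Prometheus data source can also be provisioned from a file. A sketch of what such a file (placed under Grafana's `provisioning/datasources/` directory) might contain, assuming the URL from the steps above:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Provisioned data sources survive container recreation, so the dashboard setup is reproducible rather than manual.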