Common Issues

Solutions to the most common problems when running InfraSage.


Services Won't Start

ClickHouse fails to start

Symptom: infrasage-clickhouse container exits immediately.

Causes and fixes:

  1. Port 9000 already in use

    lsof -i :9000
    # Kill the conflicting process or change the port in docker-compose.yml
  2. Insufficient disk space

    df -h /var/lib/docker
    # Free up disk space; ClickHouse needs at least 10 GB
  3. SQL init scripts failed

    docker logs infrasage-clickhouse | grep -i error
    # Check /scripts/sql/ for syntax errors
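For the port-conflict case, one option is to remap the host side of the port binding in docker-compose.yml. This is a sketch; the service name, image, and exact port list in your compose file may differ:

```yaml
services:
  clickhouse:
    container_name: infrasage-clickhouse
    ports:
      - "19000:9000"   # remap host port to 19000; container still listens on 9000
      - "8123:8123"    # HTTP interface unchanged
```

Clients on the host would then connect to port 19000, while containers on the compose network keep using clickhouse:9000.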

Ingestion Gateway can't connect to ClickHouse

Symptom: Gateway logs show clickhouse: connection refused or no such host.

Fix:

# Verify ClickHouse is healthy
docker exec infrasage-clickhouse clickhouse-client \
--user infrasage --password infrasage-dev \
--query "SELECT 1"

# Check network connectivity from gateway container
docker exec infrasage-ingestion-gateway \
wget -qO- http://clickhouse:8123/ping

The gateway waits for ClickHouse to be healthy before starting (via a depends_on: service_healthy condition). If it still fails, increase the healthcheck timeout in docker-compose.yml.
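The relevant compose wiring looks roughly like this. It is a sketch: the actual healthcheck command, intervals, and retry counts in your docker-compose.yml may differ.

```yaml
services:
  ingestion-gateway:
    depends_on:
      clickhouse:
        condition: service_healthy   # gateway waits for the healthcheck to pass

  clickhouse:
    healthcheck:
      test: ["CMD", "clickhouse-client", "--query", "SELECT 1"]
      interval: 5s
      timeout: 10s    # increase this if ClickHouse is slow to start
      retries: 12
```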

Redpanda topic not created

Symptom: Gateway logs show topic 'raw-telemetry' does not exist.

Fix:

# Manually create the topic
docker exec infrasage-redpanda \
rpk topic create raw-telemetry --brokers localhost:29092 --partitions 3

No Data in ClickHouse

Symptom: You're sending telemetry but ClickHouse shows 0 rows.

Debugging steps:

  1. Verify the gateway is receiving requests

    curl -s 'http://localhost:9999/api/v1/query?query=infrasage_ingestion_events_total' | jq
    # Should show non-zero counter
  2. Check if events are failing validation

    curl http://localhost:8080/api/v1/debug/dlq-stats
    # If DLQ is growing, your records are failing validation
  3. Check Kafka consumer lag

    docker exec infrasage-redpanda rpk group describe infrasage-ingestion-group
    # If LAG is growing, the Telemetry Operator is not consuming fast enough
  4. Check Operator logs

    docker logs infrasage-telemetry-operator | grep -i error

Anomalies Not Being Detected

Symptom: You're ingesting data with extreme values but no anomalies appear.

Causes:

  1. Not enough historical data — the Watchdog needs at least 30 data points per metric to compute a baseline Z-score. Send data continuously for 5+ minutes.

  2. Z-score threshold too high — lower WATCHDOG_Z_SCORE_THRESHOLD from 3.0 to 2.5.

  3. Watchdog interval is slow — the default is 60 seconds. Reduce WATCHDOG_INTERVAL_SECONDS=10 for testing.

  4. RCA cooldown active — if an anomaly was detected for this metric recently, it won't be re-detected until WATCHDOG_RCA_COOLDOWN_MINUTES elapses (default: 15 min).

# Force anomaly for testing
curl -X POST http://localhost:8080/api/v1/telemetry \
-H "Content-Type: application/json" \
-d '{
"service_id": "test-service",
"metric_name": "cpu_usage_percent",
"value": 9999.0,
"timestamp": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
}'
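The baseline logic described above can be sketched in Python. This is an illustration of Z-score detection, not the Watchdog's actual implementation; the 30-point minimum and 3.0 threshold mirror the defaults mentioned above.

```python
import statistics

MIN_POINTS = 30    # Watchdog needs at least 30 data points for a baseline
Z_THRESHOLD = 3.0  # default WATCHDOG_Z_SCORE_THRESHOLD

def is_anomalous(history, value, threshold=Z_THRESHOLD):
    """Return True if value deviates from the historical baseline."""
    if len(history) < MIN_POINTS:
        return False  # not enough data to compute a baseline
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean  # flat baseline: any change is anomalous
    return abs(value - mean) / stdev > threshold

# A baseline hovering around 50% CPU flags the extreme test value above:
baseline = [50.0 + (i % 3) for i in range(30)]
print(is_anomalous(baseline, 51.0))    # False: within normal variation
print(is_anomalous(baseline, 9999.0))  # True: far outside the baseline
```

Note how the sketch also explains cause 1: with fewer than 30 points it returns False unconditionally, which is why extreme values produce no anomalies until enough history has accumulated.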

RCA Not Triggering

Symptom: Anomalies appear in ClickHouse, but no RCA results.

Causes:

  1. ANTHROPIC_API_KEY not set — RCA requires a valid Anthropic API key.

    echo $ANTHROPIC_API_KEY
    # Should be non-empty and start with sk-ant-
  2. Anthropic API error — check AIops Engine logs:

    docker logs infrasage-aiops-engine | grep -i "anthropic\|llm\|rca"
  3. RCA cooldown — verify no other RCA for the same service/metric ran recently.

    docker exec infrasage-clickhouse clickhouse-client \
    --user infrasage --password infrasage-dev \
    --query "SELECT analyzed_at FROM infrasage.infrasage_rca_results
    WHERE service_id = 'test-service' ORDER BY analyzed_at DESC LIMIT 1"

Rate Limit Errors (HTTP 429)

Symptom: Ingestion returns 429 Too Many Requests.

Causes:

  1. Per-key rate limit exceeded — increase rate_limit_rps on the API key, or create additional keys with shared rate limits.

  2. Tenant plan quota exceeded — check monthly event usage:

    curl http://localhost:8080/api/v1/usage -H "Authorization: Bearer $JWT"

    Upgrade your plan or wait for the next billing period.

  3. Sending too fast — implement exponential backoff in your ingestion client:

    import time

    import requests

    # Retry on 429, doubling the wait each attempt (1s, 2s, 4s, 8s)
    for attempt in range(5):
        resp = requests.post(url, json=payload)
        if resp.status_code != 429:
            break
        time.sleep(2 ** attempt)

High Kafka Consumer Lag

Symptom: rpk group describe shows growing LAG for infrasage-ingestion-group.

Fixes:

  1. Increase OPERATOR_WORKER_COUNT (e.g., from 2 to 4)
  2. Increase Kafka partition count and scale Telemetry Operator replicas
  3. Check ClickHouse write performance:
    docker exec infrasage-clickhouse clickhouse-client \
    --user infrasage --password infrasage-dev \
    --query "SELECT * FROM system.metrics WHERE metric LIKE '%Write%'"
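In compose terms, the first two fixes might look like the following. OPERATOR_WORKER_COUNT comes from the doc; the service layout and replica mechanism are assumptions about your setup:

```yaml
services:
  telemetry-operator:
    environment:
      - OPERATOR_WORKER_COUNT=4   # up from the default of 2
    deploy:
      replicas: 2                 # scale out alongside a higher partition count
```

Keep the topic's partition count at least as high as the total number of consumers, since partitions are the unit of parallelism in Kafka consumer groups.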

Grafana Dashboards Show No Data

Symptom: Grafana dashboards are empty or show "No data".

Fixes:

  1. Prometheus data source not configured — In Grafana: Settings → Data Sources → Add Prometheus → URL: http://prometheus:9090

  2. Prometheus not scraping InfraSage — Check http://localhost:9999/targets — all targets should be UP.

  3. Dashboards not provisioned — Restart Grafana: docker-compose restart grafana
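If you provision the Prometheus data source from a file rather than the UI, Grafana's provisioning format looks like this. The file path is an assumption about your setup; Grafana reads any YAML file under /etc/grafana/provisioning/datasources/ inside the container:

```yaml
# e.g. provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Restart Grafana after adding or changing provisioning files so the data source is picked up.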