Common Issues

Solutions to the most common problems when running InfraSage.


Services Won't Start

ClickHouse fails to start

Symptom: infrasage-clickhouse container exits immediately.

Causes and fixes:

  1. Port 9000 already in use

    lsof -i :9000
    # Kill the conflicting process or change the port in docker-compose.yml
  2. Insufficient disk space

    df -h /var/lib/docker
    # Free up disk space; ClickHouse needs at least 10 GB
  3. SQL init scripts failed

    docker logs infrasage-clickhouse | grep -i error
    # Check /scripts/sql/ for syntax errors
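For the port-conflict case, one option is to remap the host side of the port binding in docker-compose.yml. This is a sketch; the service name, image, and exact port list in your compose file may differ:

```yaml
services:
  clickhouse:
    container_name: infrasage-clickhouse
    ports:
      - "19000:9000"   # remap host port to 19000; container still listens on 9000
      - "8123:8123"    # HTTP interface unchanged
```

Clients on the host would then connect to port 19000, while containers on the compose network keep using clickhouse:9000.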

Ingestion Gateway can't connect to ClickHouse

Symptom: Gateway logs show clickhouse: connection refused or no such host.

Fix:

# Verify ClickHouse is healthy
docker exec infrasage-clickhouse clickhouse-client \
--user infrasage --password infrasage-dev \
--query "SELECT 1"

# Check network connectivity from gateway container
docker exec infrasage-ingestion-gateway \
wget -qO- http://clickhouse:8123/ping

The gateway waits for ClickHouse to be healthy before starting (via a depends_on: service_healthy condition). If it still fails, increase the healthcheck timeout in docker-compose.yml.
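The relevant compose wiring looks roughly like this. It is a sketch: the actual healthcheck command, intervals, and retry counts in your docker-compose.yml may differ.

```yaml
services:
  ingestion-gateway:
    depends_on:
      clickhouse:
        condition: service_healthy   # gateway waits for the healthcheck to pass

  clickhouse:
    healthcheck:
      test: ["CMD", "clickhouse-client", "--query", "SELECT 1"]
      interval: 5s
      timeout: 10s    # increase this if ClickHouse is slow to start
      retries: 12
```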

Redpanda topic not created

Symptom: Gateway logs show topic 'raw-telemetry' does not exist.

Fix:

# Manually create the topic
docker exec infrasage-redpanda \
rpk topic create raw-telemetry --brokers localhost:29092 --partitions 3

No Data in ClickHouse

Symptom: You're sending telemetry but ClickHouse shows 0 rows.

Debugging steps:

  1. Verify the gateway is receiving requests

    curl -s 'http://localhost:9999/api/v1/query?query=infrasage_ingestion_events_total' | jq
    # Should show non-zero counter
  2. Check if events are failing validation

    curl http://localhost:8080/api/v1/debug/dlq-stats
    # If DLQ is growing, your records are failing validation
  3. Check Kafka consumer lag

    docker exec infrasage-redpanda rpk group describe infrasage-ingestion-group
    # If LAG is growing, the Telemetry Operator is not consuming fast enough
  4. Check Operator logs

    docker logs infrasage-telemetry-operator | grep -i error

Anomalies Not Being Detected

Symptom: You're ingesting data with extreme values but no anomalies appear.

Causes:

  1. Not enough historical data — the Watchdog needs at least 30 data points per metric to compute a baseline Z-score. Send data continuously for 5+ minutes.

  2. Z-score threshold too high — lower WATCHDOG_Z_SCORE_THRESHOLD from 3.0 to 2.5.

  3. Watchdog interval is slow — the default is 60 seconds. Reduce WATCHDOG_INTERVAL_SECONDS=10 for testing.

  4. RCA cooldown active — if an anomaly was detected for this metric recently, it won't be re-detected until WATCHDOG_RCA_COOLDOWN_MINUTES elapses (default: 15 min).

# Force anomaly for testing
curl -X POST http://localhost:8080/api/v1/telemetry \
-H "Content-Type: application/json" \
-d '{
"service_id": "test-service",
"metric_name": "cpu_usage_percent",
"value": 9999.0,
"timestamp": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
}'
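The baseline logic described above can be sketched in Python. This is an illustration of Z-score detection, not the Watchdog's actual implementation; the 30-point minimum and 3.0 threshold mirror the defaults mentioned above.

```python
import statistics

MIN_POINTS = 30    # Watchdog needs at least 30 data points for a baseline
Z_THRESHOLD = 3.0  # default WATCHDOG_Z_SCORE_THRESHOLD

def is_anomalous(history, value, threshold=Z_THRESHOLD):
    """Return True if value deviates from the historical baseline."""
    if len(history) < MIN_POINTS:
        return False  # not enough data to compute a baseline
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean  # flat baseline: any change is anomalous
    return abs(value - mean) / stdev > threshold

# A baseline hovering around 50% CPU flags the extreme test value above:
baseline = [50.0 + (i % 3) for i in range(30)]
print(is_anomalous(baseline, 51.0))    # False: within normal variation
print(is_anomalous(baseline, 9999.0))  # True: far outside the baseline
```

Note how the sketch also explains cause 1: with fewer than 30 points it returns False unconditionally, which is why extreme values produce no anomalies until enough history has accumulated.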

RCA Not Triggering

Symptom: Anomalies appear in ClickHouse, but no RCA results.

Causes:

  1. ANTHROPIC_API_KEY not set — RCA requires a valid Anthropic API key.

    echo $ANTHROPIC_API_KEY
    # Should be non-empty and start with sk-ant-
  2. Anthropic API error — check AIops Engine logs:

    docker logs infrasage-aiops-engine | grep -i "anthropic\|llm\|rca"
  3. RCA cooldown — verify no other RCA for the same service/metric ran recently.

    docker exec infrasage-clickhouse clickhouse-client \
    --user infrasage --password infrasage-dev \
    --query "SELECT analyzed_at FROM infrasage.infrasage_rca_results
    WHERE service_id = 'test-service' ORDER BY analyzed_at DESC LIMIT 1"

Rate Limit Errors (HTTP 429)

Symptom: Ingestion returns 429 Too Many Requests.

Causes:

  1. Per-key rate limit exceeded — increase rate_limit_rps on the API key, or create additional keys with shared rate limits.

  2. Tenant plan quota exceeded — check monthly event usage:

    curl http://localhost:8080/api/v1/usage -H "Authorization: Bearer $JWT"

    Upgrade your plan or wait for the next billing period.

  3. Sending too fast — implement exponential backoff in your ingestion client:

    import time

    import requests

    # Retry on 429, doubling the wait each attempt (1s, 2s, 4s, 8s)
    for attempt in range(5):
        resp = requests.post(url, json=payload)
        if resp.status_code != 429:
            break
        time.sleep(2 ** attempt)

High Kafka Consumer Lag

Symptom: rpk group describe shows growing LAG for infrasage-ingestion-group.

Fixes:

  1. Increase OPERATOR_WORKER_COUNT (e.g., from 2 to 4)
  2. Increase Kafka partition count and scale Telemetry Operator replicas
  3. Check ClickHouse write performance:
    docker exec infrasage-clickhouse clickhouse-client \
    --user infrasage --password infrasage-dev \
    --query "SELECT * FROM system.metrics WHERE metric LIKE '%Write%'"
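In compose terms, the first two fixes might look like the following. OPERATOR_WORKER_COUNT comes from the doc; the service layout and replica mechanism are assumptions about your setup:

```yaml
services:
  telemetry-operator:
    environment:
      - OPERATOR_WORKER_COUNT=4   # up from the default of 2
    deploy:
      replicas: 2                 # scale out alongside a higher partition count
```

Keep the topic's partition count at least as high as the total number of consumers, since partitions are the unit of parallelism in Kafka consumer groups.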

Grafana Dashboards Show No Data

Symptom: Grafana dashboards are empty or show "No data".

Fixes:

  1. Prometheus data source not configured — In Grafana: Settings → Data Sources → Add Prometheus → URL: http://prometheus:9090

  2. Prometheus not scraping InfraSage — Check http://localhost:9999/targets — all targets should be UP.

  3. Dashboards not provisioned — Restart Grafana: docker-compose restart grafana
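If you provision the Prometheus data source from a file rather than the UI, Grafana's provisioning format looks like this. The file path is an assumption about your setup; Grafana reads any YAML file under /etc/grafana/provisioning/datasources/ inside the container:

```yaml
# e.g. provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Restart Grafana after adding or changing provisioning files so the data source is picked up.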