Debugging
Tools and commands for diagnosing InfraSage issues.
Health Endpoints
Check the health of each service:
# Ingestion Gateway
curl http://localhost:8080/healthz
# Telemetry Operator
curl http://localhost:8081/healthz
# AIops Engine
curl http://localhost:8080/api/v1/status
A healthy response looks like:
{"status": "ok", "clickhouse": "ok", "kafka": "ok", "version": "1.2.0"}
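To spot a degraded dependency without reading the JSON by eye, the response can be piped through a small filter. This is a minimal sketch, assuming every health endpoint returns a flat JSON object shaped like the sample above; the function name `unhealthy_components` is hypothetical and not part of InfraSage. It uses only grep and sed, so it works in containers that lack jq.

```shell
# Hypothetical helper: read a flat health JSON on stdin and print every
# key whose value is not "ok". The "version" key is skipped because it
# carries a version string, not a status.
unhealthy_components() {
  grep -oE '"[a-z_]+": *"[^"]*"' \
    | sed -E 's/"([a-z_]+)": *"([^"]*)"/\1=\2/' \
    | grep -v '^version=' \
    | grep -v '=ok$' || true   # no output means everything reported "ok"
}

# Usage (assumes the gateway listens on localhost:8080 as shown above):
#   curl -s http://localhost:8080/healthz | unhealthy_components
```

An empty result means all reported components are healthy; otherwise each problem is printed as `component=state`, one per line.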
Service Logs
# Docker Compose
docker logs infrasage-ingestion-gateway -f
docker logs infrasage-telemetry-operator -f
docker logs infrasage-aiops-engine -f
docker logs infrasage-clickhouse -f
# Filter for errors only
docker logs infrasage-aiops-engine 2>&1 | grep -E "ERROR|FATAL"
# Kubernetes
kubectl logs -n infrasage deployment/ingestion-gateway -f
kubectl logs -n infrasage deployment/aiops-engine --tail=100
Prometheus Metrics
Key metrics to query at http://localhost:9999:
# Ingestion rate
rate(infrasage_ingestion_events_total[1m])
# Validation failure rate
rate(infrasage_validation_failures_total[5m])
# DLQ depth (should be near 0)
infrasage_dlq_depth
# Kafka consumer lag
infrasage_kafka_consumer_lag_sum
# ClickHouse write latency (p99)
histogram_quantile(0.99, sum by (le) (rate(infrasage_clickhouse_write_duration_seconds_bucket[5m])))
# Anomaly detection rate
rate(infrasage_anomalies_detected_total[5m])
# API error rate
rate(infrasage_http_requests_total{status=~"5.."}[5m])
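The same expressions can be run from a terminal through the Prometheus HTTP API (`/api/v1/query`), which is handy when no browser is available. A minimal sketch, assuming Prometheus is reachable at the address given above; the function name `prom_query` and the `PROM_URL` variable are hypothetical.

```shell
# Base URL of Prometheus; defaults to the address used in this guide.
PROM_URL="${PROM_URL:-http://localhost:9999}"

# Hypothetical helper: run one instant query and print the raw JSON reply.
# -G sends the data as a GET query string, and --data-urlencode escapes
# the PromQL expression (braces, quotes, brackets) safely.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query 'infrasage_dlq_depth'
#   prom_query 'rate(infrasage_validation_failures_total[5m])' | jq '.data.result'
```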
ClickHouse Diagnostics
# Connect to ClickHouse
docker exec -it infrasage-clickhouse clickhouse-client \
--user infrasage --password infrasage-dev
# Check table sizes
SELECT table, formatReadableSize(total_bytes) AS size, total_rows
FROM system.tables
WHERE database = 'infrasage'
ORDER BY total_bytes DESC;
# Check recent ingestion
SELECT count(), max(timestamp) AS last_seen
FROM infrasage.infrasage_raw_firehose
WHERE timestamp > now() - INTERVAL 5 MINUTE;
# Check anomalies in last hour
SELECT service_id, metric_name, anomaly_score, timestamp
FROM infrasage.infrasage_anomalies
WHERE timestamp > now() - INTERVAL 1 HOUR
ORDER BY anomaly_score DESC
LIMIT 20;
# Check RCA results
SELECT service_id, root_cause_category, root_cause_confidence, analyzed_at
FROM infrasage.infrasage_rca_results
ORDER BY analyzed_at DESC
LIMIT 10;
# Check disk usage
SELECT formatReadableSize(free_space), formatReadableSize(total_space)
FROM system.disks;
Kafka / Redpanda Diagnostics
# List topics
docker exec infrasage-redpanda rpk topic list
# Check consumer group lag
docker exec infrasage-redpanda \
rpk group describe infrasage-ingestion-group
# Check topic details
docker exec infrasage-redpanda \
rpk topic describe raw-telemetry
# Monitor in real time (starts at the end of the topic and exits after 5 new messages)
docker exec infrasage-redpanda \
rpk topic consume raw-telemetry --num 5 --offset end
Dead-Letter Queue (DLQ)
# Check DLQ stats
curl http://localhost:8080/api/v1/debug/dlq-stats
# View DLQ contents in ClickHouse
docker exec infrasage-clickhouse clickhouse-client \
--user infrasage --password infrasage-dev \
--query "
SELECT failure_reason, raw_payload, received_at
FROM infrasage.infrasage_dlq
ORDER BY received_at DESC
LIMIT 10
"
# Fix and replay a DLQ record
# 1. Extract the record
# 2. Fix the invalid field
# 3. Re-POST to /api/v1/telemetry
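The three steps above can be sketched as a short script. This is an illustration under stated assumptions, not the project's own tooling: it assumes `raw_payload` holds the original JSON body, that the gateway accepts `application/json` on `/api/v1/telemetry` at localhost:8080, and the names `replay_payload` and `GATEWAY_URL` are hypothetical.

```shell
GATEWAY_URL="${GATEWAY_URL:-http://localhost:8080}"

# Step 1 (run manually): save the most recent failed payload to a file.
#   docker exec infrasage-clickhouse clickhouse-client \
#     --user infrasage --password infrasage-dev --format TabSeparatedRaw \
#     --query "SELECT raw_payload FROM infrasage.infrasage_dlq
#              ORDER BY received_at DESC LIMIT 1" > record.json
#
# Step 2 (run manually): edit record.json and correct the field named
# in failure_reason.

# Step 3: re-POST the corrected file. Refuses to send an empty or
# missing file so a failed extraction is not replayed by accident.
replay_payload() {
  [ -s "$1" ] || { echo "payload file is missing or empty: $1" >&2; return 1; }
  curl -sS -X POST "${GATEWAY_URL}/api/v1/telemetry" \
    -H 'Content-Type: application/json' \
    --data-binary "@$1"
}

# Usage:
#   replay_payload record.json
```

After replaying, re-run the DLQ query to confirm the record was accepted and did not land in the queue again.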
Network Connectivity
# Test ClickHouse connectivity
docker exec infrasage-ingestion-gateway \
wget -qO- http://clickhouse:8123/ping
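# Test Redpanda connectivity (29092 is presumably the Kafka listener, which does
# not speak HTTP: a refused connection or a timeout indicates a network problem,
# while any other error still shows that the port is reachable)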
docker exec infrasage-ingestion-gateway \
wget -qO- http://redpanda:29092 2>&1 | head -5
# Test from Kubernetes
kubectl exec -n infrasage deployment/ingestion-gateway -- \
wget -qO- http://clickhouse:8123/ping
Enable Debug Logging
# For a running service (Docker Compose)
docker-compose exec ingestion-gateway \
kill -USR1 1 # If the service supports SIGUSR1 for log level change
# Or restart with debug level (takes effect only if the Compose file passes LOG_LEVEL to the container)
LOG_LEVEL=debug docker-compose up -d ingestion-gateway
InfraSage CLI
The infrasage-cli binary provides operator-level debugging commands:
# Build the CLI
go build -o infrasage-cli ./cmd/infrasage-cli/
# Query recent anomalies
./infrasage-cli anomalies --since 1h --min-score 0.5
# Force a watchdog run
./infrasage-cli watchdog run --service payment-api
# Query ClickHouse directly
./infrasage-cli query "SELECT count() FROM infrasage_raw_firehose WHERE timestamp > now() - INTERVAL 5 MINUTE"
# View DLQ
./infrasage-cli dlq list --limit 20
./infrasage-cli dlq replay <record-id>
Getting Help
If you're still stuck:
- Check the Common Issues guide
- Search the GitHub Issues
- Open a new issue with:
  - InfraSage version (curl http://localhost:8080/healthz | jq .version)
  - Docker Compose or Kubernetes?
  - Relevant log output (redact any secrets)
  - Steps to reproduce
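Redacting log output by hand is easy to get wrong, so a filter can do a first pass. This is a minimal sketch: the function name `redact_secrets` is hypothetical, and the key list is illustrative rather than exhaustive. It only masks `key=value` and `key: value` forms, so quoted JSON values and anything else sensitive still need a manual review before posting.

```shell
# Hypothetical helper: mask the value that follows common secret-like keys.
# Matches "password=...", "TOKEN: ..." and similar; the value ends at
# whitespace, a comma, a semicolon, an ampersand, or a double quote.
redact_secrets() {
  sed -E 's/((password|PASSWORD|secret|SECRET|token|TOKEN|api_key|API_KEY)[[:space:]]*[=:][[:space:]]*)[^[:space:],;&"]+/\1[REDACTED]/g'
}

# Usage:
#   docker logs infrasage-aiops-engine 2>&1 | tail -n 200 | redact_secrets > aiops-engine.log
```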