Debugging

Tools and commands for diagnosing InfraSage issues.


Health Endpoints

Check the health of each service:

# Ingestion Gateway
curl http://localhost:8080/healthz

# Telemetry Operator
curl http://localhost:8081/healthz

# AIops Engine
curl http://localhost:8080/api/v1/status

A healthy response looks like:

{"status": "ok", "clickhouse": "ok", "kafka": "ok", "version": "1.2.0"}
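To check a response from a script rather than by eye, a small filter can list every component that is not reporting "ok". This is a minimal sketch that assumes the flat, single-level JSON shape shown above. The function name `unhealthy_components` is hypothetical, and `jq` would be a more robust parser where it is installed.

```shell
# Print the name of every component in a health response that is not "ok".
# Assumes the flat JSON shape shown above; the "version" field is skipped.
unhealthy_components() {
  # Split the object into one "key": "value" pair per line, then compare values.
  grep -oE '"[A-Za-z_]+"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed -E 's/"([A-Za-z_]+)"[[:space:]]*:[[:space:]]*"([^"]*)"/\1 \2/' \
    | while read -r key value; do
        [ "$key" = "version" ] && continue
        [ "$value" = "ok" ] || echo "$key"
      done
}

# Usage (prints nothing when every component is healthy):
#   curl -s http://localhost:8080/healthz | unhealthy_components
```

Empty output means all components reported "ok"; any printed name is the place to start looking.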

Service Logs

# Docker Compose
docker logs infrasage-ingestion-gateway -f
docker logs infrasage-telemetry-operator -f
docker logs infrasage-aiops-engine -f
docker logs infrasage-clickhouse -f

# Filter for errors only
docker logs infrasage-aiops-engine 2>&1 | grep -E "ERROR|FATAL"

# Kubernetes
kubectl logs -n infrasage deployment/ingestion-gateway -f
kubectl logs -n infrasage deployment/aiops-engine --tail=100
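When a log is long, a per-severity count gives a quicker overview than scrolling through the grep output. The sketch below assumes the level appears as an upper-case word (ERROR, FATAL, WARN), as the grep filter above implies. The function name `count_by_level` is hypothetical, and the actual InfraSage log format may differ.

```shell
# Count occurrences of each severity token read from stdin.
# Assumes levels are written as whole upper-case words in the log lines.
count_by_level() {
  grep -owE 'ERROR|FATAL|WARN' | sort | uniq -c | awk '{print $2 "=" $1}'
}

# Usage:
#   docker logs infrasage-aiops-engine 2>&1 | count_by_level
#   kubectl logs -n infrasage deployment/aiops-engine --tail=1000 | count_by_level
```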

Prometheus Metrics

Key metrics to query at http://localhost:9999:

# Ingestion rate
rate(infrasage_ingestion_events_total[1m])

# Validation failure rate
rate(infrasage_validation_failures_total[5m])

# DLQ depth (should be near 0)
infrasage_dlq_depth

# Kafka consumer lag
infrasage_kafka_consumer_lag_sum

# ClickHouse write latency (p99)
histogram_quantile(0.99, sum by (le) (rate(infrasage_clickhouse_write_duration_seconds_bucket[5m])))

# Anomaly detection rate
rate(infrasage_anomalies_detected_total[5m])

# API error rate
rate(infrasage_http_requests_total{status=~"5.."}[5m])
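The same queries can be run from a terminal through the Prometheus HTTP API's instant-query endpoint, `/api/v1/query`. This is a minimal sketch that assumes Prometheus is reachable at the address given above. The `PROM_URL` variable and the `prom_query` function are hypothetical names introduced for illustration.

```shell
# Base URL of the Prometheus server; override PROM_URL if yours differs.
PROM_URL="${PROM_URL:-http://localhost:9999}"

# Run an instant PromQL query and print the raw JSON response.
# -G sends the URL-encoded data as a query string on a GET request.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query 'infrasage_dlq_depth'
#   prom_query 'rate(infrasage_validation_failures_total[5m])'
```

Passing the expression through `--data-urlencode` avoids having to hand-escape brackets, braces, and quotes in the PromQL.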

ClickHouse Diagnostics

# Connect to ClickHouse
docker exec -it infrasage-clickhouse clickhouse-client \
--user infrasage --password infrasage-dev

# Check table sizes
SELECT table, formatReadableSize(total_bytes) AS size, total_rows
FROM system.tables
WHERE database = 'infrasage'
ORDER BY total_bytes DESC;

# Check recent ingestion
SELECT count(), max(timestamp) AS last_seen
FROM infrasage.infrasage_raw_firehose
WHERE timestamp > now() - INTERVAL 5 MINUTE;

# Check anomalies in last hour
SELECT service_id, metric_name, anomaly_score, timestamp
FROM infrasage.infrasage_anomalies
WHERE timestamp > now() - INTERVAL 1 HOUR
ORDER BY anomaly_score DESC
LIMIT 20;

# Check RCA results
SELECT service_id, root_cause_category, root_cause_confidence, analyzed_at
FROM infrasage.infrasage_rca_results
ORDER BY analyzed_at DESC
LIMIT 10;

# Check disk usage
SELECT formatReadableSize(free_space), formatReadableSize(total_space)
FROM system.disks;

Kafka / Redpanda Diagnostics

# List topics
docker exec infrasage-redpanda rpk topic list

# Check consumer group lag
docker exec infrasage-redpanda \
rpk group describe infrasage-ingestion-group

# Check topic details
docker exec infrasage-redpanda \
rpk topic describe raw-telemetry

# Monitor in real time (waits for the next 5 new messages)
docker exec infrasage-redpanda \
rpk topic consume raw-telemetry --num 5 --offset end

Dead-Letter Queue (DLQ)

# Check DLQ stats
curl http://localhost:8080/api/v1/debug/dlq-stats

# View DLQ contents in ClickHouse
docker exec infrasage-clickhouse clickhouse-client \
--user infrasage --password infrasage-dev \
--query "
SELECT failure_reason, raw_payload, received_at
FROM infrasage.infrasage_dlq
ORDER BY received_at DESC
LIMIT 10
"

# Fix and replay a DLQ record
# 1. Extract the record
# 2. Fix the invalid field
# 3. Re-POST to /api/v1/telemetry
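The three replay steps above can be sketched as a pair of helper functions with a manual edit in between. This is an illustration under stated assumptions, not the project's own tooling: it assumes `raw_payload` holds the original JSON body as a string, that the gateway listens on port 8080, and that the endpoint accepts `Content-Type: application/json`. The names `GATEWAY_URL`, `dlq_extract_latest`, and `dlq_repost` are hypothetical.

```shell
# Base URL of the Ingestion Gateway (assumed to be the 8080 service).
GATEWAY_URL="${GATEWAY_URL:-http://localhost:8080}"

# Step 1: write the most recent DLQ payload to the file given as $1.
# TabSeparatedRaw prints the string column without escaping.
dlq_extract_latest() {
  docker exec infrasage-clickhouse clickhouse-client \
    --user infrasage --password infrasage-dev \
    --query "SELECT raw_payload FROM infrasage.infrasage_dlq ORDER BY received_at DESC LIMIT 1 FORMAT TabSeparatedRaw" \
    > "$1"
}

# Step 3: re-POST the corrected file given as $1 and print the HTTP status code.
# The JSON content type is an assumption about the endpoint.
dlq_repost() {
  curl -sS -o /dev/null -w '%{http_code}\n' \
    -X POST -H 'Content-Type: application/json' \
    --data-binary "@$1" "${GATEWAY_URL}/api/v1/telemetry"
}

# Usage:
#   dlq_extract_latest record.json
#   "${EDITOR:-vi}" record.json    # Step 2: fix the invalid field by hand
#   dlq_repost record.json
```

Check `failure_reason` for the record before editing so the fix targets the field that actually failed validation.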

Network Connectivity

# Test ClickHouse connectivity
docker exec infrasage-ingestion-gateway \
wget -qO- http://clickhouse:8123/ping

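# Test Redpanda connectivity (the Kafka port does not speak HTTP, so expect a protocol error; "connection refused" or a timeout means the broker is unreachable)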
docker exec infrasage-ingestion-gateway \
wget -qO- http://redpanda:29092 2>&1 | head -5

# Test from Kubernetes
kubectl exec -n infrasage deployment/ingestion-gateway -- \
wget -qO- http://clickhouse:8123/ping

Enable Debug Logging

# For a running service (Docker Compose)
docker-compose exec ingestion-gateway \
kill -USR1 1 # If the service supports SIGUSR1 for log level change

# Or restart with debug level
LOG_LEVEL=debug docker-compose up -d ingestion-gateway

InfraSage CLI

The infrasage-cli binary provides operator-level debugging commands:

# Build the CLI
go build -o infrasage-cli ./cmd/infrasage-cli/

# Query recent anomalies
./infrasage-cli anomalies --since 1h --min-score 0.5

# Force a watchdog run
./infrasage-cli watchdog run --service payment-api

# Query ClickHouse directly
./infrasage-cli query "SELECT count() FROM infrasage_raw_firehose WHERE timestamp > now() - INTERVAL 5 MINUTE"

# View DLQ
./infrasage-cli dlq list --limit 20
./infrasage-cli dlq replay <record-id>

Getting Help

If you're still stuck:

  1. Check the Common Issues guide
  2. Search the GitHub Issues
  3. Open a new issue with:
    • InfraSage version (curl -s http://localhost:8080/healthz | jq -r .version)
    • Deployment type (Docker Compose or Kubernetes)
    • Relevant log output (redact any secrets)
    • Steps to reproduce
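Before pasting log output into an issue, it can be passed through a filter that masks obvious credentials. The sketch below is a rough first pass, assuming secrets appear as `key=value` or `key: value` pairs with the key names listed in the pattern. The function name `redact_secrets` and the key list are hypothetical, and the result still needs a manual review.

```shell
# Mask the value of common credential keys in text read from stdin.
# Only matches key=value and key: value pairs; other secret formats pass through.
redact_secrets() {
  sed -E 's/(password|PASSWORD|token|TOKEN|secret|SECRET|api_key|API_KEY)([=:][[:space:]]*)[^[:space:],;"]+/\1\2[REDACTED]/g'
}

# Usage:
#   docker logs infrasage-ingestion-gateway 2>&1 | redact_secrets > gateway.log
```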