Dashboards & Visualization
InfraSage ships with pre-configured Grafana dashboards and a React-based Admin UI for managing every aspect of the platform.
Grafana Dashboards
Access Grafana at http://localhost:3000 (default credentials: admin / admin).
System Health Dashboard
Overview of all InfraSage services:
- Ingestion Gateway: events received/sec, error rate, DLQ depth
- Telemetry Operator: Kafka consumer lag, ClickHouse write throughput
- AIOps Engine: anomaly detection rate, RCA runs/hour, runbook executions
- ClickHouse: disk usage, query latency, insert rate
Telemetry Ingestion Dashboard
Detailed ingestion metrics:
- Events/sec by service and type (metric, log, trace, event, profile, slo)
- Batch write latency percentiles (p50, p95, p99)
- Validation failure breakdown by reason
- Dead-letter queue depth over time
Anomaly Detection Dashboard
- Anomaly score heatmap by service and metric
- Anomaly rate timeline (total anomalies per hour)
- Top anomalous services ranked by score
- Z-score distribution per metric
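For readers unfamiliar with the z-score panel, the sketch below shows the textbook definition: how many standard deviations a sample sits from the mean of its series. This guide does not say how InfraSage computes its scores (window size, baseline, or estimator), so the function name and the sample data are illustrative assumptions only.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the z-score of each value relative to the whole series.

    Textbook definition only; not taken from InfraSage's implementation.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant series has no spread, so no point is anomalous.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical latency samples in milliseconds; the last one is an outlier.
latencies_ms = [100, 102, 98, 101, 99, 180]
scores = z_scores(latencies_ms)
```

In this made-up series the final sample gets the largest score by a wide margin, which is the kind of point a z-score distribution panel makes visible.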
RCA Results Dashboard
- RCA runs per hour
- Root cause category breakdown (infrastructure, application, external, unknown)
- Average confidence score over time
- Time-to-RCA distribution
- Blast radius size histogram
ML Model Performance Dashboard
- Model drift scores per service
- Prediction accuracy (when ground truth is available via resolution feedback)
- Feature importance rankings
- Shadow vs. production model comparison
Prometheus Metrics
Prometheus scrapes all InfraSage services every 15 seconds. Access at http://localhost:9999.
Key metrics to query
# Ingestion rate (events per second)
rate(infrasage_ingestion_events_total[1m])
# Anomaly detection rate
rate(infrasage_anomalies_detected_total[5m])
# RCA completion rate
rate(infrasage_rca_completed_total[1h])
# ClickHouse write latency (p99)
histogram_quantile(0.99, sum by (le) (rate(infrasage_clickhouse_write_duration_seconds_bucket[5m])))
# Kafka consumer lag
infrasage_kafka_consumer_lag_sum
# DLQ depth
infrasage_dlq_depth
# API request latency (p99)
histogram_quantile(0.99, sum by (le) (rate(infrasage_http_request_duration_seconds_bucket[5m])))
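The queries above can also be run from a script through the standard Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch using only the Python standard library. The base URL is the address given in this guide; the helper names are hypothetical and not part of InfraSage.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9999"  # address given in this guide


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def extract_values(result):
    """Map each instant-vector sample to a (labels, float value) pair."""
    return [(sample["metric"], float(sample["value"][1])) for sample in result]


def instant_query(base_url, expr, timeout=5.0):
    """Run an instant query and return (labels, value) pairs.

    Performs a network call, so it needs a reachable Prometheus server.
    """
    url = build_query_url(base_url, expr)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error')}")
    return extract_values(body["data"]["result"])
```

With the stack running, a call such as `instant_query(PROMETHEUS_URL, "infrasage_dlq_depth")` should return one pair per series, assuming the metric is being exported.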
Admin UI
The InfraSage Admin UI (React + TypeScript + Vite + TailwindCSS) is a full-featured control panel available at http://localhost:4000 when running with the UI service.
Pages
| Page | Description |
|---|---|
| Overview | Live system health — ingestion rate, anomaly count, active incidents |
| Telemetry Browser | Query and filter raw telemetry from ClickHouse |
| Anomaly Explorer | Browse detected anomalies with score, service, and metric filters |
| RCA Results | Full RCA outputs with confidence, evidence, and suggested actions |
| ML Models | Model list, performance metrics, train/promote actions |
| Runbooks | Create, edit, execute, and review runbook history |
| Integrations | Configure Slack, PagerDuty, Jira, Teams, Webhooks |
| Tenants | Manage tenants, billing plans, API keys, user roles |
| Audit Log | Browse all state-changing operations |
| RBAC | Configure user roles and permissions |
Alertmanager Integration
InfraSage listens for Prometheus Alertmanager webhooks at http://localhost:9093/api/v1/alerts/webhook. Configure your Alertmanager to forward alerts to InfraSage for AI-powered RCA:
# alertmanager.yml
receivers:
  - name: infrasage
    webhook_configs:
      - url: http://infrasage-aiops:9093/api/v1/alerts/webhook
        send_resolved: true
route:
  receiver: infrasage
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
When InfraSage receives an alert, it:
- Maps it to the corresponding service and metric in ClickHouse
- Triggers RCA if the cooldown period has elapsed
- Sends enriched analysis back through configured notification channels
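To exercise this flow without waiting for a real alert, you can post a synthetic payload in Alertmanager's webhook format (version 4) to the endpoint. The sketch below builds such a payload. The `service` and `metric` label names are assumptions about how InfraSage maps alerts to ClickHouse data, and the helper names are hypothetical; check your own alert labels before relying on them.

```python
import json
import urllib.request
from datetime import datetime, timezone

WEBHOOK_URL = "http://localhost:9093/api/v1/alerts/webhook"  # from this guide


def build_test_payload(service, metric, alertname="InfraSageTestAlert"):
    """Build a firing alert in Alertmanager's webhook format (version 4)."""
    # Label names 'service' and 'metric' are assumed, not confirmed by the docs.
    labels = {"alertname": alertname, "service": service, "metric": metric}
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "version": "4",
        "groupKey": f'{{}}:{{alertname="{alertname}"}}',
        "status": "firing",
        "receiver": "infrasage",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": {},
        "externalURL": "",
        "alerts": [
            {
                "status": "firing",
                "labels": labels,
                "annotations": {"summary": "synthetic alert for webhook testing"},
                "startsAt": now,
                "endsAt": "0001-01-01T00:00:00Z",  # zero time: not yet resolved
                "generatorURL": "",
            }
        ],
    }


def send_payload(payload, url=WEBHOOK_URL, timeout=5.0):
    """POST the payload as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

For example, `send_payload(build_test_payload("checkout", "latency_p99"))` sends one firing alert; the service and metric values here are placeholders.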
Adding Custom Grafana Dashboards
InfraSage Grafana is provisioned via deployments/grafana/provisioning/. To add a custom dashboard:
- Create your dashboard JSON in Grafana
- Export it via Share → Export → Save to file
- Place the JSON in deployments/grafana/provisioning/dashboards/
- Restart Grafana:
docker-compose restart grafana
The dashboard will persist across restarts via the provisioning volume.
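Before restarting Grafana, it can help to sanity-check the exported file. The sketch below is a hypothetical helper, not part of InfraSage: it verifies that the file is valid JSON with a title, warns when the export contains an `__inputs` section (typically present when a dashboard is exported for external sharing, where datasource placeholders may not resolve under file provisioning), and then copies the file into the provisioning directory named above.

```python
import json
import shutil
from pathlib import Path

PROVISIONING_DIR = Path("deployments/grafana/provisioning/dashboards")


def check_dashboard(path):
    """Return a list of problems found in an exported dashboard JSON file."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    if not data.get("title"):
        problems.append("missing 'title'")
    if "__inputs" in data:
        problems.append(
            "contains '__inputs'; datasource placeholders may not "
            "resolve under file provisioning"
        )
    return problems


def install_dashboard(path, dest_dir=PROVISIONING_DIR):
    """Copy a checked dashboard into the provisioning directory."""
    problems = check_dashboard(path)
    if problems:
        raise ValueError("; ".join(problems))
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(path, dest_dir / Path(path).name))
```

Run `install_dashboard("my-dashboard.json")` from the repository root (the file name is a placeholder), then restart Grafana as described above.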