Dashboards & Visualization

InfraSage ships with pre-configured Grafana dashboards and a React-based Admin UI for managing every aspect of the platform.


Grafana Dashboards

Access Grafana at http://localhost:3000 (default credentials: admin / admin).

System Health Dashboard

Overview of all InfraSage services:

  • Ingestion Gateway: events received/sec, error rate, DLQ depth
  • Telemetry Operator: Kafka consumer lag, ClickHouse write throughput
  • AIOps Engine: anomaly detection rate, RCA runs/hour, runbook executions
  • ClickHouse: disk usage, query latency, insert rate

Telemetry Ingestion Dashboard

Detailed ingestion metrics:

  • Events/sec by service and type (metric, log, trace, event, profile, slo)
  • Batch write latency percentiles (p50, p95, p99)
  • Validation failure breakdown by reason
  • Dead-letter queue depth over time

Anomaly Detection Dashboard

  • Anomaly score heatmap by service and metric
  • Anomaly rate timeline (total anomalies per hour)
  • Top anomalous services ranked by score
  • Z-score distribution per metric
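
For readers unfamiliar with the z-scores shown on this dashboard, the sketch below shows how a z-score is conventionally computed: the distance of a new sample from a baseline mean, measured in baseline standard deviations. It is an illustration only. The function names, the baseline window, and the threshold of 3.0 are assumptions, and this page does not state how InfraSage computes its anomaly scores internally.

```python
from statistics import mean, pstdev


def z_score(value, baseline):
    """Return how many standard deviations `value` lies from the baseline mean."""
    mu = mean(baseline)
    sigma = pstdev(baseline)  # population standard deviation of the baseline
    if sigma == 0:
        # A perfectly flat baseline has no spread; report 0.0 instead of dividing by zero.
        return 0.0
    return (value - mu) / sigma


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a sample whose absolute z-score exceeds the (assumed) threshold."""
    return abs(z_score(value, baseline)) > threshold


# Hypothetical latency samples in milliseconds.
baseline_ms = [98, 100, 102, 100, 100]
print(z_score(110, baseline_ms), is_anomalous(110, baseline_ms))
```

A fixed threshold is the simplest possible rule; a heatmap like the one on this dashboard is useful precisely because it shows the raw scores rather than only the thresholded result.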

RCA Results Dashboard

  • RCA runs per hour
  • Root cause category breakdown (infrastructure, application, external, unknown)
  • Average confidence score over time
  • Time-to-RCA distribution
  • Blast radius size histogram

ML Model Performance Dashboard

  • Model drift scores per service
  • Prediction accuracy (when ground truth is available via resolution feedback)
  • Feature importance rankings
  • Shadow vs. production model comparison

Prometheus Metrics

Prometheus scrapes all InfraSage services every 15 seconds. The Prometheus UI is available at http://localhost:9999.

Key metrics to query

# Ingestion rate (events per second)
rate(infrasage_ingestion_events_total[1m])

# Anomaly detection rate
rate(infrasage_anomalies_detected_total[5m])

# RCA completion rate
rate(infrasage_rca_completed_total[1h])

# ClickHouse write latency (p99)
histogram_quantile(0.99, sum by (le) (rate(infrasage_clickhouse_write_duration_seconds_bucket[5m])))

# Kafka consumer lag
infrasage_kafka_consumer_lag_sum

# DLQ depth
infrasage_dlq_depth

# API request latency (p99)
histogram_quantile(0.99, sum by (le) (rate(infrasage_http_request_duration_seconds_bucket[5m])))
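
The same queries can be run from a script through the Prometheus HTTP API (`/api/v1/query`). The sketch below uses only the Python standard library. It assumes Prometheus is reachable at the address given in this section; the helper names are illustrative and are not part of InfraSage.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9999"  # address given in this section


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(payload):
    """Extract (labels, value) pairs from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample looks like {"metric": {...labels...}, "value": [timestamp, "string value"]}.
    return [(sample["metric"], float(sample["value"][1])) for sample in data["result"]]


def query(promql, base_url=PROMETHEUS_URL, timeout=5.0):
    """Run an instant query against a live Prometheus and return parsed samples."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

With the stack running, `query("infrasage_dlq_depth")` would return one `(labels, value)` pair per series. Sample values arrive as strings in the API response, which is why the parser converts them to floats.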

Admin UI

The InfraSage Admin UI (React + TypeScript + Vite + TailwindCSS) is a full-featured control panel available at http://localhost:4000 when running with the UI service.

Pages

Page               Description
Overview           Live system health: ingestion rate, anomaly count, active incidents
Telemetry Browser  Query and filter raw telemetry from ClickHouse
Anomaly Explorer   Browse detected anomalies with score, service, and metric filters
RCA Results        Full RCA outputs with confidence, evidence, and suggested actions
ML Models          Model list, performance metrics, train/promote actions
Runbooks           Create, edit, execute, and review runbook history
Integrations       Configure Slack, PagerDuty, Jira, Teams, Webhooks
Tenants            Manage tenants, billing plans, API keys, user roles
Audit Log          Browse all state-changing operations
RBAC               Configure user roles and permissions

Alertmanager Integration

InfraSage listens for Prometheus Alertmanager webhooks at http://localhost:9093/api/v1/alerts/webhook. Configure your Alertmanager to forward alerts to InfraSage for AI-powered RCA:

# alertmanager.yml
receivers:
  - name: infrasage
    webhook_configs:
      - url: http://infrasage-aiops:9093/api/v1/alerts/webhook
        send_resolved: true

route:
  receiver: infrasage
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
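
To check the endpoint without waiting for a real alert, a hand-built payload can be posted to it. The sketch below builds a body in the shape Alertmanager uses for webhook notifications (payload version 4) and posts it with the standard library. The `service` and `severity` labels, the fingerprint, and the helper names are assumptions made for illustration; which labels InfraSage actually requires is not specified on this page.

```python
import json
import urllib.request
from datetime import datetime, timezone

WEBHOOK_URL = "http://localhost:9093/api/v1/alerts/webhook"  # address given in this section


def build_test_payload(alertname, service, status="firing"):
    """Build a minimal Alertmanager-style webhook body containing one alert."""
    labels = {"alertname": alertname, "service": service, "severity": "warning"}
    annotations = {"summary": f"Test alert for {service}"}
    alert = {
        "status": status,
        "labels": labels,
        "annotations": annotations,
        "startsAt": datetime.now(timezone.utc).isoformat(),
        "endsAt": "0001-01-01T00:00:00Z",  # zero time: the alert has not ended
        "generatorURL": "",
        "fingerprint": "test-fingerprint",  # placeholder value
    }
    return {
        "version": "4",
        "groupKey": f'{{}}:{{alertname="{alertname}"}}',
        "truncatedAlerts": 0,
        "status": status,
        "receiver": "infrasage",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "",
        "alerts": [alert],
    }


def post_webhook(payload, url=WEBHOOK_URL, timeout=5.0):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With the stack running, `post_webhook(build_test_payload("HighLatency", "checkout"))` sends one firing alert. Passing `status="resolved"` exercises the `send_resolved: true` path.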

When InfraSage receives an alert, it:

  1. Maps it to the corresponding service and metric in ClickHouse
  2. Triggers RCA if the cooldown period has elapsed
  3. Sends enriched analysis back through configured notification channels
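
The first two steps above can be pictured as a lookup followed by a cooldown gate. The sketch below is a conceptual illustration, not InfraSage's implementation: the `service` and `metric` label names, the `RcaGate` class, and the 15-minute default cooldown are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


def alert_target(alert):
    """Map an alert to a (service, metric) pair using assumed label names."""
    labels = alert.get("labels", {})
    return labels.get("service", "unknown"), labels.get("metric", "unknown")


@dataclass
class RcaGate:
    """Allow at most one RCA run per (service, metric) within the cooldown window."""

    cooldown: timedelta = timedelta(minutes=15)  # assumed default
    last_run: dict = field(default_factory=dict)

    def should_trigger(self, service, metric, now):
        key = (service, metric)
        previous = self.last_run.get(key)
        if previous is not None and now - previous < self.cooldown:
            return False  # still cooling down: skip this alert
        self.last_run[key] = now  # record the run that is about to start
        return True


gate = RcaGate(cooldown=timedelta(minutes=10))
alert = {"labels": {"service": "checkout", "metric": "latency_p99"}}
service, metric = alert_target(alert)
started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(gate.should_trigger(service, metric, started))
print(gate.should_trigger(service, metric, started + timedelta(minutes=5)))
```

Keying the cooldown on the service and metric pair means a burst of repeated alerts for one signal results in a single RCA run, while an unrelated service can still trigger its own.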

Adding Custom Grafana Dashboards

The bundled Grafana instance is provisioned from deployments/grafana/provisioning/. To add a custom dashboard:

  1. Create your dashboard JSON in Grafana
  2. Export it via Share → Export → Save to file
  3. Place the JSON in deployments/grafana/provisioning/dashboards/
  4. Restart Grafana: docker-compose restart grafana

Because the file lives in the provisioning volume, the dashboard persists across restarts.
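
Step 3 can be scripted if you add dashboards regularly. The sketch below loads an exported dashboard, clears the instance-specific numeric `id`, sets a stable `uid`, and writes the file into the provisioning directory named above. The helper names and the choice to name the file after the `uid` are assumptions, not InfraSage conventions.

```python
import json
from pathlib import Path

PROVISIONING_DIR = Path("deployments/grafana/provisioning/dashboards")


def prepare_dashboard(exported, uid):
    """Return a copy of an exported dashboard made safe to load from a file."""
    if not exported.get("title"):
        raise ValueError("dashboard JSON needs a non-empty 'title'")
    dashboard = dict(exported)
    dashboard["id"] = None  # numeric ids are instance-specific; let Grafana assign one
    dashboard["uid"] = uid  # stable identifier that survives re-imports
    return dashboard


def install_dashboard(export_path, uid, target_dir=PROVISIONING_DIR):
    """Write the prepared dashboard to <target_dir>/<uid>.json and return the path."""
    exported = json.loads(Path(export_path).read_text(encoding="utf-8"))
    dashboard = prepare_dashboard(exported, uid)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    out_path = target_dir / f"{uid}.json"
    out_path.write_text(json.dumps(dashboard, indent=2) + "\n", encoding="utf-8")
    return out_path
```

For example, `install_dashboard("my-export.json", "custom-latency")` would create `custom-latency.json` in the provisioning directory; the Grafana restart in step 4 is still required.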