Dashboards & Visualization

InfraSage ships with pre-configured Grafana dashboards and a React-based Admin UI for managing every aspect of the platform.

Grafana Dashboards

Access Grafana at $GRAFANA_URL (default credentials: admin / admin).

System Health Dashboard

Overview of all InfraSage services:

Ingestion Gateway: events received/sec, error rate, DLQ depth
Telemetry Operator: Kafka consumer lag, ClickHouse write throughput
AIOps Engine: anomaly detection rate, RCA runs/hour, runbook executions
ClickHouse: disk usage, query latency, insert rate

Telemetry Ingestion Dashboard

Detailed ingestion metrics:

Events/sec by service and type (metric, log, trace, event, profile, slo)
Batch write latency percentiles (p50, p95, p99)
Validation failure breakdown by reason
Dead-letter queue depth over time

Anomaly Detection Dashboard

Anomaly score heatmap by service and metric
Anomaly rate timeline (total anomalies per hour)
Top anomalous services ranked by score
Z-score distribution per metric
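For readers unfamiliar with the z-scores plotted on this dashboard, the sketch below shows the textbook definition: how many standard deviations a sample sits from the mean of its window. The page does not state how InfraSage computes its scores (window size, baseline, or estimator), so the `z_scores` helper is purely illustrative and is not the platform's implementation.

```python
from statistics import mean, pstdev


def z_scores(values: list[float]) -> list[float]:
    """Return the z-score of each sample relative to the whole window."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the window
    if sigma == 0:
        # A constant series has no spread, so no sample deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

A detector built on this would typically flag samples whose absolute z-score exceeds a chosen threshold; the threshold InfraSage uses is not given here.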

RCA Results Dashboard

RCA runs per hour
Root cause category breakdown (infrastructure, application, external, unknown)
Average confidence score over time
Time-to-RCA distribution
Blast radius size histogram

ML Model Performance Dashboard

Model drift scores per service
Prediction accuracy (when ground truth is available via resolution feedback)
Feature importance rankings
Shadow vs. production model comparison

Prometheus Metrics

Prometheus scrapes all InfraSage services every 15 seconds. Access the Prometheus UI at $PROMETHEUS_URL.

Key metrics to query

# Ingestion rate (events per second)
rate(infrasage_ingestion_events_total[1m])

# Anomaly detection rate
rate(infrasage_anomalies_detected_total[5m])

# RCA completion rate
rate(infrasage_rca_completed_total[1h])

# ClickHouse write latency (p99 over the last 5 minutes)
histogram_quantile(0.99, sum by (le) (rate(infrasage_clickhouse_write_duration_seconds_bucket[5m])))

# Kafka consumer lag
infrasage_kafka_consumer_lag_sum

# DLQ depth
infrasage_dlq_depth

# API request latency (p99 over the last 5 minutes)
histogram_quantile(0.99, sum by (le) (rate(infrasage_http_request_duration_seconds_bucket[5m])))
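The queries above can also be run outside the Prometheus UI through Prometheus's standard HTTP API (`GET /api/v1/query`). The following is a minimal sketch using only the Python standard library. The helper names (`build_query_url`, `parse_instant_vector`, `run_query`) are made up for illustration, and it assumes the Prometheus instance at `$PROMETHEUS_URL` is reachable without authentication.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_instant_vector(payload: dict) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-vector response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample looks like {"metric": {labels}, "value": [unix_ts, "value-as-string"]}.
    return [(sample["metric"], float(sample["value"][1])) for sample in data["result"]]


def run_query(base_url: str, promql: str, timeout: float = 10.0) -> list[tuple[dict, float]]:
    """Run an instant query against Prometheus and return the parsed samples."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, `run_query(os.environ["PROMETHEUS_URL"], "infrasage_dlq_depth")` would return one `(labels, value)` pair per series currently reporting a DLQ depth.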

Admin UI

The InfraSage Admin UI (React + TypeScript + Vite + TailwindCSS) is a full-featured control panel, available at $INFRASAGE_UI_URL when the UI service is running.

Pages

Page	Description
Overview	Live system health — ingestion rate, anomaly count, active incidents
Telemetry Browser	Query and filter raw telemetry from ClickHouse
Anomaly Explorer	Browse detected anomalies with score, service, and metric filters
RCA Results	Full RCA outputs with confidence, evidence, and suggested actions
ML Models	Model list, performance metrics, train/promote actions
Runbooks	Create, edit, execute, and review runbook history
Integrations	Configure Slack, PagerDuty, Jira, Teams, Webhooks
Tenants	Manage tenants, billing plans, API keys, user roles
Audit Log	Browse all state-changing operations
RBAC	Configure user roles and permissions

Alertmanager Integration

InfraSage listens for Prometheus Alertmanager webhooks at $INFRASAGE_URL/api/v1/alerts/webhook. Configure your Alertmanager to forward alerts to InfraSage for AI-powered RCA:

# alertmanager.yml
receivers:
  - name: infrasage
    webhook_configs:
      - url: http://infrasage-aiops:9093/api/v1/alerts/webhook
        send_resolved: true

route:
  receiver: infrasage
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
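To smoke-test the webhook endpoint before pointing a real Alertmanager at it, one option is to post a synthetic payload in the shape Alertmanager's webhook receiver sends (payload version 4). The sketch below is illustrative only: the helper names are invented, the `service` label is an assumption about what InfraSage uses to map alerts to services, and the response InfraSage returns is not documented on this page.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(service: str, alertname: str, status: str = "firing") -> dict:
    """Build a payload shaped like an Alertmanager webhook notification (version 4)."""
    # Assumption: InfraSage reads a "service" label; adjust to your own label scheme.
    labels = {"alertname": alertname, "service": service, "severity": "warning"}
    return {
        "version": "4",
        "groupKey": f'{{}}:{{alertname="{alertname}"}}',
        "status": status,
        "receiver": "infrasage",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": {},
        "externalURL": "http://alertmanager:9093",  # placeholder value
        "alerts": [{
            "status": status,
            "labels": labels,
            "annotations": {"summary": "synthetic test alert"},
            "startsAt": datetime.now(timezone.utc).isoformat(),
            "endsAt": "0001-01-01T00:00:00Z",  # zero time, as sent for firing alerts
            "generatorURL": "",
            "fingerprint": "0000000000000000",  # placeholder value
        }],
    }


def post_alert(url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Calling `post_alert(f"{infrasage_url}/api/v1/alerts/webhook", build_test_alert("checkout", "HighLatency"))` with your own base URL would exercise the same path Alertmanager uses.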

When InfraSage receives an alert, it:

Maps it to the corresponding service and metric in ClickHouse
Triggers RCA if the cooldown period has elapsed
Sends enriched analysis back through configured notification channels
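The cooldown check in the second step can be pictured as a per-key gate: an alert for a given service and metric triggers RCA only if enough time has passed since the last trigger for that same pair. The sketch below is a hypothetical illustration of that idea. `CooldownGate`, its keying by `(service, metric)`, and the cooldown length are all assumptions, not InfraSage's actual implementation.

```python
import time
from typing import Callable, Dict, Tuple


class CooldownGate:
    """Allow at most one trigger per (service, metric) key within a cooldown window."""

    def __init__(self, cooldown_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._cooldown = cooldown_seconds
        self._clock = clock  # injectable so the gate can be tested without sleeping
        self._last_trigger: Dict[Tuple[str, str], float] = {}

    def should_trigger(self, service: str, metric: str) -> bool:
        """Return True and record the trigger time if the cooldown has elapsed."""
        key = (service, metric)
        now = self._clock()
        last = self._last_trigger.get(key)
        if last is not None and now - last < self._cooldown:
            return False  # still cooling down for this service/metric pair
        self._last_trigger[key] = now
        return True
```

Keying the gate by service and metric means a noisy alert on one service does not suppress RCA for another; whether InfraSage scopes its cooldown this way is not stated here.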

Adding Custom Grafana Dashboards

InfraSage's Grafana instance is provisioned from deployments/grafana/provisioning/. To add a custom dashboard:

Create your dashboard JSON in Grafana
Export it via Share → Export → Save to file
Place the JSON in deployments/grafana/provisioning/dashboards/
Restart Grafana: docker-compose restart grafana

Because the file lives in the provisioning volume, the dashboard persists across restarts.
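The copy step can be scripted so that a malformed export is caught before Grafana restarts. The following is a minimal sketch: `install_dashboard` is a hypothetical helper, and clearing the numeric `id` field is a common precaution for moving dashboards between Grafana instances rather than something this page requires.

```python
import json
from pathlib import Path


def install_dashboard(exported: Path, provisioning_dir: Path) -> Path:
    """Validate an exported dashboard and write it into the provisioning directory."""
    dashboard = json.loads(exported.read_text(encoding="utf-8"))
    if not isinstance(dashboard, dict) or not dashboard.get("title"):
        raise ValueError(f"{exported} does not look like a dashboard export (no title)")
    # Assumption: the numeric id belongs to the exporting instance, so drop it.
    dashboard["id"] = None
    provisioning_dir.mkdir(parents=True, exist_ok=True)
    target = provisioning_dir / exported.name
    target.write_text(json.dumps(dashboard, indent=2) + "\n", encoding="utf-8")
    return target
```

For example, `install_dashboard(Path("my-dashboard.json"), Path("deployments/grafana/provisioning/dashboards"))` followed by the Grafana restart above would cover steps 3 and 4.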
