Welcome to InfraSage

InfraSage is an enterprise-grade, AI-powered AIOps observability platform that solves the core pain point of modern SRE teams:

"We have more data than ever, but less clarity on what's actually wrong."

InfraSage gives you the full pipeline — from raw telemetry ingestion through automated incident resolution — in a single, cohesive system.


What InfraSage Does

Raw Telemetry (metrics/logs/traces)
  → Ingestion (77K msg/sec)
  → Anomaly Detection (< 100 ms)
  → AI Root Cause Analysis (~20 sec with Claude)
  → Auto Remediation (runbooks + rollback)

  1. Ingest — Receive metrics, logs, and traces from any source via Prometheus remote-write, OpenTelemetry (OTLP), or plain JSON at 77,000+ messages per second.
  2. Detect — Multi-layer anomaly detection (Z-score watchdog + Isolation Forest + adaptive thresholds) identifies problems in real time.
  3. Analyze — Anthropic Claude performs root cause analysis (RCA), estimates blast radius, and matches against historical incidents — all in ~20 seconds.
  4. Remediate — Automated runbooks execute Kubernetes, HTTP, shell, or Slack actions. Human approval gates keep humans in control of destructive steps.
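
To make the Detect step more concrete, here is a minimal sketch of a rolling Z-score check, the simplest of the three detection layers named above. It is an illustration only, not InfraSage's implementation: the `ZScoreWatchdog` class, its window size, its threshold of 3.0, and the choice to keep anomalous samples out of the baseline are all assumptions made for this example.

```python
from collections import deque
from math import sqrt


class ZScoreWatchdog:
    """Hypothetical detector: flags samples far from a rolling baseline."""

    def __init__(self, window: int = 60, threshold: float = 3.0, min_samples: int = 10):
        self.values = deque(maxlen=window)  # rolling baseline of recent normal samples
        self.threshold = threshold          # z-score above which a sample is flagged
        self.min_samples = min_samples      # warm-up size before any flagging

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the current baseline."""
        anomalous = False
        n = len(self.values)
        if n >= self.min_samples:
            mean = sum(self.values) / n
            variance = sum((v - mean) ** 2 for v in self.values) / n
            std = sqrt(variance)
            if std > 0:  # a perfectly flat baseline has no defined z-score
                anomalous = abs(value - mean) / std > self.threshold
        # Assumption: only normal samples update the baseline, so one spike
        # does not inflate the mean and mask the next one.
        if not anomalous:
            self.values.append(value)
        return anomalous


watchdog = ZScoreWatchdog(window=30, threshold=3.0, min_samples=10)
latencies_ms = [20, 21, 19, 22, 20, 21, 20, 19, 21, 20, 22, 250]
flags = [watchdog.observe(x) for x in latencies_ms]
print(flags)  # only the final 250 ms sample is flagged
```

A production detector would likely combine this with the Isolation Forest and adaptive thresholds mentioned above, since a plain Z-score assumes a roughly stable, non-seasonal baseline.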

Core Services

| Service | Port | Role |
|---|---|---|
| Ingestion Gateway | 8080 | Receives all telemetry; validates and streams to Kafka |
| Telemetry Operator | 8081 | Aggregates, enriches, normalizes; persists to ClickHouse |
| AIops Engine | 8080, 9093 | Anomaly detection, RCA, runbook execution, LLM integration |
| Integration Poller | | Polls AWS CloudWatch, Kubernetes metrics |
| RCA MCP Server | | Model Context Protocol bridge for AI agents |

Key Capabilities

  • Multi-format ingestion — Prometheus, OTLP, JSON; auto-detected and normalized
  • Statistical + ML anomaly detection — Z-score baselines, Isolation Forest, seasonal thresholds
  • LLM-backed RCA — Anthropic Claude with vector-similarity incident memory
  • Automated runbooks — Kubernetes, HTTP, shell, Slack; with human-in-the-loop approval
  • Advanced ML Engine — XGBoost, ARIMA forecasting, causal inference, model drift detection
  • 9 enterprise integrations — OpenTelemetry, AWS, Kubernetes, PagerDuty, Jira, Slack, Teams, Webhooks
  • Multi-tenancy — 5-tier RBAC, per-tenant data isolation, scoped API keys, usage metering
  • CPaaS billing — Free, Starter, Pro, and Enterprise plans with hourly event metering
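
The human-in-the-loop approval listed for automated runbooks can be pictured with a small sketch. Everything here is hypothetical: the `Step` type, the `run_runbook` function, and the `approve` callback are names invented for illustration and do not reflect InfraSage's runbook format or API. The sketch assumes a runbook is an ordered list of steps and that execution stops at the first destructive step that is not approved.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], str]   # performs the step and returns a short result
    destructive: bool = False   # destructive steps require explicit approval


def run_runbook(steps: List[Step], approve: Callable[[Step], bool]) -> List[str]:
    """Run steps in order; halt at the first destructive step that is denied."""
    log = []
    for step in steps:
        if step.destructive and not approve(step):
            log.append(f"{step.name}: blocked (approval denied)")
            break  # later steps may depend on the blocked one, so stop here
        log.append(f"{step.name}: {step.action()}")
    return log


steps = [
    Step("notify-slack", lambda: "sent"),
    Step("restart-deployment", lambda: "restarted", destructive=True),
    Step("verify-health", lambda: "healthy"),
]

denied = run_runbook(steps, approve=lambda step: False)
approved = run_runbook(steps, approve=lambda step: True)
print(denied)
print(approved)
```

In a real deployment the `approve` callback would presumably wait on a person (for example via a chat prompt or a ticket), while non-destructive steps proceed without interruption.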

Performance

Validated on a single t3.xlarge EC2 instance (4 vCPU, 16 GB RAM):

| Metric | Result | Target |
|---|---|---|
| Ingestion throughput | 77,190 msg/sec | 10,000 msg/sec |
| Total records (12 min run) | 53.5 million | 1 million |
| ClickHouse query latency | < 1 ms | < 100 ms |
| Anomaly detection latency | < 100 ms | real-time |
| RCA generation time | ~20 sec | |
| P99 API latency | < 50 ms | |
| Data loss | 0 records | 0 |
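
As a quick sanity check on the figures above, the arithmetic below derives the average rate implied by the total record count and the headroom over the throughput target. It uses only the reported numbers. The variable names are illustrative, and treating the 12-minute run as exactly 720 seconds is an assumption.

```python
total_records = 53_500_000   # "53.5 million" over the 12 min run
run_seconds = 12 * 60        # assumes the run lasted exactly 12 minutes
reported_rate = 77_190       # msg/sec, reported ingestion throughput
target_rate = 10_000         # msg/sec, stated target

average_rate = total_records / run_seconds   # average msg/sec over the whole run
headroom = reported_rate / target_rate       # multiple of the target achieved

print(f"average rate: {average_rate:,.0f} msg/sec")
print(f"headroom over target: {headroom:.1f}x")
```

The implied average comes out slightly below the reported 77,190 msg/sec. The source does not say whether that figure is a peak or a sustained-window measurement, so the gap may simply reflect rounding of the record count or the run duration.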

Quick Navigation