Welcome to InfraSage

InfraSage is an enterprise-grade, AI-powered AIOps observability platform that solves the core pain point of modern SRE teams:

"We have more data than ever, but less clarity on what's actually wrong."

InfraSage gives you the full pipeline — from raw telemetry ingestion through automated incident resolution — in a single, cohesive system.

What InfraSage Does

Figure: InfraSage data pipeline (Raw Telemetry → Ingestion → Detection → AI Root Cause Analysis → Remediation)

Ingest — Receive metrics, logs, and traces from any source via Prometheus remote-write, OpenTelemetry (OTLP), or plain JSON. Horizontally scalable to handle millions of events per second.
Detect — Multi-layer anomaly detection (Z-score watchdog + Isolation Forest + adaptive thresholds) identifies problems in real time.
Analyze — Anthropic Claude performs root cause analysis (RCA), estimates blast radius, and matches against historical incidents — all in ~20 seconds.
Remediate — Automated runbooks execute Kubernetes, HTTP, shell, or Slack actions. Approval gates require human sign-off before any destructive step runs.
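To make the Detect stage more concrete, here is a minimal sketch of a rolling Z-score check, the simplest of the three detection layers named above. It is an illustration under stated assumptions, not InfraSage's implementation: the class name `ZScoreWatchdog`, the window size, the threshold of 3.0, and the sample latencies are all hypothetical.

```python
from collections import deque
from math import sqrt


class ZScoreWatchdog:
    """Flags a sample whose z-score against a rolling baseline exceeds a threshold.

    Hypothetical sketch: names and defaults are assumptions, not InfraSage's API.
    """

    def __init__(self, window: int = 30, threshold: float = 3.0, min_samples: int = 5):
        self.values = deque(maxlen=window)  # rolling baseline of recent healthy samples
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the current baseline."""
        anomalous = False
        n = len(self.values)
        if n >= self.min_samples:
            mean = sum(self.values) / n
            variance = sum((v - mean) ** 2 for v in self.values) / n
            std = sqrt(variance)
            # A perfectly flat baseline (std == 0) yields no z-score, so nothing is flagged.
            if std > 0:
                anomalous = abs(value - mean) / std > self.threshold
        # Only non-anomalous samples update the baseline, so a spike does not
        # inflate the variance and hide the next spike.
        if not anomalous:
            self.values.append(value)
        return anomalous


watchdog = ZScoreWatchdog(window=30, threshold=3.0)
latencies_ms = [100, 102, 98, 101, 99, 103, 97, 100, 450, 101]
flags = [watchdog.observe(x) for x in latencies_ms]
print(flags)
```

With this sample series, only the 450 ms reading is flagged. A single static Z-score test is sensitive to seasonality and slow drift, which is one reason to layer it with other detectors such as Isolation Forest and adaptive thresholds.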

Service	Port	Role
Ingestion Gateway	8080	Receives all telemetry; validates and streams to Kafka
Telemetry Operator	8081	Aggregates, enriches, normalizes; persists to ClickHouse
AIOps Engine	8080, 9093	Anomaly detection, RCA, runbook execution, LLM integration
Integration Poller	—	Polls AWS CloudWatch, Kubernetes metrics
RCA MCP Server	—	Model Context Protocol bridge for AI agents

Multi-format ingestion — Prometheus, OTLP, JSON; auto-detected and normalized
Statistical + ML anomaly detection — Z-score baselines, Isolation Forest, seasonal thresholds
Log anomaly detection — Drain template clustering, novel/burst detection, heuristic semantic scanning (no instrumentation needed), Claude-powered duplicate suppression, raw log drill-down in alerts
Causal Anomaly Detection (CIAD) — transfer entropy monitoring of causal relationships between signals; fires 3–10 min before full incidents
Telemetry Quality Scoring (TQS) — per-service data quality grades across 5 dimensions; surfaces gaps and improvement actions before they cause missed detections
LLM-backed RCA — Anthropic Claude with vector-similarity incident memory
Automated runbooks — Kubernetes, HTTP, shell, Slack; with human-in-the-loop approval
Advanced ML Engine — XGBoost, ARIMA forecasting, causal inference, model drift detection
9 enterprise integrations, including OpenTelemetry, AWS, Kubernetes, PagerDuty, Jira, Slack, Teams, and Webhooks
Multi-tenancy — 5-tier RBAC, per-tenant data isolation, scoped API keys, usage metering
CPaaS billing — Free, Starter, Pro, and Enterprise plans with hourly event metering
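The CIAD feature above relies on transfer entropy, which measures how much the past of one signal reduces uncertainty about the next value of another. The source does not describe the estimator CIAD uses, so the following is only a textbook plug-in estimate for discretized series with a history length of one; the function name and the synthetic signals are assumptions made for illustration.

```python
import random
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Plug-in estimate of TE(source -> target) in bits, history length 1.

    Both inputs are equal-length sequences of discrete (hashable) symbols.
    """
    if len(source) != len(target):
        raise ValueError("series must have equal length")
    n = len(target) - 1  # number of (next, now) transitions
    if n < 1:
        raise ValueError("need at least two samples")

    triples = Counter()   # counts of (y_next, y_now, x_now)
    pairs_yx = Counter()  # counts of (y_now, x_now)
    pairs_yy = Counter()  # counts of (y_next, y_now)
    singles = Counter()   # counts of y_now
    for t in range(n):
        y_next, y_now, x_now = target[t + 1], target[t], source[t]
        triples[(y_next, y_now, x_now)] += 1
        pairs_yx[(y_now, x_now)] += 1
        pairs_yy[(y_next, y_now)] += 1
        singles[y_now] += 1

    te = 0.0
    for (y_next, y_now, x_now), count in triples.items():
        p_joint = count / n
        p_given_both = count / pairs_yx[(y_now, x_now)]            # p(y_next | y_now, x_now)
        p_given_self = pairs_yy[(y_next, y_now)] / singles[y_now]  # p(y_next | y_now)
        te += p_joint * log2(p_given_both / p_given_self)
    return te


# Synthetic example: `downstream` copies `upstream` with a one-step delay.
rng = random.Random(0)
upstream = [rng.randint(0, 1) for _ in range(2000)]
downstream = [0] + upstream[:-1]

forward = transfer_entropy(upstream, downstream)   # strong causal direction
backward = transfer_entropy(downstream, upstream)  # no causal influence
print(f"upstream -> downstream: {forward:.3f} bits")
print(f"downstream -> upstream: {backward:.3f} bits")
```

The forward estimate comes out close to one bit, while the reverse direction stays near zero. A monitor built on this idea would watch for a sustained change in such estimates between related signals. Real telemetry is continuous, so it would first need binning or a different estimator.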

InfraSage is architected for extreme throughput and low latency at every layer of the pipeline:

Capability	Specification
Ingestion throughput	Scales horizontally — millions of events/sec across a multi-node cluster
ClickHouse query latency	Sub-millisecond on billions of stored records
Anomaly detection	Real-time, continuous — under 100 ms end-to-end; CIAD pre-fault alerts fire 3–10 min before full incidents
RCA generation	~20 seconds from anomaly detected to root cause delivered
API latency (P99)	Under 50 ms
Data durability	Zero data loss — Kafka-backed buffering with a dead-letter queue (DLQ) for failed records
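The dead-letter queue pattern behind the durability row can be sketched without a Kafka client. This in-memory example only shows the routing decision: records that fail validation are kept with their error rather than dropped. The required field names, the function name, and the sample payloads are hypothetical, and a real deployment would publish to Kafka topics instead of Python lists.

```python
import json

# Hypothetical schema: the fields an event must carry to be accepted.
REQUIRED_FIELDS = {"service", "metric", "value"}


def route_records(raw_records):
    """Split raw JSON payloads into accepted events and dead-letter entries."""
    accepted, dead_letters = [], []
    for raw in raw_records:
        try:
            event = json.loads(raw)
            if not isinstance(event, dict):
                raise ValueError("event must be a JSON object")
            missing = REQUIRED_FIELDS - set(event)
            if missing:
                raise ValueError(f"missing fields: {sorted(missing)}")
            accepted.append(event)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            # Keep the original payload and the reason, so the record can be
            # inspected and replayed later instead of being lost.
            dead_letters.append({"raw": raw, "error": str(exc)})
    return accepted, dead_letters


records = [
    '{"service": "checkout", "metric": "latency_ms", "value": 42}',
    '{"service": "checkout", "metric": "latency_ms"}',
    'not json at all',
]
accepted, dead_letters = route_records(records)
print(len(accepted), "accepted;", len(dead_letters), "dead-lettered")
```

Storing the raw payload alongside the error is the design choice that matters here: it lets failed records be replayed after a schema or parser fix.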

Scale profiles and recommended infrastructure configurations are documented in Scale Profiles.

New to InfraSage? Start with the Quick Start guide — sign up at console.infrasage.dev and send your first telemetry in minutes.
Evaluating self-hosted? See Deployment Options for Cloud and Self-Hosted paths.
Connecting your stack? Browse Integrations.
API docs? Head to API Reference.
Questions about billing? Check Plans & Pricing.