Welcome to InfraSage
InfraSage is an enterprise-grade AIOps observability platform that addresses a core pain point of modern SRE teams:
"We have more data than ever, but less clarity on what's actually wrong."
InfraSage gives you the full pipeline — from raw telemetry ingestion through automated incident resolution — in a single, cohesive system.
What InfraSage Does
- Ingest — Receive metrics, logs, and traces from any source via Prometheus remote-write, OpenTelemetry (OTLP), or plain JSON. Horizontally scalable to handle millions of events per second.
- Detect — Multi-layer anomaly detection (Z-score watchdog + Isolation Forest + adaptive thresholds) identifies problems in real time.
- Analyze — Anthropic Claude performs root cause analysis (RCA), estimates blast radius, and matches against historical incidents — all in ~20 seconds.
- Remediate — Automated runbooks execute Kubernetes, HTTP, shell, or Slack actions. Human approval gates keep humans in control of destructive steps.
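To make the Detect stage more concrete, here is a minimal sketch of a rolling Z-score check. It illustrates the general technique only and is not InfraSage's implementation: the class name `ZScoreWatchdog` and the `window`, `threshold`, and `min_samples` parameters are assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, pstdev


class ZScoreWatchdog:
    """Flags a sample whose z-score against a rolling baseline exceeds a threshold.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, window=60, threshold=3.0, min_samples=10):
        self.values = deque(maxlen=window)  # rolling baseline of recent normal samples
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value):
        is_anomaly = False
        # Only score once enough history exists to form a baseline.
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            # A perfectly flat baseline has no spread, so scoring is skipped.
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.threshold
        # Learn only from normal samples so a spike does not inflate the baseline.
        if not is_anomaly:
            self.values.append(value)
        return is_anomaly
```

Feeding the watchdog a steady series near 100 and then a sample of 500 would flag only the last value. A production detector would likely combine this with the other layers listed above, since a plain Z-score assumes a roughly stationary signal.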
Core Services
| Service | Port | Role |
|---|---|---|
| Ingestion Gateway | 8080 | Receives all telemetry; validates and streams to Kafka |
| Telemetry Operator | 8081 | Aggregates, enriches, normalizes; persists to ClickHouse |
| AIOps Engine | 8080, 9093 | Anomaly detection, RCA, runbook execution, LLM integration |
| Integration Poller | — | Polls AWS CloudWatch, Kubernetes metrics |
| RCA MCP Server | — | Model Context Protocol bridge for AI agents |
Key Capabilities
- Multi-format ingestion — Prometheus, OTLP, JSON; auto-detected and normalized
- Statistical + ML anomaly detection — Z-score baselines, Isolation Forest, seasonal thresholds
- Log anomaly detection — Drain template clustering, novel/burst detection, heuristic semantic scanning (no instrumentation needed), Claude-powered duplicate suppression, raw log drill-down in alerts
- Causal Anomaly Detection (CIAD) — transfer entropy monitoring of causal relationships between signals; fires 3–10 min before full incidents
- Telemetry Quality Scoring (TQS) — per-service data quality grades across 5 dimensions; surfaces gaps and improvement actions before they cause missed detections
- LLM-backed RCA — Anthropic Claude with vector-similarity incident memory
- Automated runbooks — Kubernetes, HTTP, shell, Slack; with human-in-the-loop approval
- Advanced ML Engine — XGBoost, ARIMA forecasting, causal inference, model drift detection
- 9 enterprise integrations, including OpenTelemetry, AWS, Kubernetes, PagerDuty, Jira, Slack, Teams, and Webhooks
- Multi-tenancy — 5-tier RBAC, per-tenant data isolation, scoped API keys, usage metering
- CPaaS billing — Free, Starter, Pro, and Enterprise plans with hourly event metering
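The transfer entropy idea behind CIAD can be sketched as follows. This is an illustrative estimator only, assuming signals already discretized into symbols and a history length of one step. The function name `transfer_entropy` is hypothetical, and the estimator, discretization, and windowing that InfraSage actually uses are not specified here.

```python
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Estimate TE(source -> target) in bits for two equal-length discrete sequences.

    Uses a history length of 1:
    TE = sum p(y_next, y_prev, x_prev) * log2( p(y_next | y_prev, x_prev) / p(y_next | y_prev) )
    """
    n = len(target) - 1  # number of observed transitions
    triples = Counter(zip(target[1:], target[:-1], source[:-1]))  # (y_next, y_prev, x_prev)
    pairs_yx = Counter(zip(target[:-1], source[:-1]))             # (y_prev, x_prev)
    pairs_yy = Counter(zip(target[1:], target[:-1]))              # (y_next, y_prev)
    singles = Counter(target[:-1])                                # y_prev

    te = 0.0
    for (y_next, y_prev, x_prev), count in triples.items():
        p_joint = count / n
        p_with_source = count / pairs_yx[(y_prev, x_prev)]
        p_without_source = pairs_yy[(y_next, y_prev)] / singles[y_prev]
        te += p_joint * log2(p_with_source / p_without_source)
    return te
```

When `target` simply repeats `source` with a one-step lag, the estimate is clearly positive, while a constant `source` yields zero because it adds no information about the next value of `target`. A monitor built on this idea could track the estimate over sliding windows and alert when an established relationship weakens or a new one appears.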
Performance at Scale
InfraSage is architected for extreme throughput and low latency at every layer of the pipeline:
| Capability | Specification |
|---|---|
| Ingestion throughput | Scales horizontally — millions of events/sec across a multi-node cluster |
| ClickHouse query latency | Sub-millisecond on billions of stored records |
| Anomaly detection | Real-time and continuous, under 100 ms end-to-end; CIAD pre-fault alerts fire 3–10 min before full incidents |
| RCA generation | ~20 seconds from anomaly detected to root cause delivered |
| API latency (P99) | Under 50 ms |
| Data durability | Zero data loss — Kafka-backed buffering with DLQ for failed records |
Scale profiles and recommended infrastructure configurations are documented in Scale Profiles.
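The dead-letter queue (DLQ) pattern behind the durability claim in the table above can be illustrated with a small sketch. It uses in-memory lists in place of Kafka topics, and the function name `route_records` and the required field names are assumptions made for the example, not InfraSage's actual schema or code.

```python
import json


def route_records(raw_records, required=("service", "metric", "value")):
    """Split raw JSON strings into accepted events and dead-lettered records.

    The lists returned here stand in for a main topic and a DLQ topic.
    """
    accepted, dead_letter = [], []
    for raw in raw_records:
        try:
            event = json.loads(raw)
            if not isinstance(event, dict):
                raise ValueError("event must be a JSON object")
            missing = [key for key in required if key not in event]
            if missing:
                raise ValueError(f"missing fields: {missing}")
            accepted.append(event)
        except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
            # Keep the original payload and the failure reason so nothing is silently dropped.
            dead_letter.append({"raw": raw, "error": str(exc)})
    return accepted, dead_letter
```

The key property is that every input record ends up in exactly one of the two outputs, so failed records remain available for inspection and replay instead of being discarded.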
Quick Navigation
- New to InfraSage? Start with the Quick Start guide — sign up at console.infrasage.dev and send your first telemetry in minutes.
- Evaluating self-hosted? See Deployment Options for Cloud and Self-Hosted paths.
- Connecting your stack? Browse Integrations.
- API docs? Head to API Reference.
- Questions about billing? Check Plans & Pricing.