Welcome to InfraSage
InfraSage is an enterprise-grade, AI-powered AIOps observability platform that solves the core pain point of modern SRE teams:
"We have more data than ever, but less clarity on what's actually wrong."
InfraSage gives you the full pipeline — from raw telemetry ingestion through automated incident resolution — in a single, cohesive system.
What InfraSage Does
Raw Telemetry (metrics/logs/traces) → Ingestion (77K msg/sec) → Anomaly Detection (< 100 ms) → AI Root Cause Analysis (~20 sec with Claude) → Auto Remediation (runbooks + rollback)
- Ingest: Receive metrics, logs, and traces from any source via Prometheus remote-write, OpenTelemetry (OTLP), or plain JSON. Measured throughput exceeds 77,000 messages per second.
- Detect: Multi-layer anomaly detection (Z-score watchdog + Isolation Forest + adaptive thresholds) identifies problems in real time.
- Analyze: Anthropic Claude performs root cause analysis (RCA), estimates blast radius, and matches the incident against historical incidents, all in ~20 seconds.
- Remediate: Automated runbooks execute Kubernetes, HTTP, shell, or Slack actions. Approval gates require human sign-off before destructive steps run.
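To make the Detect step more concrete, here is a minimal sketch of a rolling Z-score check, the simplest of the three detection layers named above. It is not InfraSage's implementation: the class name `ZScoreWatchdog`, the window size, the threshold of 3.0, and the sample latencies are all assumptions chosen for illustration.

```python
from collections import deque
from math import sqrt


class ZScoreWatchdog:
    """Hypothetical detector: flags a sample whose z-score against a
    rolling window of recent values exceeds a threshold."""

    def __init__(self, window: int = 30, threshold: float = 3.0, min_samples: int = 5):
        self.values = deque(maxlen=window)  # rolling baseline
        self.threshold = threshold
        self.min_samples = min_samples      # warm-up before any flagging

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the current window."""
        anomalous = False
        n = len(self.values)
        if n >= self.min_samples:
            mean = sum(self.values) / n
            variance = sum((v - mean) ** 2 for v in self.values) / n
            std = sqrt(variance)
            # A perfectly flat window (std == 0) is treated as "no signal"
            # in this sketch rather than dividing by zero.
            if std > 0:
                anomalous = abs(value - mean) / std > self.threshold
        # Only normal samples update the baseline, so a spike does not
        # inflate the window statistics and mask later anomalies.
        if not anomalous:
            self.values.append(value)
        return anomalous


# Illustrative latency series (ms) with one obvious spike at index 8.
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 500, 101]
watchdog = ZScoreWatchdog(window=30, threshold=3.0)
flags = [watchdog.observe(v) for v in latencies_ms]
print(flags)
```

A Z-score check like this is cheap enough to run per sample, which is presumably why it is paired with heavier models such as Isolation Forest; how InfraSage combines the layers is not specified here.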
Core Services
| Service | Port | Role |
|---|---|---|
| Ingestion Gateway | 8080 | Receives all telemetry; validates and streams to Kafka |
| Telemetry Operator | 8081 | Aggregates, enriches, normalizes; persists to ClickHouse |
| AIOps Engine | 8080, 9093 | Anomaly detection, RCA, runbook execution, LLM integration |
| Integration Poller | — | Polls AWS CloudWatch, Kubernetes metrics |
| RCA MCP Server | — | Model Context Protocol bridge for AI agents |
Key Capabilities
- Multi-format ingestion — Prometheus, OTLP, JSON; auto-detected and normalized
- Statistical + ML anomaly detection — Z-score baselines, Isolation Forest, seasonal thresholds
- LLM-backed RCA — Anthropic Claude with vector-similarity incident memory
- Automated runbooks — Kubernetes, HTTP, shell, Slack; with human-in-the-loop approval
- Advanced ML Engine — XGBoost, ARIMA forecasting, causal inference, model drift detection
- 9 enterprise integrations, including OpenTelemetry, AWS, Kubernetes, PagerDuty, Jira, Slack, Teams, and Webhooks
- Multi-tenancy — 5-tier RBAC, per-tenant data isolation, scoped API keys, usage metering
- CPaaS billing — Free, Starter, Pro, and Enterprise plans with hourly event metering
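The "vector-similarity incident memory" capability can be illustrated with a small cosine-similarity lookup. This is a sketch under stated assumptions, not the product's code: the function names, the incident IDs, and the three-dimensional embeddings are hypothetical, and a real deployment would likely use a vector index rather than a linear scan.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_incidents(query, history, top_k=2):
    """Rank past incidents by similarity to the query embedding, best first."""
    scored = [(cosine_similarity(query, vec), incident_id)
              for incident_id, vec in history.items()]
    scored.sort(reverse=True)
    return [(incident_id, round(score, 3)) for score, incident_id in scored[:top_k]]


# Hypothetical embeddings of past incidents (real ones would have far more dimensions).
history = {
    "INC-101": [0.9, 0.1, 0.0],
    "INC-102": [0.0, 1.0, 0.0],
    "INC-103": [0.7, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # embedding of the new incident (assumed)
matches = most_similar_incidents(query, history)
print(matches)
```

The top matches could then be passed to the LLM as context for RCA, which is one plausible reading of how the incident memory and Claude fit together.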
Performance
Validated on a single t3.xlarge EC2 instance (4 vCPU, 16 GB RAM):
| Metric | Result | Target |
|---|---|---|
| Ingestion throughput | 77,190 msg/sec | 10,000 msg/sec |
| Total records (12 min run) | 53.5 million | 1 million |
| ClickHouse query latency | < 1 ms | < 100 ms |
| Anomaly detection latency | < 100 ms | real-time |
| RCA generation time | ~20 sec | — |
| P99 API latency | < 50 ms | — |
| Data loss | 0 records | 0 |
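As a quick way to relate the throughput and total-records rows, the arithmetic below derives a run-wide average rate and the headroom over the target. It assumes the "12 min run" lasted exactly 720 seconds; the table does not say whether 77,190 msg/sec is a peak or an average, so the two rates need not match exactly.

```python
# Figures taken from the table above.
total_records = 53_500_000   # records ingested during the run
run_seconds = 12 * 60        # assumption: the run lasted exactly 12 minutes
reported_rate = 77_190       # msg/sec, as reported
target_rate = 10_000         # msg/sec, target

# Average rate implied by total records over the assumed duration.
average_rate = total_records / run_seconds

# How many times the reported rate exceeds the target.
headroom = reported_rate / target_rate

print(f"average over run: {average_rate:,.0f} msg/sec")
print(f"headroom over target: {headroom:.1f}x")
```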
Quick Navigation
- New to InfraSage? Start with the Quick Start guide.
- Setting up production? See Deployment Options.
- Connecting your stack? Browse Integrations.
- API docs? Head to API Reference.
- Questions about billing? Check Plans & Pricing.