Architecture
InfraSage is built as a set of loosely-coupled microservices connected through a Kafka-compatible message bus and a shared ClickHouse time-series database.
Data Flow
  External Sources
         │
         ▼
┌─────────────────┐     Kafka topic:      ┌──────────────────────┐
│    Ingestion    │ ──► raw-telemetry ──► │  Telemetry Operator  │
│  Gateway :8080  │      (Redpanda)       │        :8081         │
│                 │                       │                      │
│  • Prometheus   │                       │  • Normalize         │
│  • OTLP         │                       │  • Enrich            │
│  • JSON         │                       │  • Aggregate         │
└─────────────────┘                       │  • Write to          │
         │                                │    ClickHouse        │
         │ Healthz                        └──────────────────────┘
         │                                           │
         │                                          SQL
         │                                           │
         │                                ┌──────────▼───────────┐
         │                                │      ClickHouse      │
         │                                │   (Time-Series DB)   │
         │                                │                      │
         │                                │  53M+ records/run    │
         │                                │  < 1ms query latency │
         │                                └──────────┬───────────┘
         │                                           │
         │                                          SQL
         │                                           │
         │                           ┌───────────────▼──────────┐
         │                           │ AIops Engine :8080/9093  │
         │                           │                          │
         │                           │  • Watchdog (Z-score)    │
         │                           │  • Isolation Forest      │
         │                           │  • LLM RCA (Claude)      │
         │                           │  • Vector similarity     │
         │                           │  • Runbook execution     │
         │                           │  • Multi-tenant RBAC     │
         │                           └───────────────┬──────────┘
         │                                           │
         └──────────────────────────────┐            │ Webhooks / APIs
                                        │            ▼
                                  ┌─────▼────────────────────────┐
                                  │         Integrations         │
                                  │                              │
                                  │  Slack    PagerDuty    Jira  │
                                  │  Teams    Webhooks     OTEL  │
                                  │  AWS CloudWatch  Kubernetes  │
                                  └──────────────────────────────┘
Services
Ingestion Gateway (:8080)
The single entry point for all telemetry data. Responsibilities:
- Format detection — automatically identifies Prometheus remote-write, OTLP, or JSON payloads
- Validation — enforces timestamp freshness, value ranges, required fields
- Idempotency — content-hash deduplication prevents duplicate writes during retries
- DLQ — invalid records are stored in the dead-letter queue for inspection and replay
- Kafka publishing — valid records are published to the `raw-telemetry` topic
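
The idempotency step can be pictured as a set of content hashes that is checked before publishing. The sketch below is illustrative only: `Deduper`, `NewDeduper`, `Accept`, and `contentHash` are hypothetical names, SHA-256 is an assumed hash choice (the gateway's actual hash function is not stated here), and a real gateway would presumably bound the set with a TTL or LRU policy rather than keep every hash in memory.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// Deduper remembers the content hashes of payloads it has already accepted.
// Hypothetical type for illustration; the map grows without bound.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

// NewDeduper returns an empty deduplicator.
func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// contentHash returns the hex-encoded SHA-256 digest of the raw payload.
// SHA-256 is an assumption, not the gateway's documented choice.
func contentHash(payload []byte) string {
	sum := sha256.Sum256(payload)
	return hex.EncodeToString(sum[:])
}

// Accept reports whether the payload is new. A retried payload with
// identical bytes hashes to the same key and is rejected.
func (d *Deduper) Accept(payload []byte) bool {
	h := contentHash(payload)
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, dup := d.seen[h]; dup {
		return false
	}
	d.seen[h] = struct{}{}
	return true
}

func main() {
	d := NewDeduper()
	rec := []byte(`{"metric":"cpu_usage","value":0.42,"ts":1700000000}`)
	fmt.Println(d.Accept(rec)) // true: first delivery
	fmt.Println(d.Accept(rec)) // false: retry of the same bytes
}
```

Hashing the raw bytes means two payloads that differ only in field order or whitespace count as distinct; hashing a normalized form instead would be a separate design decision.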
Telemetry Operator (:8081)
Consumes from Kafka and processes the data stream:
- Normalization — translates different source formats to a canonical `TelemetryRecord` model
- Enrichment — adds inferred metadata (service discovery, environment tagging)
- Batch writes — accumulates records and writes in configurable batches to ClickHouse (default: 10,000 records)
- Exemplar management — stores high-cardinality trace/span references separately
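
Batch writing can be sketched as a buffer that flushes once it reaches the configured size. This is a minimal illustration, not the operator's code: `Record`, `Batcher`, `NewBatcher`, and the `flush` callback are hypothetical, the callback stands in for the ClickHouse insert, and the example uses a batch size of 3 instead of the 10,000 default so the behavior is visible. A time-based flush, which a real consumer would likely also need, is omitted.

```go
package main

import "fmt"

// Record is a hypothetical stand-in for the canonical TelemetryRecord model.
type Record struct {
	Metric string
	Value  float64
}

// Batcher buffers records and hands each full batch to flush.
type Batcher struct {
	size  int
	buf   []Record
	flush func([]Record) error // stand-in for the ClickHouse write
}

// NewBatcher returns a Batcher that flushes every size records.
func NewBatcher(size int, flush func([]Record) error) *Batcher {
	return &Batcher{size: size, buf: make([]Record, 0, size), flush: flush}
}

// Add appends a record and flushes when the batch is full.
func (b *Batcher) Add(r Record) error {
	b.buf = append(b.buf, r)
	if len(b.buf) >= b.size {
		return b.Flush()
	}
	return nil
}

// Flush writes any buffered records and starts a fresh buffer.
// In this sketch a failed batch is not retained for retry.
func (b *Batcher) Flush() error {
	if len(b.buf) == 0 {
		return nil
	}
	batch := b.buf
	b.buf = make([]Record, 0, b.size)
	return b.flush(batch)
}

func main() {
	b := NewBatcher(3, func(batch []Record) error {
		fmt.Printf("writing %d records\n", len(batch))
		return nil
	})
	for i := 0; i < 7; i++ {
		_ = b.Add(Record{Metric: "cpu_usage", Value: float64(i)})
	}
	_ = b.Flush() // drain the remainder, e.g. on shutdown
}
```

With seven records and a batch size of three, the callback runs with batches of 3, 3, and 1.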
AIops Engine (:8080 + :9093 webhook)
The brain of InfraSage:
- Watchdog — polls ClickHouse on a configurable interval; maintains a per-metric ring buffer for sliding-window Z-score analysis
- Isolation Forest — unsupervised ML anomaly scoring across multiple metrics simultaneously
- LLM RCA — sends anomaly context to Anthropic Claude; receives structured root-cause, suggested actions, and confidence scores
- Vector memory — HNSW index of past incidents; semantically matches new anomalies to historical patterns
- Runbook executor — runs remediation actions (Kubernetes, HTTP, shell, Slack) with approval gates and rollback
- Multi-tenancy — enforces RBAC, per-tenant quotas, and data isolation
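
The Watchdog's sliding-window Z-score can be sketched with a fixed-size ring buffer per metric. The code below is an assumption-laden illustration: `Ring`, `NewRing`, `Push`, and `ZScore` are hypothetical names, the population standard deviation is an assumed choice, and the alerting threshold is left to the caller because none is specified here.

```go
package main

import (
	"fmt"
	"math"
)

// Ring is a fixed-size circular buffer of recent samples for one metric.
type Ring struct {
	vals []float64
	next int
	full bool
}

// NewRing returns a ring holding the last n samples; n must be positive.
func NewRing(n int) *Ring { return &Ring{vals: make([]float64, n)} }

// Push stores a sample, overwriting the oldest one once the ring is full.
func (r *Ring) Push(v float64) {
	r.vals[r.next] = v
	r.next = (r.next + 1) % len(r.vals)
	if r.next == 0 {
		r.full = true
	}
}

// window returns the samples currently held; order does not matter
// for the mean and standard deviation.
func (r *Ring) window() []float64 {
	if r.full {
		return r.vals
	}
	return r.vals[:r.next]
}

// ZScore returns (v - mean) / stddev over the window. ok is false when
// the window has fewer than two samples or zero variance.
func (r *Ring) ZScore(v float64) (z float64, ok bool) {
	w := r.window()
	if len(w) < 2 {
		return 0, false
	}
	var sum float64
	for _, x := range w {
		sum += x
	}
	mean := sum / float64(len(w))
	var sq float64
	for _, x := range w {
		d := x - mean
		sq += d * d
	}
	std := math.Sqrt(sq / float64(len(w))) // population stddev (assumed)
	if std == 0 {
		return 0, false
	}
	return (v - mean) / std, true
}

func main() {
	r := NewRing(2)
	for _, v := range []float64{100, 1, 3} {
		r.Push(v) // 100 is evicted once 3 arrives
	}
	z, ok := r.ZScore(5)
	fmt.Println(z, ok) // 3 true
}
```

After the third push the window holds 3 and 1 (mean 2, standard deviation 1), so a new reading of 5 scores exactly 3. A caller would compare the absolute score against whatever threshold the deployment configures.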
Integration Poller
Continuously polls external systems:
- AWS CloudWatch — EC2, RDS, Lambda, ALB, DynamoDB, S3 metrics
- Kubernetes — pod/node metrics, namespace events
RCA MCP Server
Exposes InfraSage's RCA capabilities as a Model Context Protocol (MCP) server, allowing AI agents to query root-cause analysis results programmatically.
Infrastructure Components
| Component | Version | Purpose |
|---|---|---|
| ClickHouse | v26+ | Time-series storage with MergeTree engine |
| Redpanda | v23.3+ | Kafka-compatible message broker |
| Prometheus | latest | Metric scraping and alerting |
| Grafana | latest | Visualization dashboards |
Database Schema (Key Tables)
| Table | Purpose |
|---|---|
| `infrasage_raw_firehose` | All ingested telemetry |
| `infrasage_exemplars` | High-cardinality trace/span data |
| `infrasage_telemetry_catalog` | Service/metric discovery |
| `infrasage_anomalies` | Detected anomalies with scores |
| `infrasage_rca_results` | RCA outputs from Claude |
| `infrasage_incidents` | Correlated incident groups |
| `infrasage_knowledge_base` | Historical RCA learnings |
| `infrasage_api_keys` | Tenant API keys |
| `infrasage_usage_metering` | Hourly event counts per tenant |
| `infrasage_billing_plans` | Free/Starter/Pro/Enterprise definitions |
| `infrasage_audit_log` | All state changes (365-day TTL) |
Technology Stack
| Layer | Technology |
|---|---|
| Language | Go 1.25 (statically compiled) |
| Storage | ClickHouse (columnar, MergeTree) |
| Messaging | Redpanda / Kafka (franz-go client) |
| Monitoring | Prometheus + Grafana |
| Tracing | OpenTelemetry |
| AI | Anthropic Claude API |
| Vector Index | HNSW (custom Go implementation) |
| Auth | JWT (golang.org/x/crypto) |
| Containers | Docker + Kubernetes |
| Cloud | AWS SDK v2 |