
Architecture

InfraSage is built as a set of loosely coupled microservices connected through a Kafka-compatible message bus and a shared ClickHouse time-series database.


Data Flow

External Sources


┌─────────────────┐     Kafka topic:      ┌──────────────────────┐
│    Ingestion    │ ──► raw-telemetry ──► │  Telemetry Operator  │
│  Gateway :8080  │      (Redpanda)       │        :8081         │
│                 │                       │                      │
│ • Prometheus    │                       │ • Normalize          │
│ • OTLP          │                       │ • Enrich             │
│ • JSON          │                       │ • Aggregate          │
└─────────────────┘                       │ • Write to           │
        │                                 │   ClickHouse         │
        │ Healthz                         └──────────────────────┘
        │                                            │
        │                                           SQL
        │                                            │
        │                                 ┌──────────▼───────────┐
        │                                 │      ClickHouse      │
        │                                 │   (Time-Series DB)   │
        │                                 │                      │
        │                                 │   53M+ records/run   │
        │                                 │ < 1ms query latency  │
        │                                 └──────────┬───────────┘
        │                                            │
        │                                           SQL
        │                                            │
        │                            ┌───────────────▼──────────┐
        │                            │ AIops Engine :8080/9093  │
        │                            │                          │
        │                            │ • Watchdog (Z-score)     │
        │                            │ • Isolation Forest       │
        │                            │ • LLM RCA (Claude)       │
        │                            │ • Vector similarity      │
        │                            │ • Runbook execution      │
        │                            │ • Multi-tenant RBAC      │
        │                            └───────────────┬──────────┘
        │                                            │
        └──────────────────────────────┐             │ Webhooks / APIs
                                       │             ▼
                                 ┌─────▼────────────────────────┐
                                 │         Integrations         │
                                 │                              │
                                 │  Slack    PagerDuty    Jira  │
                                 │  Teams    Webhooks     OTEL  │
                                 │  AWS CloudWatch  Kubernetes  │
                                 └──────────────────────────────┘

Services

Ingestion Gateway (:8080)

The single entry point for all telemetry data. Responsibilities:

  • Format detection — automatically identifies Prometheus remote-write, OTLP, or JSON payloads
  • Validation — enforces timestamp freshness, value ranges, required fields
  • Idempotency — content-hash deduplication prevents duplicate writes during retries
  • DLQ — invalid records are stored in the dead-letter queue for inspection and replay
  • Kafka publishing — valid records are published to the raw-telemetry topic

Telemetry Operator (:8081)

Consumes from Kafka and processes the data stream:

  • Normalization — translates different source formats to a canonical TelemetryRecord model
  • Enrichment — adds inferred metadata (service discovery, environment tagging)
  • Batch writes — accumulates records and writes in configurable batches to ClickHouse (default: 10,000 records)
  • Exemplar management — stores high-cardinality trace/span references separately

AIops Engine (:8080 + :9093 webhook)

The brain of InfraSage:

  • Watchdog — polls ClickHouse on a configurable interval; maintains a per-metric ring buffer for sliding-window Z-score analysis
  • Isolation Forest — unsupervised ML anomaly scoring across multiple metrics simultaneously
  • LLM RCA — sends anomaly context to Anthropic Claude; receives structured root-cause, suggested actions, and confidence scores
  • Vector memory — HNSW index of past incidents; semantically matches new anomalies to historical patterns
  • Runbook executor — runs remediation actions (Kubernetes, HTTP, shell, Slack) with approval gates and rollback
  • Multi-tenancy — enforces RBAC, per-tenant quotas, and data isolation

Integration Poller

Continuously polls external systems:

  • AWS CloudWatch — EC2, RDS, Lambda, ALB, DynamoDB, S3 metrics
  • Kubernetes — pod/node metrics, namespace events

RCA MCP Server

Exposes InfraSage's RCA capabilities as a Model Context Protocol (MCP) server, allowing AI agents to query root-cause analysis results programmatically.


Infrastructure Components

| Component | Version | Purpose |
| --- | --- | --- |
| ClickHouse | v26+ | Time-series storage with MergeTree engine |
| Redpanda | v23.3+ | Kafka-compatible message broker |
| Prometheus | latest | Metric scraping and alerting |
| Grafana | latest | Visualization dashboards |

Database Schema (Key Tables)

| Table | Purpose |
| --- | --- |
| infrasage_raw_firehose | All ingested telemetry |
| infrasage_exemplars | High-cardinality trace/span data |
| infrasage_telemetry_catalog | Service/metric discovery |
| infrasage_anomalies | Detected anomalies with scores |
| infrasage_rca_results | RCA outputs from Claude |
| infrasage_incidents | Correlated incident groups |
| infrasage_knowledge_base | Historical RCA learnings |
| infrasage_api_keys | Tenant API keys |
| infrasage_usage_metering | Hourly event counts per tenant |
| infrasage_billing_plans | Free/Starter/Pro/Enterprise definitions |
| infrasage_audit_log | All state changes (365-day TTL) |

Technology Stack

| Layer | Technology |
| --- | --- |
| Language | Go 1.25 (statically compiled) |
| Storage | ClickHouse (columnar, MergeTree) |
| Messaging | Redpanda / Kafka (franz-go client) |
| Monitoring | Prometheus + Grafana |
| Tracing | OpenTelemetry |
| AI | Anthropic Claude API |
| Vector Index | HNSW (custom Go implementation) |
| Auth | JWT (golang.org/x/crypto) |
| Containers | Docker + Kubernetes |
| Cloud | AWS SDK v2 |