Root Cause Analysis (RCA)

When InfraSage detects an anomaly, it automatically triggers a root cause analysis using Anthropic Claude. RCA runs asynchronously and completes in approximately 20 seconds.

How RCA Works

Anomaly detected by Watchdog
          │
          ▼
  1. Gather evidence          ← correlated metrics, logs, events, blast radius
          │
          ▼
  2. Vector similarity search ← find past incidents that look similar
          │
          ▼
  3. Causal validation        ← verify temporal precedence between signals
          │
          ▼
  4. Send context to Claude   ← structured prompt with all evidence
          │
          ▼
  5. Parse RCA response       ← root cause, confidence, actions, affected services
          │
          ▼
  6. Store in ClickHouse      ← `infrasage_rca_results`
          │
          ▼
  7. Send notifications       ← Slack, PagerDuty, Teams
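
Step 2 can be illustrated with a small sketch. It ranks past incidents by cosine similarity using a brute-force scan, which stands in for the HNSW index (an approximate nearest-neighbour structure that avoids scanning every vector). The function names, incident IDs, and two-dimensional vectors are hypothetical and only illustrate the idea.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, incidents, k=3):
    """Return the k (incident_id, similarity) pairs most similar to `query`."""
    scored = [(cosine(query, vec), incident_id) for incident_id, vec in incidents.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(incident_id, score) for score, incident_id in scored[:k]]


# Hypothetical stored incident vectors.
past = {"inc-a": [1.0, 0.0], "inc-b": [0.0, 1.0], "inc-c": [1.0, 1.0]}
for incident_id, score in top_k_similar([1.0, 0.1], past, k=2):
    print(incident_id, round(score, 3))
```

A brute-force scan is exact but linear in the number of stored incidents; an HNSW index trades a little recall for much faster lookups on large histories.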

Evidence Gathering

Before calling Claude, InfraSage collects:

Evidence Type	Source	Detail
Correlated metrics	ClickHouse	Other metrics that spiked within ±5 min of the anomaly
Event correlation	ClickHouse	Kubernetes events, deployments near the anomaly time
Blast radius	Service graph	All services transitively affected (computed from correlation data)
Historical matches	HNSW vector index	Top-k similar past incidents with their known resolutions
Feature vector	Ring buffer	Raw value, derivative, second derivative, hour of day, day of week
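
To make the feature vector row concrete, here is a minimal sketch that derives the five features from a ring buffer of recent samples. The `feature_vector` function, the fixed sampling interval, and the finite-difference formulas are assumptions for illustration, not InfraSage's actual implementation.

```python
from collections import deque
from datetime import datetime, timezone


def feature_vector(samples, timestamp, interval_s=60.0):
    """Build [value, derivative, second derivative, hour of day, day of week].

    `samples` holds recent metric values, oldest first, taken at a fixed
    interval of `interval_s` seconds (an assumption of this sketch).
    """
    if len(samples) < 3:
        raise ValueError("need at least 3 samples for a second derivative")
    x0, x1, x2 = list(samples)[-3:]
    first = (x2 - x1) / interval_s                  # backward difference
    second = (x2 - 2 * x1 + x0) / interval_s ** 2   # second-order difference
    return [x2, first, second, timestamp.hour, timestamp.weekday()]


ring = deque([40.0, 55.0, 90.0], maxlen=3)  # hypothetical CPU % samples
ts = datetime(2026, 4, 10, 12, 0, tzinfo=timezone.utc)
print(feature_vector(ring, ts))
```

A `deque` with `maxlen` behaves like a ring buffer: appending a new sample silently drops the oldest one.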

Evidence Scoring

Each piece of evidence is weighted:

Factor	Weight
Correlated metrics (>0.8 correlation)	High
Derivative value (rapid change)	High
Hour of day (peak vs off-peak)	Medium
Historical similarity match	Medium
Second derivative (acceleration)	Low
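
One way to read the table is as a weighted average over whichever factors are present. The sketch below does that. The numeric values behind High, Medium, and Low (3, 2, 1) and the factor keys are assumptions, since the tiers above are only qualitative.

```python
# Assumed numeric weights; the documentation only specifies High/Medium/Low.
TIER_WEIGHT = {"High": 3.0, "Medium": 2.0, "Low": 1.0}

# Tier per factor, taken from the table above (key names are hypothetical).
FACTOR_TIER = {
    "correlated_metrics": "High",
    "derivative": "High",
    "hour_of_day": "Medium",
    "historical_similarity": "Medium",
    "second_derivative": "Low",
}


def evidence_score(signals):
    """Weighted average of per-factor signal strengths, each in [0, 1]."""
    total = sum(TIER_WEIGHT[FACTOR_TIER[name]] for name in signals)
    if total == 0:
        return 0.0
    weighted = sum(TIER_WEIGHT[FACTOR_TIER[name]] * s for name, s in signals.items())
    return weighted / total


score = evidence_score({"correlated_metrics": 0.9, "second_derivative": 0.2})
print(round(score, 3))
```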

Causal Inference

InfraSage validates causality before reporting it. Two signals are considered causally related only if all three conditions hold:

1. Signal A consistently precedes signal B (temporal precedence).
2. The correlation between the two signals is high (>0.7 by default).
3. The lag between A and B is consistent across multiple incidents.

This avoids false conclusions like "high error rate causes high CPU" when the true direction is reversed.
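
The checks above can be sketched with a lagged Pearson correlation. This is an illustrative approximation, not InfraSage's implementation: the function names, the `max_lag_spread` tolerance, and the sample series are assumptions, and each pair of series is assumed to have equal length and sampling interval.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def best_lag(cause, effect, max_lag=10):
    """Return (lag, r) for the non-negative lag at which `effect` best tracks `cause`."""
    best = (0, -1.0)
    for lag in range(max_lag + 1):
        a = cause[: len(cause) - lag]  # cause, truncated at the end
        b = effect[lag:]               # effect, shifted back by `lag` samples
        if len(a) < 3:
            break
        r = pearson(a, b)
        if r > best[1]:
            best = (lag, r)
    return best


def is_causal(incidents, min_corr=0.7, max_lag_spread=1):
    """Apply the three checks to a list of (cause_series, effect_series) pairs."""
    if len(incidents) < 2:
        return False  # lag consistency needs more than one incident
    results = [best_lag(c, e) for c, e in incidents]
    lags = [lag for lag, _ in results]
    precedes = all(lag >= 1 for lag in lags)              # temporal precedence
    correlated = all(r > min_corr for _, r in results)    # correlation threshold
    consistent = max(lags) - min(lags) <= max_lag_spread  # stable lag
    return precedes and correlated and consistent


# Hypothetical incidents: the effect repeats the cause 3 samples later.
incidents = [
    ([1, 1, 5, 9, 5, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 5, 9, 5, 1, 1]),
    ([2, 2, 2, 8, 14, 8, 2, 2, 2, 2, 2, 2], [2, 2, 2, 2, 2, 2, 8, 14, 8, 2, 2, 2]),
]
print(is_causal(incidents))                              # cause -> effect
print(is_causal([(e, c) for c, e in incidents]))         # reversed direction
```

Because only non-negative lags are searched, swapping the two series fails the correlation check, which is how the reversed direction gets rejected.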

RCA Response Structure

{
  "anomaly_id": "anom-7f3d",
  "service_id": "payment-api",
  "analyzed_at": "2026-04-10T12:00:20Z",
  "root_cause": {
    "category": "infrastructure",
    "confidence": 0.92,
    "summary": "CPU saturation on payment-api caused by memory pressure from uncached DB queries, leading to thread pool exhaustion and elevated error rates downstream in user-service.",
    "evidence": {
      "cpu_usage_high": 0.90,
      "memory_pressure": 0.85,
      "db_query_latency_spike": 0.78
    },
    "suggested_actions": [
      "Scale payment-api horizontally from 3 to 5 pods",
      "Restart affected pods to clear memory pressure",
      "Review DB connection pool size in payment-api config"
    ],
    "historical_matches": [
      {
        "incident_id": "inc-march-14",
        "resolution": "Scaled from 3 to 5 pods; cleared in 12 minutes",
        "similarity": 0.89,
        "time_to_recovery_mins": 12
      }
    ]
  },
  "blast_radius": ["user-service", "checkout-service"],
  "causal_relationships": [
    {
      "cause_metric": "db_query_latency_ms",
      "effect_metric": "cpu_usage_percent",
      "strength": 0.84,
      "lag_minutes": 3
    }
  ]
}
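
A consumer of this payload might reduce it to the handful of fields it acts on. The sketch below parses a trimmed copy of the example above. The `RcaSummary` type, the `parse_rca` helper, and the range check on `confidence` are assumptions, not part of an InfraSage client library.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RcaSummary:
    anomaly_id: str
    service_id: str
    category: str
    confidence: float
    first_action: Optional[str]
    blast_radius: List[str]


def parse_rca(raw: str) -> RcaSummary:
    """Extract the key fields from an RCA response document."""
    doc = json.loads(raw)
    root = doc["root_cause"]
    confidence = float(root["confidence"])
    if not 0.0 <= confidence <= 1.0:  # assumed valid range
        raise ValueError(f"confidence out of range: {confidence}")
    actions = root.get("suggested_actions", [])
    return RcaSummary(
        anomaly_id=doc["anomaly_id"],
        service_id=doc["service_id"],
        category=root["category"],
        confidence=confidence,
        first_action=actions[0] if actions else None,
        blast_radius=doc.get("blast_radius", []),
    )


# Trimmed version of the example response above.
raw = """{
  "anomaly_id": "anom-7f3d",
  "service_id": "payment-api",
  "root_cause": {
    "category": "infrastructure",
    "confidence": 0.92,
    "suggested_actions": ["Scale payment-api horizontally from 3 to 5 pods"]
  },
  "blast_radius": ["user-service", "checkout-service"]
}"""
summary = parse_rca(raw)
print(summary.service_id, summary.category, summary.confidence)
```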

Blast Radius

The blast radius is the set of services transitively affected by the root-cause service. InfraSage computes it from the service correlation graph stored in ClickHouse:

# Query blast radius for a service
curl "$INFRASAGE_URL/api/v1/rca/correlations?service_id=payment-api" \
  -H "Authorization: Bearer $YOUR_JWT"
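
"Transitively affected" means following dependency edges until no new services appear. A breadth-first traversal is one common way to compute such a set. The sketch below assumes the correlation graph has already been loaded into a plain adjacency dict; the graph contents and the `blast_radius` function are hypothetical.

```python
from collections import deque


def blast_radius(graph, root):
    """Return every service reachable from `root`, excluding `root` itself."""
    seen = {root}
    queue = deque([root])
    affected = []
    while queue:
        service = queue.popleft()
        for neighbor in graph.get(service, ()):
            if neighbor not in seen:  # guards against cycles
                seen.add(neighbor)
                affected.append(neighbor)
                queue.append(neighbor)
    return affected


# Hypothetical graph: service -> services it affects.
graph = {
    "payment-api": ["user-service"],
    "user-service": ["checkout-service", "payment-api"],  # cycle back to root
    "checkout-service": [],
}
print(blast_radius(graph, "payment-api"))
```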

Incident Memory (Human Feedback)

After resolving an incident, you can feed the resolution back to InfraSage so it improves future RCA:

curl -X POST "$INFRASAGE_URL/api/v1/resolutions/webhook" \
  -H "Content-Type: application/json" \
  -d '{
    "incident_id": "inc-7f3d",
    "service_id": "payment-api",
    "resolution": "Scaled payment-api from 3 to 5 pods. Root cause was memory leak in DB connection pool introduced in v2.3.1. Fixed by upgrading to v2.3.2.",
    "resolved_by": "sre@example.com",
    "time_to_recovery_minutes": 18,
    "tags": ["memory-leak", "db-pool", "payment-api"]
  }'

The resolution is stored in `infrasage_knowledge_base` and `infrasage_incident_memory`, and surfaces in future RCA runs when similar patterns are detected.
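
The same webhook call can be made from a script, for example at the end of an incident-closing workflow. This sketch builds the request with Python's standard library and leaves sending as a separate step. The base URL is a placeholder, the field values are shortened from the example above, and whether the endpoint needs an `Authorization` header is not stated here, so none is set.

```python
import json
import urllib.request


def build_resolution_request(base_url, resolution):
    """Build (but do not send) the POST request for the resolutions webhook."""
    body = json.dumps(resolution).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/resolutions/webhook",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_resolution_request(
    "https://infrasage.example.com",  # placeholder for $INFRASAGE_URL
    {
        "incident_id": "inc-7f3d",
        "service_id": "payment-api",
        "resolution": "Scaled payment-api from 3 to 5 pods.",
        "time_to_recovery_minutes": 18,
        "tags": ["memory-leak", "db-pool", "payment-api"],
    },
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req, timeout=10)
```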

Viewing RCA Results

Via Grafana

Open $GRAFANA_URL → RCA Results dashboard.

Via ClickHouse SQL

SELECT
  service_id,
  root_cause_category,
  root_cause_confidence,
  root_cause_summary,
  analyzed_at
FROM infrasage.infrasage_rca_results
WHERE analyzed_at > now() - INTERVAL 24 HOUR
ORDER BY analyzed_at DESC
LIMIT 20

Via MCP (AI Agent Integration)

The RCA MCP Server exposes RCA results to Claude-powered AI agents via the Model Context Protocol. See MCP Server docs for details.

Configuration

Variable	Default	Description
`ANTHROPIC_API_KEY`	—	Required for LLM RCA
`ANTHROPIC_MODEL`	`claude-opus-4-6`	Claude model
`WATCHDOG_RCA_COOLDOWN_MINUTES`	`15`	Minimum gap between RCA runs per service/metric
`VECTOR_HNSW_M`	`16`	HNSW connectivity for incident similarity search
`VECTORIZER_INTERVAL_SECONDS`	`60`	How often to rebuild the vector index
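
For tooling that needs to mirror these settings, the following sketch reads the variables with the defaults from the table and shows one way to interpret the cooldown. The `RcaConfig` type and both helper functions are hypothetical; only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RcaConfig:
    anthropic_api_key: str
    anthropic_model: str
    cooldown_minutes: int
    hnsw_m: int
    vectorizer_interval_seconds: int


def load_config(env=None):
    """Read RCA settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "")
    if not key:
        raise ValueError("ANTHROPIC_API_KEY is required for LLM RCA")
    return RcaConfig(
        anthropic_api_key=key,
        anthropic_model=env.get("ANTHROPIC_MODEL", "claude-opus-4-6"),
        cooldown_minutes=int(env.get("WATCHDOG_RCA_COOLDOWN_MINUTES", "15")),
        hnsw_m=int(env.get("VECTOR_HNSW_M", "16")),
        vectorizer_interval_seconds=int(env.get("VECTORIZER_INTERVAL_SECONDS", "60")),
    )


def cooldown_elapsed(last_run, now, cooldown_minutes):
    """True if no RCA has run yet for this service/metric, or the gap has passed."""
    return last_run is None or now - last_run >= timedelta(minutes=cooldown_minutes)


cfg = load_config({"ANTHROPIC_API_KEY": "test-key"})  # explicit mapping for the demo
last = datetime(2026, 4, 10, 12, 0)
print(cooldown_elapsed(last, last + timedelta(minutes=10), cfg.cooldown_minutes))
print(cooldown_elapsed(last, last + timedelta(minutes=15), cfg.cooldown_minutes))
```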
