
Root Cause Analysis (RCA)

When InfraSage detects an anomaly, it automatically triggers a root cause analysis using Anthropic Claude. RCA runs asynchronously and completes in approximately 20 seconds.


How RCA Works

Anomaly detected by Watchdog

1. Gather evidence ← correlated metrics, logs, events, blast radius
2. Vector similarity search ← find past incidents that look similar
3. Causal validation ← verify temporal precedence between signals
4. Send context to Claude ← structured prompt with all evidence
5. Parse RCA response ← root cause, confidence, actions, affected services
6. Store in ClickHouse ← `infrasage_rca_results`
7. Send notifications ← Slack, PagerDuty, Teams

Evidence Gathering

Before calling Claude, InfraSage collects:

Evidence Type | Source | Detail
Correlated metrics | ClickHouse | Other metrics that spiked within ±5 min of the anomaly
Event correlation | ClickHouse | Kubernetes events, deployments near the anomaly time
Blast radius | Service graph | All services transitively affected (computed from correlation data)
Historical matches | HNSW vector index | Top-k similar past incidents with their known resolutions
Feature vector | Ring buffer | Raw value, derivative, second derivative, hour of day, day of week
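
To make the feature vector row concrete, here is a minimal Python sketch of how such a vector could be derived from a ring buffer. It assumes evenly spaced samples and plain finite differences; the function name `feature_vector` and the buffer capacity are hypothetical and are not taken from InfraSage itself.

```python
from collections import deque
from datetime import datetime, timezone

def feature_vector(samples, timestamp):
    """Build [raw, derivative, second derivative, hour of day, day of week].

    `samples` is a ring buffer of evenly spaced metric values, oldest first.
    Derivatives are finite differences per sample interval (an assumption).
    """
    if len(samples) < 3:
        raise ValueError("need at least 3 samples")
    x0, x1, x2 = samples[-3], samples[-2], samples[-1]
    derivative = x2 - x1                   # rate of change
    second_derivative = x2 - 2 * x1 + x0   # acceleration
    return [x2, derivative, second_derivative, timestamp.hour, timestamp.weekday()]

ring = deque(maxlen=5)  # hypothetical capacity; the oldest values are evicted
for value in [40.0, 42.0, 45.0, 55.0, 80.0, 95.0]:
    ring.append(value)

ts = datetime(2026, 4, 10, 12, 0, tzinfo=timezone.utc)
fv = feature_vector(ring, ts)
print(fv)
```

A bounded `deque` keeps memory constant per metric, which is the usual reason to hold recent samples in a ring buffer rather than querying storage for every evaluation.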

Evidence Scoring

Each piece of evidence is weighted:

Factor | Weight
Correlated metrics (>0.8 correlation) | High
Derivative value (rapid change) | High
Hour of day (peak vs off-peak) | Medium
Historical similarity match | Medium
Second derivative (acceleration) | Low
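
The table names weight tiers but not numbers. The sketch below shows one way tiered weights could be combined into a single score. The numeric values (3, 2, 1), the factor keys, and the weighted-average formula are all assumptions made for illustration.

```python
# Hypothetical numeric values for the High/Medium/Low tiers; the
# documentation only names the tiers, not the numbers behind them.
TIER_WEIGHT = {"high": 3.0, "medium": 2.0, "low": 1.0}

# Factor names are illustrative keys for the rows of the table above.
FACTOR_TIER = {
    "correlated_metrics": "high",
    "derivative": "high",
    "hour_of_day": "medium",
    "historical_similarity": "medium",
    "second_derivative": "low",
}

def evidence_score(signals):
    """Weighted average of per-factor signal strengths, each in [0, 1]."""
    total = 0.0
    weight_sum = 0.0
    for factor, strength in signals.items():
        weight = TIER_WEIGHT[FACTOR_TIER[factor]]
        total += weight * strength
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0

signals = {"correlated_metrics": 1.0, "derivative": 0.5, "second_derivative": 0.0}
score = evidence_score(signals)
print(round(score, 3))
```

With these assumed weights, a strong correlated-metric signal dominates the score even when the low-weight acceleration signal is absent.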

Causal Inference

InfraSage validates causality before reporting it. Two signals are considered causally related only if:

  1. Signal A consistently precedes signal B (temporal precedence)
  2. The correlation is high (>0.7 by default)
  3. The lag between A and B is consistent across multiple incidents

This avoids false conclusions like "high error rate causes high CPU" when the true direction is reversed.
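
The three conditions can be sketched as a single predicate. This is an illustration under stated assumptions, not InfraSage's actual check: the function name `is_causal`, the use of standard deviation to measure lag consistency, and the `max_lag_spread` tolerance are hypothetical. Only the 0.7 correlation default comes from the text above.

```python
from statistics import pstdev

def is_causal(lags_minutes, correlation, min_correlation=0.7, max_lag_spread=1.0):
    """Apply the three checks to one candidate pair (A -> B).

    lags_minutes: for each past incident, onset(B) - onset(A) in minutes.
    max_lag_spread: hypothetical tolerance on the lag's standard deviation.
    """
    if len(lags_minutes) < 2:
        return False  # "multiple incidents" needs at least two observations
    precedes = all(lag > 0 for lag in lags_minutes)      # 1. A always before B
    correlated = correlation > min_correlation            # 2. strong correlation
    consistent = pstdev(lags_minutes) <= max_lag_spread   # 3. stable lag
    return precedes and correlated and consistent

# DB latency leading CPU by about 3 minutes in three incidents.
print(is_causal([3.0, 3.5, 2.5], correlation=0.84))
# Same pair with the direction reversed: the lags are negative.
print(is_causal([-3.0, -3.5, -2.5], correlation=0.84))
```

Requiring every lag to be positive is what rejects the reversed direction: the correlation is identical either way, so only temporal ordering can distinguish cause from effect.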


RCA Response Structure

{
  "anomaly_id": "anom-7f3d",
  "service_id": "payment-api",
  "analyzed_at": "2026-04-10T12:00:20Z",
  "root_cause": {
    "category": "infrastructure",
    "confidence": 0.92,
    "summary": "CPU saturation on payment-api caused by memory pressure from uncached DB queries, leading to thread pool exhaustion and elevated error rates downstream in user-service.",
    "evidence": {
      "cpu_usage_high": 0.90,
      "memory_pressure": 0.85,
      "db_query_latency_spike": 0.78
    },
    "suggested_actions": [
      "Scale payment-api horizontally from 3 to 5 pods",
      "Restart affected pods to clear memory pressure",
      "Review DB connection pool size in payment-api config"
    ],
    "historical_matches": [
      {
        "incident_id": "inc-march-14",
        "resolution": "Scaled from 3 to 5 pods; cleared in 12 minutes",
        "similarity": 0.89,
        "time_to_recovery_mins": 12
      }
    ]
  },
  "blast_radius": ["user-service", "checkout-service"],
  "causal_relationships": [
    {
      "cause_metric": "db_query_latency_ms",
      "effect_metric": "cpu_usage_percent",
      "strength": 0.84,
      "lag_minutes": 3
    }
  ]
}
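
A consumer of this structure might reduce it to the few fields a notification needs. The Python sketch below works on a trimmed copy of the example above. The function name `summarize_rca` and the `min_confidence` threshold of 0.8 are hypothetical; the field names follow the example response.

```python
import json

def summarize_rca(raw, min_confidence=0.8):
    """Extract the fields a notifier would likely need from an RCA result."""
    rca = json.loads(raw)
    root = rca["root_cause"]
    evidence = root.get("evidence", {})
    actions = root.get("suggested_actions", [])
    return {
        "service": rca["service_id"],
        "category": root["category"],
        "confidence": root["confidence"],
        # Evidence key with the highest score, if any evidence is present.
        "top_evidence": max(evidence, key=evidence.get) if evidence else None,
        "first_action": actions[0] if actions else None,
        "blast_radius": rca.get("blast_radius", []),
        # Hypothetical gate: only treat high-confidence results as actionable.
        "actionable": root["confidence"] >= min_confidence,
    }

raw = """
{
  "anomaly_id": "anom-7f3d",
  "service_id": "payment-api",
  "root_cause": {
    "category": "infrastructure",
    "confidence": 0.92,
    "evidence": {"cpu_usage_high": 0.90, "memory_pressure": 0.85},
    "suggested_actions": ["Scale payment-api horizontally from 3 to 5 pods"]
  },
  "blast_radius": ["user-service", "checkout-service"]
}
"""
summary = summarize_rca(raw)
print(summary["top_evidence"], summary["actionable"])
```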

Blast Radius

The blast radius is the set of services transitively affected by the root-cause service. InfraSage computes it from the service correlation graph stored in ClickHouse:

# Query blast radius for a service
curl "http://localhost:8080/api/v1/rca/correlations?service_id=payment-api" \
  -H "Authorization: Bearer $YOUR_JWT"
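
"Transitively affected" means following correlation edges until no new service is reached. A minimal Python sketch of that traversal follows, assuming the graph is available as an adjacency mapping. The graph contents, the `search-api` service, and the function name `blast_radius` are hypothetical, and the response format of the endpoint above is not assumed here.

```python
from collections import deque

def blast_radius(graph, root):
    """Return services reachable from `root`, excluding `root` itself.

    Breadth-first search; the `seen` set makes cycles in the graph safe.
    """
    seen = {root}
    queue = deque([root])
    affected = []
    while queue:
        service = queue.popleft()
        for neighbor in graph.get(service, []):
            if neighbor not in seen:
                seen.add(neighbor)
                affected.append(neighbor)
                queue.append(neighbor)
    return affected

# Hypothetical correlation graph: an edge A -> B means B is affected by A.
graph = {
    "payment-api": ["user-service"],
    "user-service": ["checkout-service"],
    "checkout-service": ["payment-api"],  # cycle back to the root
    "search-api": ["user-service"],       # not reachable from payment-api
}
radius = blast_radius(graph, "payment-api")
print(radius)
```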

Incident Memory (Human Feedback)

After resolving an incident, you can feed the resolution back to InfraSage so it improves future RCA:

curl -X POST http://localhost:9093/api/v1/resolutions/webhook \
  -H "Content-Type: application/json" \
  -d '{
    "incident_id": "inc-7f3d",
    "service_id": "payment-api",
    "resolution": "Scaled payment-api from 3 to 5 pods. Root cause was memory leak in DB connection pool introduced in v2.3.1. Fixed by upgrading to v2.3.2.",
    "resolved_by": "alice@mycompany.com",
    "time_to_recovery_minutes": 18,
    "tags": ["memory-leak", "db-pool", "payment-api"]
  }'

The resolution is stored in `infrasage_knowledge_base` and `infrasage_incident_memory`, and surfaces in future RCA runs when similar patterns are detected.
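
To illustrate what "similar patterns" can mean in practice, the sketch below ranks stored incidents by cosine similarity to a query vector. It is an exhaustive search standing in for the HNSW index, which approximates the same ranking at scale. The choice of cosine similarity, the incident IDs, and the three-dimensional vectors are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_similar(query, memory, k=3):
    """Exhaustive nearest-neighbor search over stored incident vectors."""
    scored = [(cosine(query, vec), incident_id) for incident_id, vec in memory.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(incident_id, round(score, 3)) for score, incident_id in scored[:k]]

# Hypothetical incident memory: incident ID -> feature vector.
memory = {
    "inc-a": [1.0, 0.0, 0.0],
    "inc-b": [0.0, 1.0, 0.0],
    "inc-c": [1.0, 1.0, 0.0],
}
matches = top_k_similar([1.0, 0.1, 0.0], memory, k=2)
print(matches)
```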


Viewing RCA Results

Via Grafana

Open http://localhost:3000 and go to the RCA Results dashboard.

Via ClickHouse SQL

SELECT
    service_id,
    root_cause_category,
    root_cause_confidence,
    root_cause_summary,
    analyzed_at
FROM infrasage.infrasage_rca_results
WHERE analyzed_at > now() - INTERVAL 24 HOUR
ORDER BY analyzed_at DESC
LIMIT 20

Via MCP (AI Agent Integration)

The RCA MCP Server exposes RCA results to Claude-powered AI agents via the Model Context Protocol. See MCP Server docs for details.


Configuration

Variable | Default | Description
ANTHROPIC_API_KEY | (none) | Required for LLM RCA
ANTHROPIC_MODEL | claude-opus-4-6 | Claude model
WATCHDOG_RCA_COOLDOWN_MINUTES | 15 | Minimum gap between RCA runs per service/metric
VECTOR_HNSW_M | 16 | HNSW connectivity for incident similarity search
VECTORIZER_INTERVAL_SECONDS | 60 | How often to rebuild the vector index
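
As an illustration of how these variables and defaults fit together, here is a minimal Python sketch that reads them from an environment mapping. The `RCAConfig` class and `load_config` function are hypothetical helpers, not part of InfraSage. Only the variable names and default values come from the table above.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class RCAConfig:
    anthropic_api_key: str
    anthropic_model: str
    cooldown_minutes: int
    hnsw_m: int
    vectorizer_interval_seconds: int

def load_config(env=os.environ):
    """Read RCA settings from `env`, applying the documented defaults."""
    api_key = env.get("ANTHROPIC_API_KEY", "")
    if not api_key:
        # The key has no default and is required for LLM RCA.
        raise ValueError("ANTHROPIC_API_KEY is required for LLM RCA")
    return RCAConfig(
        anthropic_api_key=api_key,
        anthropic_model=env.get("ANTHROPIC_MODEL", "claude-opus-4-6"),
        cooldown_minutes=int(env.get("WATCHDOG_RCA_COOLDOWN_MINUTES", "15")),
        hnsw_m=int(env.get("VECTOR_HNSW_M", "16")),
        vectorizer_interval_seconds=int(env.get("VECTORIZER_INTERVAL_SECONDS", "60")),
    )

# Example with a placeholder key and one overridden default.
config = load_config({
    "ANTHROPIC_API_KEY": "sk-example",
    "WATCHDOG_RCA_COOLDOWN_MINUTES": "30",
})
print(config.anthropic_model, config.cooldown_minutes)
```

Failing fast on a missing key surfaces the misconfiguration at startup rather than on the first detected anomaly.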