Root Cause Analysis (RCA)
When InfraSage detects an anomaly, it automatically triggers a root cause analysis using Anthropic Claude. RCA runs asynchronously and completes in approximately 20 seconds.
How RCA Works
Anomaly detected by Watchdog
│
▼
1. Gather evidence ← correlated metrics, logs, events, blast radius
│
▼
2. Vector similarity search ← find past incidents that look similar
│
▼
3. Causal validation ← verify temporal precedence between signals
│
▼
4. Send context to Claude ← structured prompt with all evidence
│
▼
5. Parse RCA response ← root cause, confidence, actions, affected services
│
▼
6. Store in ClickHouse ← `infrasage_rca_results`
│
▼
7. Send notifications ← Slack, PagerDuty, Teams
Evidence Gathering
Before calling Claude, InfraSage collects:
| Evidence Type | Source | Detail |
|---|---|---|
| Correlated metrics | ClickHouse | Other metrics that spiked within ±5 min of the anomaly |
| Event correlation | ClickHouse | Kubernetes events, deployments near the anomaly time |
| Blast radius | Service graph | All services transitively affected (computed from correlation data) |
| Historical matches | HNSW vector index | Top-k similar past incidents with their known resolutions |
| Feature vector | Ring buffer | Raw value, derivative, second derivative, hour of day, day of week |
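
The feature vector and the historical match lookup can be illustrated with a small sketch. It assumes evenly spaced samples in the ring buffer, uses finite differences for the two derivatives, and substitutes a brute-force cosine similarity scan for the HNSW index (HNSW is an approximate nearest-neighbour structure; the exhaustive scan below only shows what it approximates). The function names and the incident IDs `inc-a` and `inc-b` are hypothetical, and a real system would likely normalise the features before comparing them.

```python
import math
from collections import deque
from datetime import datetime, timezone


def feature_vector(ring, timestamp):
    """Build [value, derivative, second derivative, hour, weekday] from the
    three most recent samples in the ring buffer (oldest first)."""
    if len(ring) < 3:
        raise ValueError("need at least three samples")
    x0, x1, x2 = list(ring)[-3:]
    derivative = x2 - x1                  # first difference per sample interval
    second_derivative = x2 - 2 * x1 + x0  # second difference (acceleration)
    return [float(x2), float(derivative), float(second_derivative),
            float(timestamp.hour), float(timestamp.weekday())]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, incidents, k=3):
    """Exhaustive top-k search; stands in for the HNSW index (assumption)."""
    scored = [(incident_id, cosine_similarity(query, vec))
              for incident_id, vec in incidents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical data: three CPU samples and two stored incident vectors.
ring = deque([40.0, 55.0, 90.0], maxlen=3)
query = feature_vector(ring, datetime(2026, 4, 10, 12, 0, tzinfo=timezone.utc))
past = {
    "inc-a": [88.0, 30.0, 18.0, 12.0, 4.0],
    "inc-b": [20.0, -1.0, 0.0, 3.0, 6.0],
}
matches = top_k_similar(query, past, k=1)
```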
Evidence Scoring
Each piece of evidence is weighted:
| Factor | Weight |
|---|---|
| Correlated metrics (>0.8 correlation) | High |
| Derivative value (rapid change) | High |
| Hour of day (peak vs off-peak) | Medium |
| Historical similarity match | Medium |
| Second derivative (acceleration) | Low |
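
The table gives only qualitative weights. One way to turn them into a single score, shown here purely as an illustration, is a weighted average of per-factor strengths. The numeric values 3, 2, and 1 for High, Medium, and Low, the factor keys, and the `evidence_score` function are assumptions, not values taken from InfraSage.

```python
# Hypothetical numeric values for the qualitative weights in the table.
WEIGHTS = {"high": 3.0, "medium": 2.0, "low": 1.0}

# Factor names are illustrative keys for the rows of the table.
FACTOR_LEVEL = {
    "correlated_metrics": "high",
    "derivative": "high",
    "hour_of_day": "medium",
    "historical_similarity": "medium",
    "second_derivative": "low",
}


def evidence_score(strengths):
    """Weighted average of factor strengths, each expected in [0, 1].

    Factors missing from `strengths` count as 0; unknown factors are rejected.
    """
    unknown = set(strengths) - set(FACTOR_LEVEL)
    if unknown:
        raise KeyError(f"unknown factors: {sorted(unknown)}")
    total_weight = sum(WEIGHTS[level] for level in FACTOR_LEVEL.values())
    weighted = sum(WEIGHTS[level] * strengths.get(factor, 0.0)
                   for factor, level in FACTOR_LEVEL.items())
    return weighted / total_weight


score = evidence_score({"correlated_metrics": 0.9, "derivative": 0.8})
```

With these assumed weights, a strong signal in a High factor moves the score three times as much as the same signal in a Low factor.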
Causal Inference
InfraSage validates causality before reporting it. Two signals are considered causally related only if all three conditions hold:
- Signal A consistently precedes signal B (temporal precedence)
- The correlation is high (>0.7 by default)
- The lag between A and B is consistent across multiple incidents
This avoids false conclusions like "high error rate causes high CPU" when the true direction is reversed.
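The three conditions can be sketched with a lagged Pearson correlation. This is a minimal illustration, not InfraSage's implementation: it assumes evenly sampled series, searches only non-negative lags, and treats "consistent lag" as a maximum spread between incidents. The function names and the `max_lag` and `max_lag_spread` parameters are hypothetical; only the 0.7 threshold comes from the text above.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def best_lag(a, b, max_lag):
    """Return (lag, corr) maximising the correlation between a[t] and b[t + lag]."""
    best = (0, -1.0)
    for lag in range(max_lag + 1):
        xs = a[:len(a) - lag]
        ys = b[lag:]
        if len(xs) < 3:
            break
        corr = pearson(xs, ys)
        if corr > best[1]:
            best = (lag, corr)
    return best


def is_causal(incidents, max_lag=5, min_corr=0.7, max_lag_spread=1):
    """incidents: list of (a_series, b_series) pairs, one per past incident.

    True only if, in every incident, A leads B (lag > 0) with correlation
    above min_corr, and the lags agree across at least two incidents.
    """
    lags = []
    for a, b in incidents:
        lag, corr = best_lag(a, b, max_lag)
        if lag <= 0 or corr <= min_corr:
            return False
        lags.append(lag)
    return len(lags) >= 2 and max(lags) - min(lags) <= max_lag_spread


# Hypothetical data: in both incidents, B repeats A two samples later.
a1 = [0, 0, 1, 5, 9, 4, 1, 0, 0, 0, 0, 0]
b1 = [0, 0, 0, 0, 1, 5, 9, 4, 1, 0, 0, 0]
a2 = [1, 1, 2, 8, 3, 1, 1, 1, 1, 1]
b2 = [1, 1, 1, 1, 2, 8, 3, 1, 1, 1]

forward = is_causal([(a1, b1), (a2, b2)])   # A precedes B
reverse = is_causal([(b1, a1), (b2, a2)])   # direction swapped
```

Swapping the direction fails the precedence check, which is the kind of reversed conclusion the validation step is meant to reject.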
RCA Response Structure
{
  "anomaly_id": "anom-7f3d",
  "service_id": "payment-api",
  "analyzed_at": "2026-04-10T12:00:20Z",
  "root_cause": {
    "category": "infrastructure",
    "confidence": 0.92,
    "summary": "CPU saturation on payment-api caused by memory pressure from uncached DB queries, leading to thread pool exhaustion and elevated error rates downstream in user-service.",
    "evidence": {
      "cpu_usage_high": 0.90,
      "memory_pressure": 0.85,
      "db_query_latency_spike": 0.78
    },
    "suggested_actions": [
      "Scale payment-api horizontally from 3 to 5 pods",
      "Restart affected pods to clear memory pressure",
      "Review DB connection pool size in payment-api config"
    ],
    "historical_matches": [
      {
        "incident_id": "inc-march-14",
        "resolution": "Scaled from 3 to 5 pods; cleared in 12 minutes",
        "similarity": 0.89,
        "time_to_recovery_mins": 12
      }
    ]
  },
  "blast_radius": ["user-service", "checkout-service"],
  "causal_relationships": [
    {
      "cause_metric": "db_query_latency_ms",
      "effect_metric": "cpu_usage_percent",
      "strength": 0.84,
      "lag_minutes": 3
    }
  ]
}
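A consumer of this response might parse it into typed objects before acting on it. The sketch below uses only the field names shown in the example response; the class names, the `parse_rca` function, and the check that `confidence` lies in [0, 1] are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class CausalLink:
    cause_metric: str
    effect_metric: str
    strength: float
    lag_minutes: int


@dataclass
class RcaResult:
    anomaly_id: str
    service_id: str
    category: str
    confidence: float
    summary: str
    suggested_actions: list
    blast_radius: list
    causal_links: list


def parse_rca(raw):
    """Parse an RCA response document; raises KeyError on missing fields."""
    doc = json.loads(raw)
    root = doc["root_cause"]
    confidence = float(root["confidence"])
    if not 0.0 <= confidence <= 1.0:  # assumed range, based on the example
        raise ValueError(f"confidence out of range: {confidence}")
    links = [CausalLink(**link) for link in doc.get("causal_relationships", [])]
    return RcaResult(
        anomaly_id=doc["anomaly_id"],
        service_id=doc["service_id"],
        category=root["category"],
        confidence=confidence,
        summary=root["summary"],
        suggested_actions=list(root.get("suggested_actions", [])),
        blast_radius=list(doc.get("blast_radius", [])),
        causal_links=links,
    )


# Trimmed copy of the example response above.
sample = """{
  "anomaly_id": "anom-7f3d",
  "service_id": "payment-api",
  "root_cause": {
    "category": "infrastructure",
    "confidence": 0.92,
    "summary": "CPU saturation on payment-api",
    "suggested_actions": ["Scale payment-api horizontally from 3 to 5 pods"]
  },
  "blast_radius": ["user-service", "checkout-service"],
  "causal_relationships": [
    {"cause_metric": "db_query_latency_ms",
     "effect_metric": "cpu_usage_percent",
     "strength": 0.84,
     "lag_minutes": 3}
  ]
}"""
result = parse_rca(sample)
```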
Blast Radius
The blast radius is the set of services transitively affected by the root-cause service. InfraSage computes it from the service correlation graph stored in ClickHouse. To query it through the API:
# Query blast radius for a service
curl "http://localhost:8080/api/v1/rca/correlations?service_id=payment-api" \
  -H "Authorization: Bearer $YOUR_JWT"
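The "transitively affected" part of the definition amounts to a reachability search over the service graph. The sketch below uses a breadth-first traversal over a plain adjacency mapping; the edges and the `blast_radius` function are hypothetical, and how InfraSage derives edges from correlation data is not shown here.

```python
from collections import deque


def blast_radius(graph, root):
    """Return every service reachable from `root`, in breadth-first order.

    graph maps a service to the services it directly affects. The root
    itself is excluded, and cycles are handled by the `seen` set.
    """
    seen = {root}
    queue = deque([root])
    affected = []
    while queue:
        service = queue.popleft()
        for downstream in graph.get(service, []):
            if downstream not in seen:
                seen.add(downstream)
                affected.append(downstream)
                queue.append(downstream)
    return affected


# Hypothetical edges, including a cycle back to the root.
graph = {
    "payment-api": ["user-service"],
    "user-service": ["checkout-service"],
    "checkout-service": ["payment-api"],
}
radius = blast_radius(graph, "payment-api")
```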
Incident Memory (Human Feedback)
After resolving an incident, you can feed the resolution back to InfraSage so it improves future RCA:
curl -X POST http://localhost:9093/api/v1/resolutions/webhook \
  -H "Content-Type: application/json" \
  -d '{
    "incident_id": "inc-7f3d",
    "service_id": "payment-api",
    "resolution": "Scaled payment-api from 3 to 5 pods. Root cause was memory leak in DB connection pool introduced in v2.3.1. Fixed by upgrading to v2.3.2.",
    "resolved_by": "alice@mycompany.com",
    "time_to_recovery_minutes": 18,
    "tags": ["memory-leak", "db-pool", "payment-api"]
  }'
The resolution is stored in `infrasage_knowledge_base` and `infrasage_incident_memory`, and surfaces in future RCA runs when similar patterns are detected.
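The same feedback call can be made from a script, for example at the end of an incident runbook. This sketch builds the request with Python's standard library, using the URL and field names from the curl example; the helper names and the non-negative check on recovery time are assumptions. Nothing is sent until `send` is called.

```python
import json
import urllib.request

# URL taken from the curl example above.
WEBHOOK_URL = "http://localhost:9093/api/v1/resolutions/webhook"


def build_resolution_request(incident_id, service_id, resolution,
                             resolved_by, time_to_recovery_minutes, tags):
    """Build (but do not send) the POST request for a resolved incident."""
    if time_to_recovery_minutes < 0:  # assumed sanity check
        raise ValueError("time_to_recovery_minutes must be non-negative")
    payload = {
        "incident_id": incident_id,
        "service_id": service_id,
        "resolution": resolution,
        "resolved_by": resolved_by,
        "time_to_recovery_minutes": time_to_recovery_minutes,
        "tags": list(tags),
    }
    return urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


request = build_resolution_request(
    incident_id="inc-7f3d",
    service_id="payment-api",
    resolution="Scaled payment-api from 3 to 5 pods.",
    resolved_by="alice@mycompany.com",
    time_to_recovery_minutes=18,
    tags=["memory-leak", "db-pool", "payment-api"],
)
# send(request)  # uncomment when an InfraSage instance is reachable
```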
Viewing RCA Results
Via Grafana
Open http://localhost:3000 → RCA Results dashboard.
Via ClickHouse SQL
SELECT
    service_id,
    root_cause_category,
    root_cause_confidence,
    root_cause_summary,
    analyzed_at
FROM infrasage.infrasage_rca_results
WHERE analyzed_at > now() - INTERVAL 24 HOUR
ORDER BY analyzed_at DESC
LIMIT 20;
Via MCP (AI Agent Integration)
The RCA MCP Server exposes RCA results to Claude-powered AI agents via the Model Context Protocol. See MCP Server docs for details.
Configuration
| Variable | Default | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | — | Required for LLM RCA |
| `ANTHROPIC_MODEL` | `claude-opus-4-6` | Claude model |
| `WATCHDOG_RCA_COOLDOWN_MINUTES` | 15 | Minimum gap between RCA runs per service/metric |
| `VECTOR_HNSW_M` | 16 | HNSW connectivity for incident similarity search |
| `VECTORIZER_INTERVAL_SECONDS` | 60 | How often to rebuild the vector index |
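
To make the defaults and the cooldown semantics concrete, the sketch below reads these variables from the environment and checks whether enough time has passed since the last RCA run for a service/metric pair. The variable names and defaults come from the table; the `RcaConfig` class, `load_config`, and `cooldown_elapsed` are hypothetical helpers, not part of InfraSage.

```python
import os
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RcaConfig:
    anthropic_api_key: str
    anthropic_model: str
    cooldown_minutes: int
    hnsw_m: int
    vectorizer_interval_seconds: int


def load_config(env=None):
    """Read RCA settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise RuntimeError("ANTHROPIC_API_KEY is required for LLM RCA")
    return RcaConfig(
        anthropic_api_key=api_key,
        anthropic_model=env.get("ANTHROPIC_MODEL", "claude-opus-4-6"),
        cooldown_minutes=int(env.get("WATCHDOG_RCA_COOLDOWN_MINUTES", "15")),
        hnsw_m=int(env.get("VECTOR_HNSW_M", "16")),
        vectorizer_interval_seconds=int(
            env.get("VECTORIZER_INTERVAL_SECONDS", "60")),
    )


def cooldown_elapsed(last_run, now, cooldown_minutes):
    """True if no RCA has run yet, or the minimum gap has passed."""
    if last_run is None:
        return True
    return now - last_run >= timedelta(minutes=cooldown_minutes)


# Placeholder key for illustration only.
config = load_config({"ANTHROPIC_API_KEY": "example-key"})
last = datetime(2026, 4, 10, 12, 0)
can_run_early = cooldown_elapsed(last, last + timedelta(minutes=10),
                                 config.cooldown_minutes)
can_run_later = cooldown_elapsed(last, last + timedelta(minutes=15),
                                 config.cooldown_minutes)
```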