Causal Anomaly Detection (CIAD)
Causal Anomaly Detection (CIAD) is InfraSage's pre-fault detection system. While the standard watchdog asks "Is this metric value unusual?", CIAD asks a fundamentally different question: "Have the causal relationships between this service's signals broken down?"
In production, incidents rarely begin with a single metric crossing a threshold. They begin with a structural shift: throughput stops predicting latency, or database health stops predicting error rate, several minutes before any individual metric becomes anomalous. CIAD detects these structural breakdowns, typically providing 3–10 minutes of early warning before a full incident materializes.
How It Works
Embedding dimensions (18-D vector per service per minute)
│
▼
Transfer Entropy computation
(measures: does X causally precede Y?)
│
▼
Compare observed TE vs. expected TE
from active causal invariants
│
▼
Causal Score = weighted violation fraction
│
├─ score > 0.40 → Pre-Fault flag raised
└─ score ≤ 0.40 → Normal
CIAD runs as a leader-gated background worker, scoring all active services every 2 minutes. Scores are persisted to ClickHouse and exposed via API and the Watchdog UI.
Transfer Entropy
The causal measure used is Transfer Entropy (TE):
T(X→Y) = Σ p(y', y, x_lag) × log₂[ p(y' | y, x_lag) / p(y' | y) ]
This quantifies how much knowing the history of X reduces uncertainty about the future of Y — above and beyond what Y's own history already tells you.
Implementation details:
- Time series discretized into 3 bins (low / mid / high) using service-specific 33rd and 67th percentiles
- Laplace smoothing (+1 per bin) prevents log-of-zero for sparse signals
- Lags of 1, 2, and 5 minutes are each tested; the maximum TE across lags is used
- Pure Go implementation, ~5 ms per service at runtime
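The estimator described above can be sketched in a few lines. This is a minimal Python illustration, not the Go implementation: the function names, the nearest-rank percentile rule, and the choice of one minute of Y's own history as the conditioning term are all assumptions made for the example.

```python
import math
from collections import Counter

LAGS = (1, 2, 5)  # minutes, as listed above
BINS = 3          # low / mid / high


def discretize(series):
    """Map values to bins 0/1/2 using the series' own 33rd and 67th percentiles."""
    ordered = sorted(series)
    last = len(ordered) - 1
    lo = ordered[int(0.33 * last)]  # nearest-rank percentile (assumption)
    hi = ordered[int(0.67 * last)]
    return [0 if v <= lo else 1 if v <= hi else 2 for v in series]


def transfer_entropy(x, y, lag):
    """Estimate T(X -> Y) in bits for one lag, adding +1 to every joint cell."""
    xb, yb = discretize(x), discretize(y)
    counts = Counter()
    for t in range(max(lag, 1), len(yb)):
        counts[(yb[t], yb[t - 1], xb[t - lag])] += 1  # (y', y, x_lag)

    total = sum(counts.values()) + BINS ** 3
    cells = [(a, b, c) for a in range(BINS) for b in range(BINS) for c in range(BINS)]
    joint = {cell: (counts[cell] + 1) / total for cell in cells}

    # Marginals are derived from the smoothed joint so the conditionals stay consistent.
    p_y_x, p_yn_y, p_y = Counter(), Counter(), Counter()
    for (a, b, c), p in joint.items():
        p_y_x[(b, c)] += p
        p_yn_y[(a, b)] += p
        p_y[b] += p

    # p(y'|y,x) / p(y'|y) == p(y',y,x) * p(y) / (p(y,x) * p(y',y))
    return sum(
        p * math.log2(p * p_y[b] / (p_y_x[(b, c)] * p_yn_y[(a, b)]))
        for (a, b, c), p in joint.items()
    )


def max_transfer_entropy(x, y, lags=LAGS):
    """Return the largest TE across the tested lags."""
    return max(transfer_entropy(x, y, lag) for lag in lags)
```

Smoothing the joint distribution and deriving every marginal from it keeps the estimate non-negative, because the result is then a proper conditional mutual information. For a series Y that simply copies X two minutes later, `max_transfer_entropy(x, y)` is large while the reverse direction stays near zero.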
Causal Dimensions
CIAD operates on 13 of the 18 embedding dimensions. The 5 temporal dimensions (sin/cos time-of-day and day-of-week encodings, dims 8–12) are excluded because they are deterministic and carry no causal information.
| Dim | Signal | Description |
|---|---|---|
| 0 | avg_value | Average normalized metric value |
| 1 | max_value | Peak metric value |
| 2 | error_rate | Error fraction |
| 3 | warn_rate | Warning-level event fraction |
| 4 | log_ratio | Log event rate relative to baseline |
| 5 | trace_ratio | Trace span rate relative to baseline |
| 6 | inbound_edges | Normalized inbound traffic |
| 7 | outbound_edges | Normalized outbound traffic |
| 13 | p99_latency | 99th-percentile latency |
| 14 | http_error_rate | HTTP 4xx/5xx normalized rate |
| 15 | queue_depth_peak | Queue depth normalized |
| 16 | db_health | Database health score normalized |
| 17 | fatal_rate | Fatal/critical log event rate |
Causal Invariants
A causal invariant is a causal relationship that holds across the fleet. For example: "throughput (dim 0) consistently predicts p99 latency (dim 13) with a 2-minute lag across 80% of fleet services".
CIAD maintains two types of invariants:
Seed Invariants
Seven universal relationships hardcoded from domain knowledge. These are active from day one, before enough fleet data exists for statistical discovery:
| Invariant | Cause | Effect | Expected TE | Lag |
|---|---|---|---|---|
| d0_d13_lag2 | avg_value | p99_latency | 0.15 | 2 min |
| d13_d2_lag2 | p99_latency | error_rate | 0.12 | 2 min |
| d16_d13_lag1 | db_health | p99_latency | 0.18 | 1 min |
| d15_d0_lag3 | queue_depth_peak | avg_value | 0.10 | 3 min |
| d2_d17_lag2 | error_rate | fatal_rate | 0.14 | 2 min |
| d0_d16_lag3 | avg_value | db_health | 0.11 | 3 min |
| d14_d2_lag1 | http_error_rate | error_rate | 0.20 | 1 min |
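The invariant names in the table appear to encode the cause dimension, effect dimension, and lag. The following sketch decodes them, assuming the pattern d<cause>_d<effect>_lag<minutes> inferred from the table itself. The helper name `parse_invariant_id` and the `SEEDS` and `SIGNAL_DIMS` structures are hypothetical and exist only for this example.

```python
import re

# Dimension indices copied from the Causal Dimensions table.
SIGNAL_DIMS = {
    "avg_value": 0, "error_rate": 2, "p99_latency": 13, "http_error_rate": 14,
    "queue_depth_peak": 15, "db_health": 16, "fatal_rate": 17,
}

# (invariant id, cause, effect, expected TE, lag in minutes), copied from the seed table.
SEEDS = [
    ("d0_d13_lag2", "avg_value", "p99_latency", 0.15, 2),
    ("d13_d2_lag2", "p99_latency", "error_rate", 0.12, 2),
    ("d16_d13_lag1", "db_health", "p99_latency", 0.18, 1),
    ("d15_d0_lag3", "queue_depth_peak", "avg_value", 0.10, 3),
    ("d2_d17_lag2", "error_rate", "fatal_rate", 0.14, 2),
    ("d0_d16_lag3", "avg_value", "db_health", 0.11, 3),
    ("d14_d2_lag1", "http_error_rate", "error_rate", 0.20, 1),
]

_ID_PATTERN = re.compile(r"^d(\d+)_d(\d+)_lag(\d+)$")  # inferred naming scheme


def parse_invariant_id(invariant_id):
    """Split an id such as 'd0_d13_lag2' into (cause_dim, effect_dim, lag_minutes)."""
    match = _ID_PATTERN.match(invariant_id)
    if match is None:
        raise ValueError(f"unrecognized invariant id: {invariant_id!r}")
    return tuple(int(group) for group in match.groups())
```

Under that assumption, every seed id agrees with its Cause, Effect, and Lag columns, which makes the parser a cheap consistency check when invariants are added or edited by hand.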
Fleet-Discovered Invariants
Once ≥ 10 services each have ≥ 14 days of embedding history, CIAD runs nightly fleet discovery (daily at UTC 00:05). It evaluates all 156 possible causal pairs across the fleet and emits invariants where:
- Prevalence ≥ 60% — the relationship holds in at least 60% of qualifying services
- Median TE ≥ 0.05 — the relationship is statistically meaningful
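The discovery filter can be sketched as follows. The 156 candidates are the ordered pairs of the 13 causal dimensions (13 × 12). Several details here are assumptions rather than documented behavior: a relationship is taken to "hold" in a service when its TE reaches a hypothetical `holds_te` floor, the median is taken over all qualifying services, and the median is reused as the expected TE. The input is assumed to contain only services that already meet the 14-day history requirement.

```python
from itertools import permutations
from statistics import median

# 18 embedding dims minus the temporal dims 8-12 leaves 13 causal dims.
CAUSAL_DIMS = [d for d in range(18) if not 8 <= d <= 12]
CANDIDATE_PAIRS = list(permutations(CAUSAL_DIMS, 2))  # ordered (cause, effect) pairs


def discover_invariants(te_by_service, holds_te=0.05, min_services=10,
                        min_prevalence=0.60, min_median_te=0.05):
    """te_by_service maps service_id -> {(cause_dim, effect_dim): TE}."""
    if len(te_by_service) < min_services:
        return []  # not enough qualifying services yet
    found = []
    for cause, effect in CANDIDATE_PAIRS:
        values = [te.get((cause, effect), 0.0) for te in te_by_service.values()]
        prevalence = sum(v >= holds_te for v in values) / len(values)
        med = median(values)
        if prevalence >= min_prevalence and med >= min_median_te:
            found.append({
                "cause_dim": cause,
                "effect_dim": effect,
                "expected_te": med,      # assumption: the median becomes the baseline
                "prevalence": prevalence,
                "is_seed": False,
            })
    return found
```

With ten services, a pair observed at TE 0.2 in seven of them passes both gates, while a pair observed in only five is dropped for falling below 60% prevalence.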
Fleet-discovered invariants are stored in ClickHouse alongside seeds, with an is_seed = false flag. They automatically replace seeds with more accurate, fleet-specific baselines over time.
Causal Score
For each service, CIAD computes a score from 0 to 1:
    For each active invariant:
        observedTE = max(TE at lag 1, 2, 5 min)
        if observedTE < 0.5 × expectedTE:
            violation = (expectedTE − observedTE) / expectedTE
        else:
            violation = 0

    CausalScore = Σ (invariant.weight × violation) / Σ invariant.weight
A violation only fires when the observed TE drops below 50% of the expected value. Minor natural drift (e.g., observed = 65% of expected) is not flagged, preventing false positives.
invariant.weight = expectedTE × prevalence — more prevalent and stronger invariants have proportionally more influence.
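The scoring rule translates directly into code. The sketch below is a Python illustration of the formula above, not the production scorer. The dictionary keys and function names are assumptions, and the prevalence values in the usage note are hypothetical.

```python
PRE_FAULT_THRESHOLD = 0.40


def violation(observed_te, expected_te):
    """Relative TE shortfall, counted only once observed TE falls below half of expected."""
    if observed_te < 0.5 * expected_te:
        return (expected_te - observed_te) / expected_te
    return 0.0


def causal_score(invariants, observed):
    """Weighted violation fraction.

    invariants: list of dicts with 'id', 'expected_te', and 'prevalence'.
    observed:   dict mapping invariant id -> max observed TE across lags.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for inv in invariants:
        weight = inv["expected_te"] * inv["prevalence"]
        weighted_sum += weight * violation(observed[inv["id"]], inv["expected_te"])
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0


def is_pre_fault(score):
    return score > PRE_FAULT_THRESHOLD
```

As a worked example, take two invariants with expected TE 0.15 and 0.20 and hypothetical prevalences 0.8 and 0.9, giving weights 0.12 and 0.18. If the first is observed at 0.03 (violation 0.8) and the second at 0.13 (65% of expected, so no violation), the score is 0.096 / 0.30 = 0.32, which stays below the pre-fault threshold. Because a violation only registers below the 50% mark, any nonzero violation is always greater than 0.5.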
Pre-Fault Threshold
| Score | Status |
|---|---|
| ≤ 0.40 | Normal: causal structure intact |
| > 0.40 | Pre-Fault: causal relationships breaking down |
The threshold of 0.40 is set lower than the weirdness-score anomaly threshold (0.65), which is what gives CIAD its head start.
Viewing CIAD Results
Via the UI
Causal Detection page (sidebar) shows:
- Active invariant count and fleet coverage
- Invariants table with cause→effect chain, expected TE, lag, and prevalence bar
- Service causal scores sorted by score descending, with Pre-Fault badges
- Trigger Discovery button to manually kick off fleet discovery
Watchdog page shows an amber ⚡ Causal Pre-Fault banner on any incident panel where the pre-fault flag is active, plus a Causal score column in the Elevated Services table.
Service Detail page shows a 120-minute causal score history chart with the 0.40 threshold line marked.
Via API
# Active invariants
curl $INFRASAGE_URL/api/v1/ciad/invariants \
-H "Authorization: Bearer $TOKEN"
# Latest causal scores (all services)
curl $INFRASAGE_URL/api/v1/ciad/scores \
-H "Authorization: Bearer $TOKEN"
# Historical scores for one service (last 60 min)
curl "$INFRASAGE_URL/api/v1/services/payment-api/causal-history?minutes=60" \
-H "Authorization: Bearer $TOKEN"
# Trigger fleet discovery manually (async, returns 202)
curl -X POST $INFRASAGE_URL/api/v1/ciad/discover \
-H "Authorization: Bearer $TOKEN"
Example causal score response:
{
"service_id": "payment-api",
"window_timestamp": "2026-05-09T04:28:00Z",
"causal_score": 0.61,
"violation_count": 4,
"total_invariants": 7,
"pre_fault": true
}
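A client consuming this response might reduce it to a one-line status. The sketch below parses a payload shaped like the example above. It assumes only the six fields shown there, and the `summarize_causal_score` helper is hypothetical.

```python
import json


def summarize_causal_score(payload):
    """Turn one causal score record (a JSON string) into a short status line."""
    record = json.loads(payload)
    status = "pre-fault" if record["pre_fault"] else "normal"
    return (
        f"{record['service_id']}: {status}, "
        f"score {record['causal_score']:.2f}, "
        f"{record['violation_count']}/{record['total_invariants']} invariants violated"
    )
```

Applied to the example response, this yields a line reporting payment-api as pre-fault with 4 of 7 invariants violated.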
Via ClickHouse SQL
-- Recent pre-fault events
SELECT service_id, window_timestamp, causal_score, violation_count
FROM infrasage_causal_anomaly_scores
WHERE pre_fault = 1
AND window_timestamp >= now() - INTERVAL 24 HOUR
ORDER BY window_timestamp DESC;
-- Active invariants
SELECT invariant_id, cause_dim, effect_dim, expected_te, lag_minutes,
prevalence, is_seed, weight
FROM infrasage_causal_invariants FINAL
ORDER BY weight DESC;
Relationship to the Standard Watchdog
CIAD and the standard anomaly watchdog are independent, parallel detection signals. They never modify each other's scores:
| | Standard Watchdog | CIAD |
|---|---|---|
| Question asked | Is this metric value unusual? | Has the causal structure broken down? |
| Signal type | Univariate (per-metric Z-score) | Multivariate (cross-signal causal graph) |
| Detection threshold | Weirdness score > 0.65 | Causal score > 0.40 |
| Typical lead time | Fires at or after incident onset | Fires 3–10 min before incident onset |
| False positive risk | Low (3-sigma baseline) | Low (50% TE drop required for violation) |
| Cold start | Requires ~24h of baseline data | Requires ~14 days of embedding history |
When both systems flag the same service, confidence that the incident is real is high. A CIAD-only pre-fault flag is an early warning to investigate before a full incident develops.
Operational Notes
- CIAD is leader-gated — only the elected leader pod runs the 2-minute scoring loop. If the leader changes, scoring resumes within one election cycle (~30 s).
- Fleet discovery runs as a standalone goroutine (not leader-gated), so all pods schedule it independently. Because the invariant table uses ReplacingMergeTree, duplicate writes from multiple pods are harmless.
- If CIAD is disabled (e.g., schema migration not applied), all CIAD API endpoints return {"error":"ciad_disabled"} with HTTP 503. The watchdog and TQS continue operating normally.
- The pre-fault score is included in the existing /api/v1/watchdog/summary response as causal_score and causal_pre_fault fields on each service snapshot (omitted when not available).
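Clients should tolerate both the disabled state and the optional summary fields. The sketch below shows one defensive way to do that in Python. It relies only on the error body, status code, and field names stated above, and the helper names are hypothetical.

```python
import json


def ciad_disabled(status_code, body):
    """True when a CIAD endpoint reports the disabled state (HTTP 503 + ciad_disabled)."""
    if status_code != 503:
        return False
    try:
        return json.loads(body).get("error") == "ciad_disabled"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object: treat as some other 503.
        return False


def causal_fields(snapshot):
    """Read the optional CIAD fields from one service snapshot in the watchdog summary.

    Both values are None when the fields were omitted.
    """
    return snapshot.get("causal_score"), snapshot.get("causal_pre_fault")
```

Checking the error body as well as the status code separates a deliberately disabled CIAD from an unrelated 503, such as a proxy or load balancer failure.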
Telemetry Quality and CIAD
CIAD requires at least 14 days of 18-dimensional embedding history at ≥ 50% fill rate. The Telemetry Quality Score Causal Readiness dimension (0–10) reflects this readiness directly — a service scoring 10/10 has sufficient history for CIAD to operate with all available invariants.
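The history requirement can be checked with simple arithmetic. The sketch below assumes one embedding per service per minute and interprets fill rate as the share of per-minute slots that actually contain an embedding. Both the interpretation and the `ciad_ready` helper are assumptions, and the mapping onto the 0–10 Causal Readiness dimension is not modeled.

```python
MINUTES_PER_DAY = 24 * 60


def ciad_ready(history_days, embeddings_present, min_days=14, min_fill_rate=0.50):
    """Check the 14-day / 50% fill-rate requirement for one service.

    history_days:       days between the first and latest embedding.
    embeddings_present: number of per-minute embeddings stored in that span.
    """
    if history_days < min_days:
        return False
    expected = history_days * MINUTES_PER_DAY
    return embeddings_present / expected >= min_fill_rate
```

Fourteen days contain 20,160 one-minute slots, so under this reading a service needs at least 10,080 stored embeddings to qualify.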
Related
- Telemetry Quality Scoring — data quality system that CIAD readiness is part of
- Anomaly Detection — the parallel statistical and ML watchdog
- Root Cause Analysis — triggered by confirmed incidents from either system