
Causal Anomaly Detection (CIAD)

Causal Anomaly Detection (CIAD) is InfraSage's pre-fault detection system. While the standard watchdog asks "Is this metric value unusual?", CIAD asks a fundamentally different question: "Have the causal relationships between this service's signals broken down?"

In production, incidents rarely begin with a single metric crossing a threshold. They begin with a structural shift (throughput stops predicting latency, database health stops predicting error rate) several minutes before any individual metric becomes anomalous. CIAD detects these structural breakdowns, typically providing 3–10 minutes of early warning before a full incident materializes.


How It Works

Embedding dimensions (18-D vector per service per minute)
        │
        ▼
Transfer Entropy computation
(measures: does X causally precede Y?)
        │
        ▼
Compare observed TE vs. expected TE
from active causal invariants
        │
        ▼
Causal Score = weighted violation fraction
        │
        ├─ score > 0.40 → Pre-Fault flag raised
        └─ score ≤ 0.40 → Normal

CIAD runs as a leader-gated background worker, scoring all active services every 2 minutes. Scores are persisted to ClickHouse and exposed via API and the Watchdog UI.


Transfer Entropy

The causal measure used is Transfer Entropy (TE):

T(X→Y) = Σ p(y', y, x_lag) × log₂[ p(y' | y, x_lag) / p(y' | y) ]

This quantifies how much knowing the history of X reduces uncertainty about the future of Y — above and beyond what Y's own history already tells you.

Implementation details:

  • Time series discretized into 3 bins (low / mid / high) using service-specific 33rd and 67th percentiles
  • Laplace smoothing (+1 per bin) prevents log-of-zero for sparse signals
  • Lags of 1, 2, and 5 minutes are each tested; the maximum TE across lags is used
  • Pure Go implementation, ~5 ms per service at runtime
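The steps in the list above can be sketched in a few lines. This is an illustrative Python sketch, not the production Go code. The names `discretize`, `transfer_entropy`, and `max_transfer_entropy` are hypothetical. The nearest-rank percentile rule, the one-sample-per-minute assumption, and the choice of `x[t + 1 - lag]` as the lagged cause are all assumptions.

```python
import math

LAGS_MINUTES = (1, 2, 5)  # lags tested; assumes one sample per minute
BINS = 3                  # low / mid / high


def discretize(series):
    """Bin each value as 0 (low), 1 (mid), or 2 (high) using the series'
    own 33rd and 67th percentiles (nearest-rank approximation, assumed)."""
    ordered = sorted(series)
    last = len(ordered) - 1
    low_cut = ordered[int(0.33 * last)]
    high_cut = ordered[int(0.67 * last)]
    return [0 if v <= low_cut else 1 if v <= high_cut else 2 for v in series]


def transfer_entropy(x_bins, y_bins, lag):
    """TE(X→Y) in bits for one lag, with +1 Laplace smoothing per joint bin."""
    # counts[(y_next, y_now, x_lagged)]; every cell starts at 1 (Laplace).
    counts = {(a, b, c): 1.0
              for a in range(BINS) for b in range(BINS) for c in range(BINS)}
    for t in range(lag - 1, len(y_bins) - 1):
        counts[(y_bins[t + 1], y_bins[t], x_bins[t + 1 - lag])] += 1.0

    total = sum(counts.values())
    n_yx, n_yy, n_y = {}, {}, {}  # marginals of the smoothed joint counts
    for (a, b, c), n in counts.items():
        n_yx[(b, c)] = n_yx.get((b, c), 0.0) + n
        n_yy[(a, b)] = n_yy.get((a, b), 0.0) + n
        n_y[b] = n_y.get(b, 0.0) + n

    te = 0.0
    for (a, b, c), n in counts.items():
        p_given_both = n / n_yx[(b, c)]       # p(y' | y, x_lag)
        p_given_self = n_yy[(a, b)] / n_y[b]  # p(y' | y)
        te += (n / total) * math.log2(p_given_both / p_given_self)
    return te


def max_transfer_entropy(x, y, lags=LAGS_MINUTES):
    """Discretize both series, then take the maximum TE across the lags."""
    x_bins, y_bins = discretize(x), discretize(y)
    return max(transfer_entropy(x_bins, y_bins, lag) for lag in lags)
```

Both series must be non-empty, equal in length, and longer than the largest lag. Because the marginals are derived from the same smoothed joint counts, the estimate is never negative.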

Causal Dimensions

CIAD operates on 13 of the 18 embedding dimensions. The 5 temporal dimensions (sin/cos time-of-day and day-of-week encodings, dims 8–12) are excluded because they are deterministic and carry no causal information.

| Dim | Signal | Description |
|-----|--------|-------------|
| 0 | avg_value | Average normalized metric value |
| 1 | max_value | Peak metric value |
| 2 | error_rate | Error fraction |
| 3 | warn_rate | Warning-level event fraction |
| 4 | log_ratio | Log event rate relative to baseline |
| 5 | trace_ratio | Trace span rate relative to baseline |
| 6 | inbound_edges | Normalized inbound traffic |
| 7 | outbound_edges | Normalized outbound traffic |
| 13 | p99_latency | 99th-percentile latency |
| 14 | http_error_rate | HTTP 4xx/5xx normalized rate |
| 15 | queue_depth_peak | Queue depth normalized |
| 16 | db_health | Database health score normalized |
| 17 | fatal_rate | Fatal/critical log event rate |
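As a small illustration of the dimension selection described above, the following sketch projects an 18-D embedding onto the 13 causal signals. The names `CAUSAL_SIGNALS`, `CAUSAL_DIMS`, and `causal_view` are hypothetical. The embedding is assumed to be a plain sequence indexed by dimension number.

```python
# Dimension index -> signal name, copied from the table above.
CAUSAL_SIGNALS = {
    0: "avg_value", 1: "max_value", 2: "error_rate", 3: "warn_rate",
    4: "log_ratio", 5: "trace_ratio", 6: "inbound_edges", 7: "outbound_edges",
    13: "p99_latency", 14: "http_error_rate", 15: "queue_depth_peak",
    16: "db_health", 17: "fatal_rate",
}
CAUSAL_DIMS = sorted(CAUSAL_SIGNALS)  # dims 8-12 (temporal) are absent


def causal_view(embedding):
    """Return {signal_name: value} for the 13 causal dimensions only."""
    if len(embedding) != 18:
        raise ValueError("expected an 18-dimensional embedding")
    return {name: embedding[dim] for dim, name in CAUSAL_SIGNALS.items()}
```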

Causal Invariants

A causal invariant is a causal relationship that holds across the fleet. For example: "throughput (dim 0) consistently predicts p99 latency (dim 13) with a 2-minute lag across 80% of fleet services".

CIAD maintains two types of invariants:

Seed Invariants

Seven universal relationships hardcoded from domain knowledge. These are active from day one, before enough fleet data exists for statistical discovery:

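| Invariant | Cause | Effect | Expected TE | Lag |
|-----------|-------|--------|-------------|-----|
| d0_d13_lag2 | avg_value | p99_latency | 0.15 | 2 min |
| d13_d2_lag2 | p99_latency | error_rate | 0.12 | 2 min |
| d16_d13_lag1 | db_health | p99_latency | 0.18 | 1 min |
| d15_d0_lag3 | queue_depth_peak | avg_value | 0.10 | 3 min |
| d2_d17_lag2 | error_rate | fatal_rate | 0.14 | 2 min |
| d0_d16_lag3 | avg_value | db_health | 0.11 | 3 min |
| d14_d2_lag1 | http_error_rate | error_rate | 0.20 | 1 min |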

Fleet-Discovered Invariants

Once ≥ 10 services each have ≥ 14 days of embedding history, CIAD runs fleet discovery nightly at 00:05 UTC. It evaluates all 156 ordered cause→effect pairs among the 13 causal dimensions (13 × 12) across the fleet and emits invariants where:

  • Prevalence ≥ 60% — the relationship holds in at least 60% of qualifying services
  • Median TE ≥ 0.05 — the relationship is statistically meaningful

Fleet-discovered invariants are stored in ClickHouse alongside seeds, with an is_seed = false flag. Over time they automatically replace seeds with more accurate, fleet-specific baselines.
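The discovery criteria can be sketched as a filter over per-service TE values. This is a hypothetical sketch, not the actual discovery job. The function name `discover_invariants`, the input shape, and the output keys are illustrative. It also assumes that a relationship "holds" in a service when its TE is at least 0.05, that the median is taken over all qualifying services, and that the median becomes the expected TE. The source does not specify any of these.

```python
from itertools import permutations
from statistics import median

CAUSAL_DIMS = [0, 1, 2, 3, 4, 5, 6, 7, 13, 14, 15, 16, 17]


def discover_invariants(te_by_service, min_services=10,
                        min_prevalence=0.60, min_median_te=0.05,
                        holds_at=0.05):
    """te_by_service: {service_id: {(cause_dim, effect_dim): te}} for
    services that already have >= 14 days of embedding history."""
    if len(te_by_service) < min_services:
        return []  # not enough qualifying services yet

    found = []
    for cause, effect in permutations(CAUSAL_DIMS, 2):  # 156 ordered pairs
        values = [te.get((cause, effect), 0.0)
                  for te in te_by_service.values()]
        prevalence = sum(v >= holds_at for v in values) / len(values)
        median_te = median(values)
        if prevalence >= min_prevalence and median_te >= min_median_te:
            found.append({
                "cause_dim": cause,
                "effect_dim": effect,
                "expected_te": median_te,          # assumption
                "prevalence": prevalence,
                "weight": median_te * prevalence,  # expectedTE x prevalence
                "is_seed": False,
            })
    return found
```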


Causal Score

For each service, CIAD computes a score from 0 to 1:

For each active invariant:
    observedTE = max(TE at lag 1, 2, 5 min)
    if observedTE < 0.5 × expectedTE:
        violation = (expectedTE − observedTE) / expectedTE
    else:
        violation = 0

CausalScore = Σ (invariant.weight × violation) / Σ invariant.weight

A violation only fires when the observed TE drops below 50% of the expected value. Minor natural drift (e.g., observed = 65% of expected) is not flagged, preventing false positives.

invariant.weight = expectedTE × prevalence — more prevalent and stronger invariants have proportionally more influence.
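A worked sketch of the scoring rule, using the seed invariants' expected TE values. The function names and the dictionary shapes are hypothetical. The source does not state a prevalence for seed invariants, so a prevalence of 1.0 is assumed here purely for illustration.

```python
def violation(observed_te, expected_te):
    """Relative TE shortfall, counted only below 50% of the expected value."""
    if observed_te < 0.5 * expected_te:
        return (expected_te - observed_te) / expected_te
    return 0.0


def causal_score(invariants, observed):
    """invariants: {id: (expected_te, prevalence)}; observed: {id: max TE}."""
    weighted_sum = 0.0
    weight_total = 0.0
    for inv_id, (expected_te, prevalence) in invariants.items():
        weight = expected_te * prevalence
        weighted_sum += weight * violation(observed[inv_id], expected_te)
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0


# Seed invariants with an assumed prevalence of 1.0.
SEEDS = {
    "d0_d13_lag2": (0.15, 1.0), "d13_d2_lag2": (0.12, 1.0),
    "d16_d13_lag1": (0.18, 1.0), "d15_d0_lag3": (0.10, 1.0),
    "d2_d17_lag2": (0.14, 1.0), "d0_d16_lag3": (0.11, 1.0),
    "d14_d2_lag1": (0.20, 1.0),
}

# Healthy service: every observed TE matches its expected TE.
observed = {inv_id: expected for inv_id, (expected, _) in SEEDS.items()}
# Two relationships break down.
observed["d14_d2_lag1"] = 0.0     # violation = 1.00
observed["d16_d13_lag1"] = 0.045  # 25% of expected, violation = 0.75
score = causal_score(SEEDS, observed)  # (0.20*1.00 + 0.18*0.75) / 1.00
```

Because a violation is only counted when the observed TE is below half the expected value, every counted violation is greater than 0.5. An invariant therefore contributes either nothing or more than half of its weight.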

Pre-Fault Threshold

| Score | Status |
|-------|--------|
| 0.00–0.40 | Normal: causal structure intact |
| > 0.40, up to 1.00 | Pre-Fault: causal relationships breaking down |

The threshold of 0.40 is set lower than the weirdness-score anomaly threshold (0.65), which is what gives CIAD its head start.


Viewing CIAD Results

Via the UI

Causal Detection page (sidebar) shows:

  • Active invariant count and fleet coverage
  • Invariants table with cause→effect chain, expected TE, lag, and prevalence bar
  • Service causal scores sorted by score descending, with Pre-Fault badges
  • Trigger Discovery button to manually kick off fleet discovery

Watchdog page shows an amber ⚡ Causal Pre-Fault banner on any incident panel where the pre-fault flag is active, plus a Causal score column in the Elevated Services table.

Service Detail page shows a 120-minute causal score history chart with the 0.40 threshold line marked.

Via API

# Active invariants
curl "$INFRASAGE_URL/api/v1/ciad/invariants" \
  -H "Authorization: Bearer $TOKEN"

# Latest causal scores (all services)
curl "$INFRASAGE_URL/api/v1/ciad/scores" \
  -H "Authorization: Bearer $TOKEN"

# Historical scores for one service (last 60 min)
curl "$INFRASAGE_URL/api/v1/services/payment-api/causal-history?minutes=60" \
  -H "Authorization: Bearer $TOKEN"

# Trigger fleet discovery manually (async, returns 202)
curl -X POST "$INFRASAGE_URL/api/v1/ciad/discover" \
  -H "Authorization: Bearer $TOKEN"

Example causal score response:

{
  "service_id": "payment-api",
  "window_timestamp": "2026-05-09T04:28:00Z",
  "causal_score": 0.61,
  "violation_count": 4,
  "total_invariants": 7,
  "pre_fault": true
}
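A client might decode a score record like the one above into a typed value. The sketch below uses only the fields shown in the example response. The `CausalScore` class and `parse_causal_score` function are hypothetical names, not part of any InfraSage SDK.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CausalScore:
    service_id: str
    window_timestamp: datetime
    causal_score: float
    violation_count: int
    total_invariants: int
    pre_fault: bool


def parse_causal_score(raw: str) -> CausalScore:
    """Decode one causal score record from its JSON text."""
    data = json.loads(raw)
    # Rewrite the trailing "Z" so fromisoformat accepts it on older Pythons.
    stamp = data["window_timestamp"].replace("Z", "+00:00")
    return CausalScore(
        service_id=data["service_id"],
        window_timestamp=datetime.fromisoformat(stamp),
        causal_score=float(data["causal_score"]),
        violation_count=int(data["violation_count"]),
        total_invariants=int(data["total_invariants"]),
        pre_fault=bool(data["pre_fault"]),
    )
```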

Via ClickHouse SQL

-- Recent pre-fault events
SELECT service_id, window_timestamp, causal_score, violation_count
FROM infrasage_causal_anomaly_scores
WHERE pre_fault = 1
  AND window_timestamp >= now() - INTERVAL 24 HOUR
ORDER BY window_timestamp DESC;

-- Active invariants
SELECT invariant_id, cause_dim, effect_dim, expected_te, lag_minutes,
       prevalence, is_seed, weight
FROM infrasage_causal_invariants FINAL
ORDER BY weight DESC;

Relationship to the Standard Watchdog

CIAD and the standard anomaly watchdog are independent, parallel detection signals. They never modify each other's scores:

| | Standard Watchdog | CIAD |
|---|---|---|
| Question asked | Is this metric value unusual? | Has the causal structure broken down? |
| Signal type | Univariate (per-metric Z-score) | Multivariate (cross-signal causal graph) |
| Detection threshold | Weirdness score > 0.65 | Causal score > 0.40 |
| Typical lead time | Fires at or after incident onset | Fires 3–10 min before incident onset |
| False positive risk | Low (3-sigma baseline) | Low (50% TE drop required for violation) |
| Cold start | Requires ~24h of baseline data | Requires ~14 days of embedding history |

An incident that triggers both systems means high confidence. A CIAD-only pre-fault flag is an early warning to investigate before a full incident develops.
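One way a consumer could combine the two independent signals is sketched below, using the thresholds from the comparison above. The `triage` function and its return labels are hypothetical and are not values returned by InfraSage.

```python
WEIRDNESS_THRESHOLD = 0.65  # standard watchdog
CAUSAL_THRESHOLD = 0.40     # CIAD pre-fault


def triage(weirdness_score, causal_score=None):
    """Classify a service from both signals; causal_score may be absent."""
    anomalous = weirdness_score > WEIRDNESS_THRESHOLD
    pre_fault = causal_score is not None and causal_score > CAUSAL_THRESHOLD
    if anomalous and pre_fault:
        return "high-confidence incident"
    if pre_fault:
        return "early warning"
    if anomalous:
        return "metric anomaly"
    return "normal"
```

The two scores are only compared against their own thresholds and never blended, which mirrors the independence of the two detectors.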


Operational Notes

  • CIAD is leader-gated — only the elected leader pod runs the 2-minute scoring loop. If the leader changes, scoring resumes within one election cycle (~30 s).
  • Fleet discovery runs as a standalone goroutine (not leader-gated) — all pods schedule it independently. Because the invariant table uses ReplacingMergeTree, duplicate writes from multiple pods are harmless.
  • If CIAD is disabled (e.g., schema migration not applied), all CIAD API endpoints return {"error":"ciad_disabled"} with HTTP 503. The watchdog and TQS continue operating normally.
  • The pre-fault score is included in the existing /api/v1/watchdog/summary response as causal_score and causal_pre_fault fields on each service snapshot (omitted when not available).
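Clients may want to treat the disabled state and the optional summary fields explicitly. The following sketch assumes the caller already has the HTTP status and body in hand. The function names are hypothetical, and the snapshot is assumed to be a decoded JSON object.

```python
import json


def decode_ciad_response(status: int, body: str):
    """Return the decoded payload of a CIAD GET call, or None if CIAD is
    disabled (HTTP 503 with {"error": "ciad_disabled"})."""
    if status == 503:
        try:
            payload = json.loads(body)
        except ValueError:
            payload = None
        if isinstance(payload, dict) and payload.get("error") == "ciad_disabled":
            return None  # documented "disabled" signal, not an outage
    if not 200 <= status < 300:
        raise RuntimeError(f"unexpected CIAD response: HTTP {status}")
    return json.loads(body)


def causal_fields(snapshot: dict):
    """Read the optional CIAD fields from one watchdog service snapshot.
    Both values are None when the fields are omitted."""
    return snapshot.get("causal_score"), snapshot.get("causal_pre_fault")
```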

Telemetry Quality and CIAD

CIAD requires at least 14 days of 18-dimensional embedding history at a fill rate of ≥ 50%. The Causal Readiness dimension (0–10) of the Telemetry Quality Score (TQS) reflects this directly: a service scoring 10/10 has sufficient history for CIAD to operate with all available invariants.
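The history requirement can be expressed as a simple check. This sketch assumes that fill rate means the fraction of one-minute windows in the history span that have an embedding, which the source does not define. The `causal_ready` function is a hypothetical name, and the sketch does not reproduce how the 0–10 Causal Readiness score is computed.

```python
MINUTES_PER_DAY = 24 * 60


def causal_ready(embedded_minutes, history_days, min_days=14, min_fill=0.5):
    """True when the service has enough embedding history for CIAD.

    embedded_minutes: minutes in the span that actually have an embedding.
    history_days: length of the span in whole days.
    """
    if history_days < min_days:
        return False
    expected_minutes = history_days * MINUTES_PER_DAY
    return embedded_minutes / expected_minutes >= min_fill
```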