Causal Anomaly Detection (CIAD)

Causal Anomaly Detection (CIAD) is InfraSage's pre-fault detection system. While the standard watchdog asks "Is this metric value unusual?", CIAD asks a fundamentally different question: "Have the causal relationships between this service's signals broken down?"

In production, incidents rarely begin with a single metric crossing a threshold. They begin with a structural shift: throughput stops predicting latency, or database health stops predicting error rate, several minutes before any individual metric becomes anomalous. CIAD detects these structural breakdowns and typically provides 3–10 minutes of early warning before a full incident materializes.

How It Works

Embedding dimensions (18-D vector per service per minute)
           │
           ▼
  Transfer Entropy computation
  (measures: does X causally precede Y?)
           │
           ▼
  Compare observed TE vs. expected TE
  from active causal invariants
           │
           ▼
  Causal Score = weighted violation fraction
           │
           ├─ score > 0.40 → Pre-Fault flag raised
           └─ score ≤ 0.40 → Normal

CIAD runs as a leader-gated background worker, scoring all active services every 2 minutes. Scores are persisted to ClickHouse and exposed via API and the Watchdog UI.

Transfer Entropy

The causal measure used is Transfer Entropy (TE):

T(X→Y) = Σ p(y', y, x_lag) × log₂[ p(y' | y, x_lag) / p(y' | y) ]

This quantifies how much knowing the history of X reduces uncertainty about the future of Y — above and beyond what Y's own history already tells you.

Implementation details:

Time series discretized into 3 bins (low / mid / high) using service-specific 33rd and 67th percentiles
Laplace smoothing (+1 per bin) prevents log-of-zero for sparse signals
Lags of 1, 2, and 5 minutes are each tested; the maximum TE across lags is used
Pure Go implementation, ~5 ms per service at runtime
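
The steps above can be sketched in a few lines. This is an illustrative Python sketch, not the production Go code. The function names, the nearest-rank percentile rule, and the reading of "+1 per bin" as one pseudo-count in each of the 27 joint cells are all assumptions.

```python
import math
import random
from collections import Counter

LAGS = (1, 2, 5)  # lags in minutes, as listed above
BINS = 3          # low / mid / high


def discretize(series):
    """Bin values as 0/1/2 using the series' own 33rd and 67th percentiles."""
    ordered = sorted(series)
    lo = ordered[int(0.33 * (len(ordered) - 1))]  # nearest-rank rule (assumption)
    hi = ordered[int(0.67 * (len(ordered) - 1))]
    return [0 if v <= lo else 1 if v <= hi else 2 for v in series]


def transfer_entropy(x, y, lag):
    """T(X→Y) in bits for already-discretized series of equal length."""
    # Count triples (y', y, x_lag) = (y[t], y[t-1], x[t-lag]).
    counts = Counter((y[t], y[t - 1], x[t - lag]) for t in range(lag, len(y)))
    total = sum(counts.values()) + BINS ** 3  # Laplace: +1 in each of 27 cells
    cells = [(a, b, c) for a in range(BINS) for b in range(BINS) for c in range(BINS)]
    joint = {cell: (counts[cell] + 1) / total for cell in cells}

    # Marginals derived from the smoothed joint distribution.
    p_y_x, p_yn_y, p_y = Counter(), Counter(), Counter()
    for (yn, yc, xl), p in joint.items():
        p_y_x[(yc, xl)] += p   # p(y, x_lag)
        p_yn_y[(yn, yc)] += p  # p(y', y)
        p_y[yc] += p           # p(y)

    te = 0.0
    for (yn, yc, xl), p in joint.items():
        with_x = p / p_y_x[(yc, xl)]            # p(y' | y, x_lag)
        without_x = p_yn_y[(yn, yc)] / p_y[yc]  # p(y' | y)
        te += p * math.log2(with_x / without_x)
    return te


def max_te(x_raw, y_raw, lags=LAGS):
    """Discretize both series, then take the maximum TE across the lags."""
    x, y = discretize(x_raw), discretize(y_raw)
    return max(transfer_entropy(x, y, lag) for lag in lags)


# Synthetic check: one series copies x a minute later, the other is unrelated.
rng = random.Random(0)
x = [rng.random() for _ in range(600)]
follower = [x[0]] + x[:-1]
unrelated = [rng.random() for _ in range(600)]
print(round(max_te(x, follower), 3), round(max_te(x, unrelated), 3))
```

On this synthetic data the follower series should score well above the unrelated one, because only the follower's next value is predictable from the history of x.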

Causal Dimensions

CIAD operates on 13 of the 18 embedding dimensions. The 5 temporal dimensions (sin/cos time-of-day and day-of-week encodings, dims 8–12) are excluded because they are deterministic and carry no causal information.

Dim	Signal	Description
0	`avg_value`	Average normalized metric value
1	`max_value`	Peak metric value
2	`error_rate`	Error fraction
3	`warn_rate`	Warning-level event fraction
4	`log_ratio`	Log event rate relative to baseline
5	`trace_ratio`	Trace span rate relative to baseline
6	`inbound_edges`	Normalized inbound traffic
7	`outbound_edges`	Normalized outbound traffic
13	`p99_latency`	99th-percentile latency
14	`http_error_rate`	HTTP 4xx/5xx normalized rate
15	`queue_depth_peak`	Queue depth normalized
16	`db_health`	Database health score normalized
17	`fatal_rate`	Fatal/critical log event rate
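
As a quick illustration of the table above, the following Python sketch selects the 13 causal dimensions from an 18-D embedding and enumerates the ordered cause→effect pairs between them. The names `DIM_NAMES`, `causal_signals`, and `CANDIDATE_PAIRS` are hypothetical and are not part of InfraSage.

```python
from itertools import permutations

# Dimension index -> signal name, copied from the table above.
DIM_NAMES = {
    0: "avg_value", 1: "max_value", 2: "error_rate", 3: "warn_rate",
    4: "log_ratio", 5: "trace_ratio", 6: "inbound_edges", 7: "outbound_edges",
    13: "p99_latency", 14: "http_error_rate", 15: "queue_depth_peak",
    16: "db_health", 17: "fatal_rate",
}
TEMPORAL_DIMS = range(8, 13)     # dims 8-12: deterministic time encodings
CAUSAL_DIMS = sorted(DIM_NAMES)  # the 13 dimensions CIAD operates on


def causal_signals(embedding):
    """Return the 13 causal signals of one 18-D embedding, keyed by name."""
    if len(embedding) != 18:
        raise ValueError("expected an 18-dimensional embedding")
    return {DIM_NAMES[d]: embedding[d] for d in CAUSAL_DIMS}


# Ordered (cause, effect) pairs with cause != effect: 13 * 12 = 156.
CANDIDATE_PAIRS = list(permutations(CAUSAL_DIMS, 2))
print(len(CAUSAL_DIMS), len(CANDIDATE_PAIRS))
```

The pair count matches the 156 candidate pairs that fleet discovery evaluates.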

Causal Invariants

A causal invariant is a causal relationship that holds across the fleet. For example: "throughput (dim 0) consistently predicts p99 latency (dim 13) with a 2-minute lag across 80% of fleet services".

CIAD maintains two types of invariants:

Seed Invariants

Seven universal relationships hardcoded from domain knowledge. These are active from day one, before enough fleet data exists for statistical discovery:

Invariant	Cause	Effect	Expected TE	Lag
`d0_d13_lag2`	avg_value	p99_latency	0.15	2 min
`d13_d2_lag2`	p99_latency	error_rate	0.12	2 min
`d16_d13_lag1`	db_health	p99_latency	0.18	1 min
`d15_d0_lag3`	queue_depth_peak	avg_value	0.10	3 min
`d2_d17_lag2`	error_rate	fatal_rate	0.14	2 min
`d0_d16_lag3`	avg_value	db_health	0.11	3 min
`d14_d2_lag1`	http_error_rate	error_rate	0.20	1 min
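
The invariant IDs in the table appear to follow a `d<cause>_d<effect>_lag<minutes>` pattern. Assuming that convention holds (it is inferred from the table, not stated as a contract), a small Python sketch can represent the seeds as data and decode an ID. `Seed` and `parse_invariant_id` are hypothetical names.

```python
import re
from typing import NamedTuple


class Seed(NamedTuple):
    invariant_id: str
    cause: str
    effect: str
    expected_te: float
    lag_min: int


# The seven seed invariants, copied from the table above.
SEEDS = [
    Seed("d0_d13_lag2", "avg_value", "p99_latency", 0.15, 2),
    Seed("d13_d2_lag2", "p99_latency", "error_rate", 0.12, 2),
    Seed("d16_d13_lag1", "db_health", "p99_latency", 0.18, 1),
    Seed("d15_d0_lag3", "queue_depth_peak", "avg_value", 0.10, 3),
    Seed("d2_d17_lag2", "error_rate", "fatal_rate", 0.14, 2),
    Seed("d0_d16_lag3", "avg_value", "db_health", 0.11, 3),
    Seed("d14_d2_lag1", "http_error_rate", "error_rate", 0.20, 1),
]

ID_PATTERN = re.compile(r"d(\d+)_d(\d+)_lag(\d+)")  # assumed naming convention


def parse_invariant_id(invariant_id):
    """Return (cause_dim, effect_dim, lag_minutes) decoded from an ID."""
    match = ID_PATTERN.fullmatch(invariant_id)
    if match is None:
        raise ValueError(f"unrecognized invariant id: {invariant_id!r}")
    cause_dim, effect_dim, lag = map(int, match.groups())
    return cause_dim, effect_dim, lag


print(parse_invariant_id("d16_d13_lag1"))
```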

Fleet-Discovered Invariants

Once ≥ 10 services each have ≥ 14 days of embedding history, CIAD runs nightly fleet discovery (daily at UTC 00:05). It evaluates all 156 possible causal pairs across the fleet and emits invariants where:

Prevalence ≥ 60% — the relationship holds in at least 60% of qualifying services
Median TE ≥ 0.05 — the relationship is statistically meaningful

Fleet-discovered invariants are stored in ClickHouse alongside seeds, with an is_seed = false flag. Over time they automatically replace seeds with more accurate, fleet-specific baselines.
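
The discovery criteria can be illustrated with a short Python sketch. The text does not define when a relationship "holds" for a single service, so the per-service cutoff `holds_at` below is an assumption, as are the function names. With the default cutoff equal to the median floor, the median check is implied by the prevalence check; it becomes an independent condition only if a lower cutoff is chosen.

```python
from statistics import median

MIN_SERVICES = 10      # services needed before discovery runs
MIN_HISTORY_DAYS = 14  # embedding history required per service
MIN_PREVALENCE = 0.60
MIN_MEDIAN_TE = 0.05


def fleet_ready(history_days_per_service):
    """True once at least 10 services each have at least 14 days of history."""
    qualifying = sum(d >= MIN_HISTORY_DAYS for d in history_days_per_service)
    return qualifying >= MIN_SERVICES


def evaluate_pair(te_per_service, holds_at=MIN_MEDIAN_TE):
    """Return (prevalence, median_te, emit) for one cause->effect pair.

    te_per_service: one TE value per qualifying service.
    holds_at: per-service TE at which the relationship counts as
              holding (an assumption, see the lead-in).
    """
    prevalence = sum(te >= holds_at for te in te_per_service) / len(te_per_service)
    med = median(te_per_service)
    emit = prevalence >= MIN_PREVALENCE and med >= MIN_MEDIAN_TE
    return prevalence, med, emit


strong = [0.12, 0.10, 0.08, 0.07, 0.06, 0.06, 0.05, 0.02, 0.01, 0.00]
patchy = [0.20, 0.20, 0.20, 0.20, 0.20, 0.01, 0.01, 0.01, 0.01, 0.01]
print(evaluate_pair(strong))  # holds in 7 of 10 services
print(evaluate_pair(patchy))  # holds in only 5 of 10 services
```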

Causal Score

For each service, CIAD computes a score from 0 to 1:

For each active invariant:
  observedTE = max(TE at lag 1, 2, 5 min)
  if observedTE < 0.5 × expectedTE:
    violation = (expectedTE − observedTE) / expectedTE
  else:
    violation = 0

CausalScore = Σ (invariant.weight × violation) / Σ invariant.weight

A violation only fires when the observed TE drops below 50% of the expected value. Minor natural drift (e.g., observed = 65% of expected) is not flagged, preventing false positives.

invariant.weight = expectedTE × prevalence — more prevalent and stronger invariants have proportionally more influence.
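
A minimal Python sketch of the scoring rule, with a worked example. The `Invariant` class and function names are hypothetical, and the prevalence of 1.0 used for the two seed invariants in the example is an assumption, since the seed table does not list a prevalence.

```python
import math
from dataclasses import dataclass


@dataclass
class Invariant:
    expected_te: float
    prevalence: float

    @property
    def weight(self):
        return self.expected_te * self.prevalence


def violation(observed_te, expected_te):
    """Fractional shortfall, counted only below 50% of the expected TE."""
    if observed_te < 0.5 * expected_te:
        return (expected_te - observed_te) / expected_te
    return 0.0


def causal_score(observations):
    """observations: list of (Invariant, observed_te) pairs for one service."""
    total_weight = sum(inv.weight for inv, _ in observations)
    if total_weight == 0:
        return 0.0
    weighted = sum(inv.weight * violation(obs, inv.expected_te)
                   for inv, obs in observations)
    return weighted / total_weight


# Worked example with two seed invariants (prevalence 1.0 is assumed).
observations = [
    (Invariant(expected_te=0.20, prevalence=1.0), 0.05),   # 25% of expected
    (Invariant(expected_te=0.10, prevalence=1.0), 0.065),  # 65% of expected
]
score = causal_score(observations)
print(round(score, 2), score > 0.40)
```

The first invariant is violated with violation 0.75 and the second is not, so the score is (0.20 × 0.75) / (0.20 + 0.10) = 0.50, which is above the 0.40 pre-fault threshold. Because a violation fires only below half of the expected TE, any violation that fires is greater than 0.5.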

Pre-Fault Threshold

Score	Status
0.00–0.40	Normal (causal structure intact)
> 0.40	Pre-Fault (causal relationships breaking down)

The threshold of 0.40 is set lower than the weirdness-score anomaly threshold (0.65), which is what gives CIAD its head start.

Viewing CIAD Results

Via the UI

Causal Detection page (sidebar) shows:

Active invariant count and fleet coverage
Invariants table with cause→effect chain, expected TE, lag, and prevalence bar
Service causal scores sorted by score descending, with Pre-Fault badges
Trigger Discovery button to manually kick off fleet discovery

Watchdog page shows an amber ⚡ Causal Pre-Fault banner on any incident panel where the pre-fault flag is active, plus a Causal score column in the Elevated Services table.

Service Detail page shows a 120-minute causal score history chart with the 0.40 threshold line marked.

Via API

# Active invariants
curl $INFRASAGE_URL/api/v1/ciad/invariants \
  -H "Authorization: Bearer $TOKEN"

# Latest causal scores (all services)
curl $INFRASAGE_URL/api/v1/ciad/scores \
  -H "Authorization: Bearer $TOKEN"

# Historical scores for one service (last 60 min)
curl "$INFRASAGE_URL/api/v1/services/payment-api/causal-history?minutes=60" \
  -H "Authorization: Bearer $TOKEN"

# Trigger fleet discovery manually (async, returns 202)
curl -X POST $INFRASAGE_URL/api/v1/ciad/discover \
  -H "Authorization: Bearer $TOKEN"

Example causal score response:

{
  "service_id": "payment-api",
  "window_timestamp": "2026-05-09T04:28:00Z",
  "causal_score": 0.61,
  "violation_count": 4,
  "total_invariants": 7,
  "pre_fault": true
}
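
A client might consume a score object like the one above as follows. This Python sketch parses the example payload as-is. The helper names are hypothetical, and nothing is assumed about how the scores endpoint wraps multiple services.

```python
import json

PRE_FAULT_THRESHOLD = 0.40


def summarize(score):
    """One-line, human-readable summary of a causal score object."""
    status = "PRE-FAULT" if score["pre_fault"] else "normal"
    return (f'{score["service_id"]}: causal_score={score["causal_score"]:.2f}, '
            f'{score["violation_count"]}/{score["total_invariants"]} '
            f'invariants violated, {status}')


def flag_matches_threshold(score):
    """Sanity check: the flag should agree with score > 0.40."""
    return score["pre_fault"] == (score["causal_score"] > PRE_FAULT_THRESHOLD)


example = json.loads("""
{
  "service_id": "payment-api",
  "window_timestamp": "2026-05-09T04:28:00Z",
  "causal_score": 0.61,
  "violation_count": 4,
  "total_invariants": 7,
  "pre_fault": true
}
""")
print(summarize(example))
```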

Via ClickHouse SQL

-- Recent pre-fault events
SELECT service_id, window_timestamp, causal_score, violation_count
FROM infrasage_causal_anomaly_scores
WHERE pre_fault = 1
  AND window_timestamp >= now() - INTERVAL 24 HOUR
ORDER BY window_timestamp DESC;

-- Active invariants
SELECT invariant_id, cause_dim, effect_dim, expected_te, lag_minutes,
       prevalence, is_seed, weight
FROM infrasage_causal_invariants FINAL
ORDER BY weight DESC;

Relationship to the Standard Watchdog

CIAD and the standard anomaly watchdog are independent, parallel detection signals. They never modify each other's scores:

	Standard Watchdog	CIAD
Question asked	Is this metric value unusual?	Has the causal structure broken down?
Signal type	Univariate (per-metric Z-score)	Multivariate (cross-signal causal graph)
Detection threshold	Weirdness score > 0.65	Causal score > 0.40
Typical lead time	Fires at or after incident onset	Fires 3–10 min before incident onset
False positive risk	Low (3-sigma baseline)	Low (50% TE drop required for violation)
Cold start	Requires ~24h of baseline data	Requires ~14 days of embedding history

When both systems flag the same service, confidence that a real incident is underway is high. A CIAD-only pre-fault flag is an early warning to investigate before a full incident develops.
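
One way to combine the two independent signals on the consuming side is sketched below in Python. The thresholds come from the table above. The function name and the returned labels are hypothetical, and the label for a watchdog-only hit is an assumption, since that case is not described here.

```python
WEIRDNESS_THRESHOLD = 0.65  # standard watchdog
CAUSAL_THRESHOLD = 0.40     # CIAD


def triage(weirdness_score, causal_score):
    """Classify a service from the two scores without modifying either."""
    watchdog_fired = weirdness_score > WEIRDNESS_THRESHOLD
    ciad_fired = causal_score > CAUSAL_THRESHOLD
    if watchdog_fired and ciad_fired:
        return "high-confidence incident"
    if ciad_fired:
        return "early warning"
    if watchdog_fired:
        return "metric anomaly"  # label assumed, not from the docs
    return "normal"


print(triage(0.70, 0.61))
print(triage(0.30, 0.61))
```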

Operational Notes

CIAD is leader-gated — only the elected leader pod runs the 2-minute scoring loop. If the leader changes, scoring resumes within one election cycle (~30 s).
Fleet discovery runs as a standalone goroutine (not leader-gated) — all pods schedule it independently. Because the invariant table uses ReplacingMergeTree, duplicate writes from multiple pods are harmless.
If CIAD is disabled (e.g., schema migration not applied), all CIAD API endpoints return {"error":"ciad_disabled"} with HTTP 503. The watchdog and TQS continue operating normally.
The pre-fault score is included in the existing /api/v1/watchdog/summary response as causal_score and causal_pre_fault fields on each service snapshot (omitted when not available).
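
Clients should therefore tolerate both the disabled state and missing fields. A defensive Python sketch follows. The helper names are hypothetical, and the `service_id` key in the sample snapshot is an assumption, as the full snapshot shape is not shown here.

```python
def ciad_disabled(status_code, body):
    """True for the documented disabled response: HTTP 503 plus the error marker."""
    return status_code == 503 and body.get("error") == "ciad_disabled"


def causal_fields(snapshot):
    """Return (causal_score, causal_pre_fault), or None when the fields are omitted."""
    if "causal_score" not in snapshot:
        return None
    return snapshot["causal_score"], bool(snapshot.get("causal_pre_fault", False))


with_ciad = {"service_id": "payment-api", "causal_score": 0.61, "causal_pre_fault": True}
without_ciad = {"service_id": "payment-api"}
print(causal_fields(with_ciad), causal_fields(without_ciad))
print(ciad_disabled(503, {"error": "ciad_disabled"}))
```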

Telemetry Quality and CIAD

CIAD requires at least 14 days of 18-dimensional embedding history at ≥ 50% fill rate. The Telemetry Quality Score Causal Readiness dimension (0–10) reflects this readiness directly — a service scoring 10/10 has sufficient history for CIAD to operate with all available invariants.

Related

Telemetry Quality Scoring — data quality system that CIAD readiness is part of
Anomaly Detection — the parallel statistical and ML watchdog
Root Cause Analysis — triggered by confirmed incidents from either system
