Anomaly Detection

InfraSage uses a multi-layer anomaly detection pipeline that combines statistical methods with unsupervised machine learning. Detection latency is under 100 milliseconds.


How It Works

The Watchdog runs on a configurable interval (default: every 60 seconds) and performs three passes:

ClickHouse telemetry
  → 1. Statistical Watchdog (Z-score per metric)
  → 2. Isolation Forest (multivariate ML scoring)
  → 3. Adaptive Thresholds (seasonal + infrastructure-aware)
  → Anomaly declared → trigger RCA

Layer 1 — Statistical Watchdog (Z-Score)

For each (service_id, metric_name) pair, the Watchdog maintains a ring buffer of recent values. On each poll cycle, it computes the Z-score of the latest value:

Z = (current_value - rolling_mean) / rolling_stddev

If |Z| > WATCHDOG_Z_SCORE_THRESHOLD (default 3.0), an anomaly is declared.

Ring buffer details:

  • Holds the last N samples per metric (configurable window)
  • Starts empty the first time a service reports a given metric
  • Handles gaps gracefully — a gap does not reset the buffer
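The Z-score check above can be sketched in a few lines of Python. This is a minimal illustration, not InfraSage's actual implementation: the class name `ZScoreWatchdog`, the `WINDOW_SIZE` and `MIN_SAMPLES` values, and the choice to score the new value against the history before appending it are all assumptions made for the example. Only `WATCHDOG_Z_SCORE_THRESHOLD` and its default of 3.0 come from the documentation.

```python
import os
import statistics
from collections import defaultdict, deque

# Threshold read from the documented environment variable (default 3.0).
Z_THRESHOLD = float(os.environ.get("WATCHDOG_Z_SCORE_THRESHOLD", "3.0"))
WINDOW_SIZE = 60   # hypothetical window size; the real setting is not named here
MIN_SAMPLES = 10   # hypothetical warm-up guard before scoring starts


class ZScoreWatchdog:
    """Keeps one fixed-size ring buffer per (service_id, metric_name) pair."""

    def __init__(self, window=WINDOW_SIZE, threshold=Z_THRESHOLD, min_samples=MIN_SAMPLES):
        self.threshold = threshold
        self.min_samples = min_samples
        # deque(maxlen=N) drops the oldest sample automatically once full.
        self.buffers = defaultdict(lambda: deque(maxlen=window))

    def observe(self, service_id, metric_name, value):
        """Score `value` against the buffered history, then append it.

        Returns (z_score, is_anomaly). z_score is None while the buffer is
        warming up or when the history has zero variance.
        """
        buf = self.buffers[(service_id, metric_name)]
        z = None
        if len(buf) >= self.min_samples:
            mean = statistics.fmean(buf)
            stddev = statistics.pstdev(buf)
            if stddev > 0:  # avoid dividing by zero on a flat series
                z = (value - mean) / stddev
        buf.append(value)
        return z, (z is not None and abs(z) > self.threshold)
```

Because samples are only appended when they arrive, a gap in reporting simply leaves the buffer untouched, which matches the "a gap does not reset the buffer" behavior described above.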

Sensitivity tuning

Threshold        Behavior
2.0              Very sensitive: many false positives
3.0 (default)    Balanced: flags 3-sigma deviations
4.0              Conservative: only flags extreme outliers
5.0              Very conservative: production-critical services only

# Set via environment variable
WATCHDOG_Z_SCORE_THRESHOLD=3.5

Layer 2 — Isolation Forest

Isolation Forest is an unsupervised ML algorithm that identifies anomalies by randomly partitioning the feature space. Points that take few partitions to isolate (a short average path length across the trees) are scored as anomalous.

InfraSage runs Isolation Forest across multiple metrics of a single service at once, so it can detect multivariate anomalies that per-metric Z-scores miss. For example, CPU, error rate, and latency may each be slightly elevated: individually within normal range, but anomalous in combination.

Features used:

  • Raw metric value
  • First derivative (rate of change)
  • Second derivative (acceleration)
  • Hour of day (encoded)
  • Day of week (encoded)
  • Correlated metric values
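To make the feature list concrete, here is one way such a feature vector could be assembled. It is a sketch under stated assumptions: the documentation only says the time fields are "encoded", so the sine/cosine cyclic encoding, the use of finite differences for the derivatives, and the function names `cyclic` and `build_features` are all illustrative choices rather than InfraSage's confirmed behavior.

```python
import math
from datetime import datetime, timezone


def cyclic(value, period):
    """Encode a periodic value as (sin, cos) so that period-1 and 0 are neighbours."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def build_features(values, timestamp, correlated=()):
    """Build one feature vector from the three most recent samples of a metric.

    values     : recent samples, oldest first (at least 3)
    timestamp  : datetime of the latest sample
    correlated : latest values of other metrics for the same service
    """
    if len(values) < 3:
        raise ValueError("need at least 3 samples for a second derivative")
    prev2, prev1, current = values[-3:]
    first = current - prev1            # rate of change per poll interval
    second = first - (prev1 - prev2)   # change in the rate of change
    hour_sin, hour_cos = cyclic(timestamp.hour, 24)
    dow_sin, dow_cos = cyclic(timestamp.weekday(), 7)
    return [current, first, second, hour_sin, hour_cos, dow_sin, dow_cos, *correlated]
```

Vectors of this shape could then be passed to any Isolation Forest implementation, for example scikit-learn's `IsolationForest`, which fits on a matrix of such rows and scores new rows by average path length.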

Layer 3 — Adaptive Thresholds

Adaptive thresholds extend Z-score with awareness of:

  • Seasonality — expected higher CPU on weekday mornings, lower on weekends
  • Infrastructure events — a spike after a known deployment is expected; the baseline adapts
  • Rolling baselines — thresholds automatically tighten as variance decreases over time

Adaptive thresholds are built from ClickHouse historical data and recalculated on each Watchdog cycle.
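The seasonality idea can be illustrated by keeping a separate baseline per (weekday, hour) bucket and scoring each new value against the bucket it falls into. This is a simplified sketch: the bucketing scheme and the function names `seasonal_baselines` and `seasonal_z` are assumptions, and deployment-aware adjustment is left out entirely.

```python
import statistics
from collections import defaultdict
from datetime import datetime, timezone


def seasonal_baselines(history):
    """Group historical (timestamp, value) samples by (weekday, hour).

    Returns {(weekday, hour): (mean, stddev)} for buckets with at least 2 samples.
    """
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {
        key: (statistics.fmean(vals), statistics.pstdev(vals))
        for key, vals in buckets.items()
        if len(vals) >= 2
    }


def seasonal_z(baselines, ts, value):
    """Z-score of `value` against the baseline for its own weekday/hour bucket."""
    stats = baselines.get((ts.weekday(), ts.hour))
    if stats is None or stats[1] == 0:
        return None  # no usable baseline for this bucket
    mean, stddev = stats
    return (value - mean) / stddev
```

With this layout, 80% CPU at 09:00 on a Monday scores near zero if Mondays at 09:00 are usually around 80%, while the same reading on a Saturday morning with a much lower baseline produces a large Z-score.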


RCA Cooldown

After an anomaly triggers RCA for a (service_id, metric_name) pair, InfraSage enforces a cooldown period before triggering RCA for the same pair again (default: 15 minutes). This prevents RCA spam during sustained incidents.

WATCHDOG_RCA_COOLDOWN_MINUTES=15

During the cooldown window, anomalies are still detected and stored in ClickHouse — they just don't trigger another RCA cycle.
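The cooldown logic amounts to remembering the last RCA trigger time per pair. The sketch below shows the idea; the class name `RcaCooldown` and its method are hypothetical, and only the `WATCHDOG_RCA_COOLDOWN_MINUTES` variable and its 15-minute default are taken from the documentation.

```python
import os
from datetime import datetime, timedelta, timezone

# Cooldown read from the documented environment variable (default 15 minutes).
COOLDOWN = timedelta(minutes=float(os.environ.get("WATCHDOG_RCA_COOLDOWN_MINUTES", "15")))


class RcaCooldown:
    """Tracks the last RCA trigger per (service_id, metric_name) pair."""

    def __init__(self, cooldown=COOLDOWN):
        self.cooldown = cooldown
        self.last_triggered = {}

    def should_trigger(self, service_id, metric_name, now):
        """Return True and record `now` if the pair is outside its cooldown window."""
        key = (service_id, metric_name)
        last = self.last_triggered.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still cooling down: store the anomaly, skip RCA
        self.last_triggered[key] = now
        return True
```

The caller would still write every anomaly to storage regardless of the return value; the gate only decides whether a new RCA cycle starts.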


Anomaly Score

Each detected anomaly has a numeric score between 0 and 1:

Score      Meaning
0.0–0.4    Mild deviation: logged, not alerted
0.4–0.7    Moderate: alert triggered
0.7–1.0    Severe: alert + RCA + runbook evaluation
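The score bands map naturally onto a small lookup function. The sketch below is illustrative only: the function name and action labels are made up for the example, and because the documented ranges share their boundary values, it assumes that a score exactly on a boundary (0.4 or 0.7) falls into the higher band.

```python
def actions_for_score(score):
    """Map an anomaly score in [0, 1] to the actions listed in the score table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"anomaly score out of range: {score}")
    if score >= 0.7:   # assumption: boundary values belong to the higher band
        return ("alert", "rca", "runbook_evaluation")
    if score >= 0.4:
        return ("alert",)
    return ("log",)
```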

Viewing Anomalies

Via Grafana

Open http://localhost:3000 and select the Anomaly Detection dashboard. It shows:

  • Anomaly timeline by service
  • Score heatmap
  • Top anomalous metrics

Via ClickHouse SQL

SELECT
    service_id,
    metric_name,
    anomaly_score,
    z_score,
    timestamp
FROM infrasage.infrasage_anomalies
WHERE timestamp > now() - INTERVAL 1 HOUR
ORDER BY anomaly_score DESC
LIMIT 50

Via API

curl http://localhost:8080/api/v1/anomalies \
  -H "Authorization: Bearer $YOUR_JWT" \
  -G --data-urlencode "service_id=payment-api" \
  --data-urlencode "since=2026-04-10T00:00:00Z"
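The same request can be built from Python with only the standard library. This sketch mirrors the curl command above; the helper name `build_anomalies_request` is hypothetical, and since the response schema is not shown here, the example stops at constructing the request.

```python
import urllib.parse
import urllib.request


def build_anomalies_request(base_url, token, service_id, since):
    """Build a GET request equivalent to the curl example (not sent here)."""
    query = urllib.parse.urlencode({"service_id": service_id, "since": since})
    return urllib.request.Request(
        f"{base_url}/api/v1/anomalies?{query}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

To send it, pass the result to `urllib.request.urlopen(req)` and decode the body according to whatever format the API returns.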

Dead-Letter Queue (DLQ)

Records that fail validation are not silently dropped — they are stored in the DLQ with their full payload. This lets you audit validation failures and replay corrected records.

# Check DLQ stats
curl http://localhost:8080/api/v1/debug/dlq-stats

# Response
{
  "total_failed": 142,
  "by_reason": {
    "timestamp_too_old": 98,
    "invalid_value": 31,
    "missing_service_id": 13
  },
  "oldest_entry": "2026-04-09T08:00:00Z"
}