
ML Engine

The InfraSage ML Engine extends anomaly detection with predictive analytics, model lifecycle management, and causal inference. It runs alongside the AIops Engine and exposes a dedicated REST API.


Capabilities

| Capability | Algorithm | Use Case |
| --- | --- | --- |
| Anomaly detection | Isolation Forest | Detect multivariate anomalies |
| Anomaly prediction | XGBoost classifier | Predict anomalies before they occur |
| Time-series forecasting | ARIMA + exponential smoothing | Capacity planning, trend projection |
| Degradation trend detection | Linear regression on derivatives | Catch slow-burn degradations |
| Causal inference | Temporal cross-correlation | Discover cause→effect relationships |
| Blast radius estimation | Graph traversal | Predict failure propagation |
| Model drift detection | KL divergence on feature distributions | Know when to retrain |

REST API

Base URL: http://localhost:8080/api/v1/ml

Analyze an Incident

Run root cause analysis (RCA), blast radius estimation, and anomaly prediction in one call:

curl -X POST http://localhost:8080/api/v1/ml/analyze \
-H "Content-Type: application/json" \
-d '{
"anomaly_id": "anom-7f3d",
"service_id": "api-gateway",
"metric_name": "cpu_usage_percent",
"timestamp": "2026-04-10T12:00:00Z",
"anomaly_score": 0.95
}'

The response includes the root cause, suggested actions, historical matches, blast radius, causal relationships, and predicted future anomalies.
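
The same call can be made from a script. The following is a minimal Python sketch using only the standard library; the helper names `build_analyze_request` and `analyze` and the `timeout_s` parameter are illustrative assumptions, not part of InfraSage. Only the URL, headers, and payload fields come from the curl example above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/api/v1/ml"  # base URL documented on this page


def build_analyze_request(anomaly: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for the /analyze endpoint."""
    body = json.dumps(anomaly).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(anomaly: dict, timeout_s: float = 10.0) -> dict:
    """Send the request and decode the JSON response (needs a running ML Engine)."""
    with urllib.request.urlopen(build_analyze_request(anomaly), timeout=timeout_s) as resp:
        return json.load(resp)


# Same payload as the curl example.
payload = {
    "anomaly_id": "anom-7f3d",
    "service_id": "api-gateway",
    "metric_name": "cpu_usage_percent",
    "timestamp": "2026-04-10T12:00:00Z",
    "anomaly_score": 0.95,
}
request = build_analyze_request(payload)
```

Building the request separately from sending it makes the payload easy to inspect or log before it reaches the engine.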


Predict Upcoming Anomalies

Get predicted anomalies for a service in the next N minutes:

curl "http://localhost:8080/api/v1/ml/predict?service_id=database&horizon_minutes=30"

Example response:

{
"service_id": "database",
"predictions": [
{
"metric_name": "cpu_usage_percent",
"confidence_score": 0.84,
"prediction_reason": "CPU trending toward 90% threshold",
"suggested_runbook": "scale-up-compute",
"lead_time_minutes": 15
}
]
}

Detect Degradation Trends

Find metrics that are slowly trending toward a critical threshold:

curl "http://localhost:8080/api/v1/ml/degradation-trends?service_id=payment-api"

Example response:

{
"trends": [
{
"metric_name": "memory_usage_percent",
"current_value": 72.4,
"trend_slope": 0.8,
"estimated_breach_minutes": 45,
"threshold": 90.0
}
]
}
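
To illustrate how a breach estimate can follow from a trend, here is a minimal Python sketch that fits a least-squares slope to recent samples and extrapolates linearly to the threshold. It is a simplification: it regresses on raw values rather than on derivatives, and it assumes the slope is expressed per minute, which this page does not state. The function names and the sample data are hypothetical.

```python
from typing import Optional, Sequence


def fit_slope(minutes: Sequence[float], values: Sequence[float]) -> float:
    """Ordinary least-squares slope of values against time (units per minute)."""
    n = len(minutes)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (time, value) pairs of equal length")
    mean_t = sum(minutes) / n
    mean_v = sum(values) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(minutes, values))
    var = sum((t - mean_t) ** 2 for t in minutes)
    if var == 0:
        raise ValueError("all timestamps are identical")
    return cov / var


def minutes_to_breach(current: float, slope: float, threshold: float) -> Optional[float]:
    """Linear extrapolation: minutes until `threshold` is reached, or None."""
    if current >= threshold:
        return 0.0  # already at or above the threshold
    if slope <= 0:
        return None  # flat or improving: no breach projected
    return (threshold - current) / slope


# Synthetic samples (assumption): memory usage sampled once per minute.
times = [0, 1, 2, 3, 4]
usage = [70.0, 70.5, 71.0, 71.5, 72.0]
slope = fit_slope(times, usage)
eta = minutes_to_breach(usage[-1], slope, threshold=90.0)
```

Returning `None` for a flat or falling metric keeps "no breach projected" distinct from "breach now".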

Forecast a Metric

Get a time-series forecast with confidence intervals:

curl "http://localhost:8080/api/v1/ml/forecast?service_id=payment-api&metric=request_rate&horizon_hours=24"

Example response:

{
"service_id": "payment-api",
"metric_name": "request_rate",
"forecast": [
{
"timestamp": "2026-04-11T00:00:00Z",
"predicted_value": 1250.3,
"lower_bound": 1100.0,
"upper_bound": 1400.0,
"confidence": 0.95
}
]
}
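
As a rough illustration of where `predicted_value`, `lower_bound`, and `upper_bound` could come from, the sketch below implements simple exponential smoothing only (the ARIMA component is omitted) and derives an approximate 95% interval from in-sample one-step errors, assuming those errors are roughly normal. The function name, the `alpha` default, and the sample series are assumptions for illustration, not InfraSage internals.

```python
import math
from typing import List, Tuple


def ses_forecast(series: List[float], alpha: float = 0.5, z: float = 1.96) -> Tuple[float, float, float]:
    """Simple exponential smoothing: one-step forecast with a rough interval.

    Returns (predicted_value, lower_bound, upper_bound). The interval is the
    forecast +/- z times the RMS of the in-sample one-step-ahead errors;
    z = 1.96 corresponds to about 95% under a normal-error assumption.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    errors = []
    for observed in series[1:]:
        errors.append(observed - level)  # error of predicting `observed` from the current level
        level = alpha * observed + (1 - alpha) * level
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return level, level - z * rmse, level + z * rmse


# Hypothetical hourly request_rate samples.
request_rate = [1200.0, 1220.0, 1210.0, 1250.0, 1240.0]
predicted, lower, upper = ses_forecast(request_rate, alpha=0.5)
```

A larger `alpha` weights recent observations more heavily, so the forecast reacts faster but is noisier.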

Discover Causal Relationships

curl "http://localhost:8080/api/v1/ml/causal?service_id=checkout-service"

Example response:

{
"relationships": [
{
"cause_metric": "db_query_latency_ms",
"effect_metric": "response_time_ms",
"strength": 0.91,
"lag_minutes": 2,
"confidence": 0.88
}
]
}
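
Temporal cross-correlation can be sketched as follows: shift the effect series by each candidate lag and keep the lag with the highest Pearson correlation against the cause series. This minimal Python version assumes one sample per minute (so the lag in samples equals `lag_minutes`), considers only positive correlation, and uses hypothetical function names and data. A correlation peak is only a candidate relationship, not proof of causation.

```python
import math
from typing import List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def best_lag(cause: List[float], effect: List[float], max_lag: int) -> Tuple[int, float]:
    """Return (lag, r) maximizing corr(cause[t], effect[t + lag]) for lag >= 0."""
    if len(cause) != len(effect):
        raise ValueError("series must have equal length")
    best = (0, -math.inf)
    for lag in range(max_lag + 1):
        x = cause[: len(cause) - lag]  # cause values that have a lagged partner
        y = effect[lag:]               # effect values shifted back by `lag`
        if len(x) < 3:
            break  # too few overlapping points to correlate
        r = pearson(x, y)
        if r > best[1]:
            best = (lag, r)
    return best


# Synthetic data: the effect repeats the cause two samples later.
cause = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 10.0]
effect = [0.0, 0.0] + cause[:-2]
lag, strength = best_lag(cause, effect, max_lag=5)
```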

Validate Causality

Test whether a suspected cause→effect relationship is statistically valid:

curl -X POST http://localhost:8080/api/v1/ml/validate-causality \
-H "Content-Type: application/json" \
-d '{
"service_id": "payment-api",
"cause_metric": "cpu_usage_percent",
"effect_metric": "error_rate",
"lookback_hours": 72
}'

Model Lifecycle

Training

Train a new model using recent ClickHouse data:

curl -X POST http://localhost:8080/api/v1/ml/train \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ADMIN_JWT" \
-d '{
"service_id": "api-gateway",
"training_window_days": 30,
"model_type": "isolation_forest"
}'

Shadow Deployment (A/B Testing)

New models deploy in shadow mode by default: they run in parallel with the production model, and their predictions are logged but not acted upon. This lets you validate accuracy before promoting.

# Check shadow model performance
curl "http://localhost:8080/api/v1/ml/drift?service_id=api-gateway"
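
The shadow pattern itself can be sketched in a few lines of Python: both models score every sample, only the production result is returned to the caller, and the pair is logged for later comparison. The `ShadowRunner` class, the `Model` type, and the threshold lambdas are hypothetical stand-ins, not InfraSage components.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Model = Callable[[dict], bool]  # hypothetical: sample -> "is this an anomaly?"


@dataclass
class ShadowRunner:
    production: Model
    shadow: Model
    log: List[Tuple[bool, bool]] = field(default_factory=list)

    def predict(self, sample: dict) -> bool:
        """Score with both models; log both results, return only production's."""
        prod_result = self.production(sample)
        shadow_result = self.shadow(sample)
        self.log.append((prod_result, shadow_result))
        return prod_result  # the shadow prediction is never acted upon

    def agreement_rate(self) -> float:
        """Fraction of logged samples where both models agreed."""
        if not self.log:
            return 0.0
        return sum(p == s for p, s in self.log) / len(self.log)


# Toy models (assumption): simple CPU thresholds standing in for trained models.
runner = ShadowRunner(
    production=lambda s: s["cpu"] > 90,
    shadow=lambda s: s["cpu"] > 85,
)
decisions = [runner.predict({"cpu": cpu}) for cpu in (80, 88, 95, 92)]
```

Agreement rate is only one possible comparison; with labeled incidents, precision and recall per model would be more informative.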

Promote a Model to Production

curl -X POST http://localhost:8080/api/v1/ml/promote \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ADMIN_JWT" \
-d '{
"model_id": "model-abc123",
"service_id": "api-gateway"
}'

Model Drift Detection

InfraSage monitors model performance continuously. When the input feature distribution diverges from the training distribution (that is, when the KL divergence between them exceeds a threshold), InfraSage raises an alert and triggers retraining automatically.

curl "http://localhost:8080/api/v1/ml/drift?service_id=api-gateway"

Example response:

{
"service_id": "api-gateway",
"model_id": "model-abc123",
"drift_score": 0.12,
"drift_detected": false,
"last_trained": "2026-03-15T00:00:00Z",
"recommendation": "Model is healthy. Next scheduled review: 2026-05-15."
}
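
For intuition about the drift score, the sketch below bins a single feature into a histogram for the training data and for live data, then computes the KL divergence between the two. The direction of the divergence (live relative to training), the bin edges, the smoothing constant, and the 0.1 threshold are all assumptions made for illustration; this page does not specify them.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalized histogram over fixed bin edges, smoothed so no bin is zero.

    Values outside [edges[0], edges[-1]] are ignored.
    """
    counts = [0.0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


edges = [0, 25, 50, 75, 100]                 # hypothetical bins for a percent metric
training = [10, 20, 30, 40, 50, 60, 70, 80]  # synthetic training-time values
live = [60, 70, 80, 85, 90, 95, 80, 90]      # synthetic live values, shifted upward

train_hist = histogram(training, edges)
live_hist = histogram(live, edges)
drift_score = kl_divergence(live_hist, train_hist)
drift_detected = drift_score > 0.1           # hypothetical threshold
```

The smoothing term avoids division by zero when a bin is empty in one distribution; without it, the divergence is undefined for such bins.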

Blast Radius Estimation

Before executing a runbook or declaring an incident, InfraSage estimates which services will be affected, both directly and transitively:

curl "http://localhost:8080/api/v1/ml/blast-radius?service_id=database&metric=cpu_usage_percent"

Example response:

{
"origin_service": "database",
"directly_affected": ["payment-api", "user-service"],
"transitively_affected": ["checkout-service", "notification-service"],
"estimated_severity": "high",
"confidence": 0.87
}
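
The split between `directly_affected` and `transitively_affected` maps naturally onto a breadth-first traversal of a service dependency graph: services one hop from the origin are direct, anything further is transitive. The Python sketch below assumes a simple adjacency map from each service to the services that depend on it; the `blast_radius` function and the graph edges are hypothetical, chosen to reproduce the example response above. Severity and confidence scoring are not covered.

```python
from collections import deque
from typing import Dict, List, Tuple


def blast_radius(dependents: Dict[str, List[str]], origin: str) -> Tuple[List[str], List[str]]:
    """Breadth-first traversal from `origin` over 'is depended on by' edges.

    Returns (directly_affected, transitively_affected), each sorted by name.
    """
    depth = {origin: 0}
    queue = deque([origin])
    while queue:
        service = queue.popleft()
        for dependent in dependents.get(service, []):
            if dependent not in depth:  # first visit gives the shortest hop count
                depth[dependent] = depth[service] + 1
                queue.append(dependent)
    direct = sorted(s for s, d in depth.items() if d == 1)
    transitive = sorted(s for s, d in depth.items() if d > 1)
    return direct, transitive


# Hypothetical dependency graph: key -> services that depend on it.
graph = {
    "database": ["payment-api", "user-service"],
    "payment-api": ["checkout-service"],
    "user-service": ["notification-service"],
}
direct, transitive = blast_radius(graph, "database")
```

Tracking visited services in `depth` also keeps the traversal from looping when the dependency graph contains cycles.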