ML Engine
The InfraSage ML Engine extends anomaly detection with predictive analytics, model lifecycle management, and causal inference. It runs alongside the AIops Engine and exposes a dedicated REST API.
Capabilities
| Capability | Algorithm | Use Case |
|---|---|---|
| Anomaly detection | Isolation Forest | Detect multivariate anomalies |
| Anomaly prediction | XGBoost classifier | Predict anomalies before they occur |
| Time-series forecasting | ARIMA + exponential smoothing | Capacity planning, trend projection |
| Degradation trend detection | Linear regression on derivatives | Catch slow-burn degradations |
| Causal inference | Temporal cross-correlation | Discover cause→effect relationships |
| Blast radius estimation | Graph traversal | Predict failure propagation |
| Model drift detection | KL divergence on feature distributions | Know when to retrain |
REST API
Base URL: http://localhost:8080/api/v1/ml
Analyze an Incident
Run root cause analysis (RCA), blast radius estimation, and anomaly prediction in a single call:
curl -X POST http://localhost:8080/api/v1/ml/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "anomaly_id": "anom-7f3d",
    "service_id": "api-gateway",
    "metric_name": "cpu_usage_percent",
    "timestamp": "2026-04-10T12:00:00Z",
    "anomaly_score": 0.95
  }'
The response includes the root cause, suggested actions, historical matches, blast radius, causal relationships, and predicted future anomalies.
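The same call can be made from a script. The following is a minimal sketch using only the Python standard library; the helper names `build_analyze_request` and `analyze` are illustrative and not part of InfraSage, and the response is treated as opaque JSON because its exact schema is not shown on this page.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/api/v1/ml"  # base URL from this page


def build_analyze_request(anomaly: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /analyze endpoint."""
    body = json.dumps(anomaly).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(anomaly: dict, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (schema assumed opaque)."""
    req = build_analyze_request(anomaly)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Same payload as the curl example above.
payload = {
    "anomaly_id": "anom-7f3d",
    "service_id": "api-gateway",
    "metric_name": "cpu_usage_percent",
    "timestamp": "2026-04-10T12:00:00Z",
    "anomaly_score": 0.95,
}
request = build_analyze_request(payload)
```

Calling `analyze(payload)` sends the request; it needs a running ML Engine at the base URL.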
Predict Upcoming Anomalies
Get the anomalies predicted for a service within the next N minutes (set N with horizon_minutes):
curl "http://localhost:8080/api/v1/ml/predict?service_id=database&horizon_minutes=30"
{
  "service_id": "database",
  "predictions": [
    {
      "metric_name": "cpu_usage_percent",
      "confidence_score": 0.84,
      "prediction_reason": "CPU trending toward 90% threshold",
      "suggested_runbook": "scale-up-compute",
      "lead_time_minutes": 15
    }
  ]
}
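A consumer of this response will usually want to act only on confident predictions. The sketch below filters by `confidence_score` and orders by `lead_time_minutes`; the 0.8 cutoff and the function name are assumptions chosen for illustration, and the second prediction is made-up sample data.

```python
def actionable_predictions(response: dict, min_confidence: float = 0.8) -> list:
    """Keep predictions at or above min_confidence, most urgent first."""
    kept = [
        p for p in response.get("predictions", [])
        if p["confidence_score"] >= min_confidence
    ]
    return sorted(kept, key=lambda p: p["lead_time_minutes"])


response = {
    "service_id": "database",
    "predictions": [
        {
            "metric_name": "cpu_usage_percent",
            "confidence_score": 0.84,
            "suggested_runbook": "scale-up-compute",
            "lead_time_minutes": 15,
        },
        {
            # Hypothetical low-confidence entry, added only to show the filter.
            "metric_name": "disk_io_wait",
            "confidence_score": 0.41,
            "suggested_runbook": "none",
            "lead_time_minutes": 5,
        },
    ],
}
selected = actionable_predictions(response)
```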
Detect Degradation Trends
Find metrics that are slowly trending toward a critical threshold:
curl "http://localhost:8080/api/v1/ml/degradation-trends?service_id=payment-api"
{
  "trends": [
    {
      "metric_name": "memory_usage_percent",
      "current_value": 72.4,
      "trend_slope": 0.8,
      "estimated_breach_minutes": 45,
      "threshold": 90.0
    }
  ]
}
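To make the idea concrete, here is a minimal sketch of how a slope and a time-to-breach estimate can be derived from recent samples. It fits an ordinary least-squares line to the raw values, which is a simplification of the "linear regression on derivatives" approach listed under Capabilities, and it assumes the slope is expressed per minute. The units of `trend_slope` in the response are not stated on this page, so the numbers here are not meant to reproduce the sample output.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def minutes_to_breach(current, slope_per_minute, threshold):
    """Linear extrapolation to the threshold; None if the metric is not rising."""
    if current >= threshold:
        return 0.0
    if slope_per_minute <= 0:
        return None
    return (threshold - current) / slope_per_minute


# Illustrative samples: one reading per minute, rising 0.5 points per minute.
minutes = [0, 1, 2, 3, 4]
memory_percent = [70.0, 70.5, 71.0, 71.5, 72.0]

slope = least_squares_slope(minutes, memory_percent)
eta = minutes_to_breach(memory_percent[-1], slope, threshold=90.0)
```

With these samples the fitted slope is 0.5 per minute, so the extrapolated time to reach 90.0 from 72.0 is 36 minutes.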
Forecast a Metric
Get a time-series forecast with confidence intervals:
curl "http://localhost:8080/api/v1/ml/forecast?service_id=payment-api&metric=request_rate&horizon_hours=24"
{
  "service_id": "payment-api",
  "metric_name": "request_rate",
  "forecast": [
    {
      "timestamp": "2026-04-11T00:00:00Z",
      "predicted_value": 1250.3,
      "lower_bound": 1100.0,
      "upper_bound": 1400.0,
      "confidence": 0.95
    }
  ]
}
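As a rough illustration of where `predicted_value`, `lower_bound`, and `upper_bound` can come from, the sketch below implements simple exponential smoothing with a symmetric interval built from one-step-ahead errors. It is not the engine's ARIMA + exponential smoothing pipeline; the smoothing factor `alpha`, the normal-approximation multiplier `z = 1.96` for a 95% interval, and the fixed-width interval are all simplifying assumptions.

```python
import math


def ses_forecast(series, alpha=0.5, z=1.96):
    """One-step forecast by simple exponential smoothing with a rough interval."""
    if not series:
        raise ValueError("series must not be empty")
    level = series[0]
    squared_errors = []
    for observed in series[1:]:
        # The current level is the forecast for this observation.
        squared_errors.append((observed - level) ** 2)
        level = alpha * observed + (1 - alpha) * level
    if squared_errors:
        rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    else:
        rmse = 0.0
    margin = z * rmse
    return {
        "predicted_value": level,
        "lower_bound": level - margin,
        "upper_bound": level + margin,
    }


result = ses_forecast([10.0, 12.0, 11.0, 13.0], alpha=0.5)
```

A production forecaster would normally widen the interval as the horizon grows and model trend and seasonality explicitly.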
Discover Causal Relationships
curl "http://localhost:8080/api/v1/ml/causal?service_id=checkout-service"
{
  "relationships": [
    {
      "cause_metric": "db_query_latency_ms",
      "effect_metric": "response_time_ms",
      "strength": 0.91,
      "lag_minutes": 2,
      "confidence": 0.88
    }
  ]
}
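The `strength` and `lag_minutes` fields suggest a lagged correlation between two metrics. The sketch below shows one plausible form of temporal cross-correlation: it shifts the effect series by each candidate lag and keeps the lag with the highest Pearson correlation. The function names and sample data are illustrative, the two series are assumed to be equally long and sampled once per minute, and a strong lagged correlation is evidence of a relationship rather than proof of causation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag(cause, effect, max_lag):
    """Return (lag, correlation) where cause[t] best matches effect[t + lag]."""
    scored = []
    for lag in range(max_lag + 1):
        xs = cause[: len(cause) - lag]
        ys = effect[lag:]
        scored.append((lag, pearson(xs, ys)))
    return max(scored, key=lambda item: item[1])


# Illustrative data: the effect repeats the cause two samples later.
cause = [1, 4, 2, 8, 5, 7, 3, 9, 6, 10, 2, 11]
effect = [0, 0] + cause[:-2]

lag, strength = best_lag(cause, effect, max_lag=4)
```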
Validate Causality
Test whether a suspected cause→effect relationship is statistically valid:
curl -X POST http://localhost:8080/api/v1/ml/validate-causality \
  -H "Content-Type: application/json" \
  -d '{
    "service_id": "payment-api",
    "cause_metric": "cpu_usage_percent",
    "effect_metric": "error_rate",
    "lookback_hours": 72
  }'
Model Lifecycle
Training
Train a new model using recent ClickHouse data:
curl -X POST http://localhost:8080/api/v1/ml/train \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ADMIN_JWT" \
  -d '{
    "service_id": "api-gateway",
    "training_window_days": 30,
    "model_type": "isolation_forest"
  }'
Shadow Deployment (A/B Testing)
New models deploy in shadow mode by default: they run in parallel with the production model, and their predictions are logged but not acted upon. This lets you validate a model's accuracy before promoting it.
# Check shadow model performance
curl "http://localhost:8080/api/v1/ml/drift?service_id=api-gateway"
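The shadow pattern described above can be sketched in a few lines. This is a conceptual illustration, not InfraSage's implementation: both models score the same input, only the production result is returned to the caller, and the shadow result is recorded for later comparison. The model callables, the log structure, and the thresholds are hypothetical.

```python
def serve_with_shadow(features, production_model, shadow_model, shadow_log):
    """Return the production decision; record the shadow decision on the side."""
    decision = production_model(features)
    try:
        shadow_decision = shadow_model(features)
        shadow_log.append({
            "production": decision,
            "shadow": shadow_decision,
            "agree": decision == shadow_decision,
        })
    except Exception as exc:
        # A failing shadow model must never affect production traffic.
        shadow_log.append({"production": decision, "shadow_error": repr(exc)})
    return decision


def agreement_rate(shadow_log):
    """Fraction of logged calls where both models agreed, or None if no data."""
    scored = [entry for entry in shadow_log if "agree" in entry]
    if not scored:
        return None
    return sum(entry["agree"] for entry in scored) / len(scored)


# Hypothetical threshold models standing in for real ones.
production = lambda f: f["cpu"] > 90
shadow = lambda f: f["cpu"] > 85

log = []
decisions = [
    serve_with_shadow({"cpu": cpu}, production, shadow, log)
    for cpu in (50, 88, 95)
]
rate = agreement_rate(log)
```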
Promote a Model to Production
curl -X POST http://localhost:8080/api/v1/ml/promote \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ADMIN_JWT" \
  -d '{
    "model_id": "model-abc123",
    "service_id": "api-gateway"
  }'
Model Drift Detection
InfraSage monitors model performance continuously. When the input feature distribution diverges from the training distribution (the KL divergence exceeds a threshold), InfraSage raises an alert and triggers retraining automatically.
curl "http://localhost:8080/api/v1/ml/drift?service_id=api-gateway"
{
  "service_id": "api-gateway",
  "model_id": "model-abc123",
  "drift_score": 0.12,
  "drift_detected": false,
  "last_trained": "2026-03-15T00:00:00Z",
  "recommendation": "Model is healthy. Next scheduled review: 2026-05-15."
}
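For intuition about how a drift score of this kind can be computed, the sketch below bins a feature's training and live values into histograms and computes the KL divergence of the live distribution from the training one. The bin layout, the additive smoothing constant `eps`, the direction of the divergence, and the sample values are assumptions made for illustration; this page does not state how InfraSage computes `drift_score` or which threshold it applies.

```python
import math


def bin_counts(values, low, high, bins):
    """Histogram with equal-width bins; out-of-range values go to the edge bins."""
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        index = int((value - low) / width)
        counts[min(max(index, 0), bins - 1)] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) in nats, with additive smoothing so empty bins stay finite."""
    k = len(p_counts)
    p_total = sum(p_counts) + eps * k
    q_total = sum(q_counts) + eps * k
    score = 0.0
    for p_count, q_count in zip(p_counts, q_counts):
        p = (p_count + eps) / p_total
        q = (q_count + eps) / q_total
        score += p * math.log(p / q)
    return score


# Illustrative CPU readings: the live window has shifted toward high values.
training = [10, 20, 30, 40, 50, 60, 70, 80]
live = [70, 80, 85, 90, 95, 75, 88, 92]

train_hist = bin_counts(training, low=0, high=100, bins=4)
live_hist = bin_counts(live, low=0, high=100, bins=4)

no_drift = kl_divergence(train_hist, train_hist)
drift = kl_divergence(live_hist, train_hist)
```

Identical histograms give a divergence of zero, and the shifted live window gives a clearly positive score that a threshold check could flag.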
Blast Radius Estimation
Before executing a runbook or declaring an incident, InfraSage estimates which services will be affected:
curl "http://localhost:8080/api/v1/ml/blast-radius?service_id=database&metric=cpu_usage_percent"
{
  "origin_service": "database",
  "directly_affected": ["payment-api", "user-service"],
  "transitively_affected": ["checkout-service", "notification-service"],
  "estimated_severity": "high",
  "confidence": 0.87
}
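The split between `directly_affected` and `transitively_affected` maps naturally onto a breadth-first traversal of a service dependency graph. The sketch below assumes a hypothetical `dependents` map (each service mapped to the services that call it) chosen to be consistent with the sample response; how InfraSage builds its graph and derives severity and confidence is not covered here.

```python
from collections import deque


def blast_radius(dependents, origin):
    """Breadth-first traversal: depth 1 is direct impact, deeper is transitive."""
    depth = {origin: 0}
    queue = deque([origin])
    while queue:
        service = queue.popleft()
        for downstream in dependents.get(service, []):
            if downstream not in depth:  # visit each service once, cycles included
                depth[downstream] = depth[service] + 1
                queue.append(downstream)
    return {
        "origin_service": origin,
        "directly_affected": sorted(s for s, d in depth.items() if d == 1),
        "transitively_affected": sorted(s for s, d in depth.items() if d > 1),
    }


# Hypothetical dependency map: key = service, value = services that depend on it.
dependents = {
    "database": ["payment-api", "user-service"],
    "payment-api": ["checkout-service"],
    "user-service": ["notification-service"],
}

radius = blast_radius(dependents, "database")
```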