ML Engine

The InfraSage ML Engine extends anomaly detection with predictive analytics, model lifecycle management, and causal inference. It runs alongside the AIops Engine and exposes a dedicated REST API.

Capabilities

Capability	Algorithm	Use Case
Anomaly detection	Isolation Forest	Detect multivariate anomalies
Anomaly prediction	XGBoost classifier	Predict anomalies before they occur
Time-series forecasting	ARIMA + exponential smoothing	Capacity planning, trend projection
Degradation trend detection	Linear regression on derivatives	Catch slow-burn degradations
Causal inference	Temporal cross-correlation	Discover cause→effect relationships
Blast radius estimation	Graph traversal	Predict failure propagation
Model drift detection	KL divergence on feature distributions	Know when to retrain

REST API

Base URL: $INFRASAGE_URL/api/v1/ml

Analyze an Incident

Run a full root cause analysis (RCA), blast radius estimate, and anomaly prediction in one call:

curl -X POST $INFRASAGE_URL/api/v1/ml/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "anomaly_id": "anom-7f3d",
    "service_id": "api-gateway",
    "metric_name": "cpu_usage_percent",
    "timestamp": "2026-04-10T12:00:00Z",
    "anomaly_score": 0.95
  }'

The response includes the root cause, suggested actions, historical matches, blast radius, causal relationships, and predicted future anomalies.
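For scripted use, the same request can be built from Python. This is a minimal sketch using only the standard library. The helper name `build_analyze_request` and the fallback base URL are illustrative assumptions, not part of InfraSage; the payload fields are the ones from the curl example above.

```python
import json
import os
import urllib.request


def build_analyze_request(base_url: str, anomaly: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /analyze endpoint."""
    body = json.dumps(anomaly).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/ml/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical fallback URL, used only when INFRASAGE_URL is unset or empty.
base_url = os.environ.get("INFRASAGE_URL") or "http://localhost:8080"

request = build_analyze_request(
    base_url,
    {
        "anomaly_id": "anom-7f3d",
        "service_id": "api-gateway",
        "metric_name": "cpu_usage_percent",
        "timestamp": "2026-04-10T12:00:00Z",
        "anomaly_score": 0.95,
    },
)

# To send it against a running instance:
# with urllib.request.urlopen(request, timeout=10) as response:
#     result = json.load(response)
```

Building the request separately from sending it keeps the payload easy to inspect or log before it reaches the API.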

Predict Upcoming Anomalies

Get predicted anomalies for a service in the next N minutes:

curl "$INFRASAGE_URL/api/v1/ml/predict?service_id=database&horizon_minutes=30"

{
  "service_id": "database",
  "predictions": [
    {
      "metric_name": "cpu_usage_percent",
      "confidence_score": 0.84,
      "prediction_reason": "CPU trending toward 90% threshold",
      "suggested_runbook": "scale-up-compute",
      "lead_time_minutes": 15
    }
  ]
}
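A client will often want to act only on high-confidence predictions. The sketch below filters the sample response above. The function name `actionable_predictions` and the 0.8 cutoff are assumptions chosen for illustration, and it assumes `lead_time_minutes` is the time remaining before the predicted anomaly.

```python
import json

# Sample response copied from the example above.
response_text = """
{
  "service_id": "database",
  "predictions": [
    {
      "metric_name": "cpu_usage_percent",
      "confidence_score": 0.84,
      "prediction_reason": "CPU trending toward 90% threshold",
      "suggested_runbook": "scale-up-compute",
      "lead_time_minutes": 15
    }
  ]
}
"""


def actionable_predictions(response: dict, min_confidence: float = 0.8) -> list:
    """Return predictions at or above min_confidence, shortest lead time first."""
    selected = [
        p for p in response.get("predictions", [])
        if p["confidence_score"] >= min_confidence
    ]
    return sorted(selected, key=lambda p: p["lead_time_minutes"])


response = json.loads(response_text)
for p in actionable_predictions(response):
    print(
        f'{p["metric_name"]}: consider runbook "{p["suggested_runbook"]}" '
        f'(lead time {p["lead_time_minutes"]} min)'
    )
```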

Detect Degradation Trends

Find metrics that are slowly trending toward a critical threshold:

curl "$INFRASAGE_URL/api/v1/ml/degradation-trends?service_id=payment-api"

{
  "trends": [
    {
      "metric_name": "memory_usage_percent",
      "current_value": 72.4,
      "trend_slope": 0.8,
      "estimated_breach_minutes": 45,
      "threshold": 90.0
    }
  ]
}
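To give a feel for how a breach estimate can be derived from a trend, the sketch below fits a straight line to recent samples and extrapolates to the threshold. It is a plain least-squares fit on raw values, which is simpler than the regression on derivatives listed under Capabilities, and it is not InfraSage's implementation. The sample readings, the one-sample-per-minute spacing, and both function names are assumptions.

```python
def fit_line(values):
    """Least-squares fit of values against their sample index.

    Returns (slope, intercept), with slope in units per sample.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def minutes_to_breach(values, threshold):
    """Estimate samples (here: minutes) until the fitted line reaches threshold.

    Returns 0.0 if the fitted value is already at or above the threshold,
    and None if the trend is flat or falling.
    """
    slope, intercept = fit_line(values)
    current = intercept + slope * (len(values) - 1)
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope


# Hypothetical memory readings in percent, one sample per minute.
samples = [70.0, 70.5, 71.0, 71.5, 72.0]
estimate = minutes_to_breach(samples, threshold=90.0)
print(estimate)
```

A linear extrapolation like this is only meaningful while the trend stays roughly linear, which is why a breach estimate is best read as an early warning and not a deadline.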

Forecast a Metric

Get a time-series forecast with confidence intervals:

curl "$INFRASAGE_URL/api/v1/ml/forecast?service_id=payment-api&metric=request_rate&horizon_hours=24"

{
  "service_id": "payment-api",
  "metric_name": "request_rate",
  "forecast": [
    {
      "timestamp": "2026-04-11T00:00:00Z",
      "predicted_value": 1250.3,
      "lower_bound": 1100.0,
      "upper_bound": 1400.0,
      "confidence": 0.95
    }
  ]
}
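The forecasting row under Capabilities names ARIMA plus exponential smoothing. The sketch below illustrates only the exponential smoothing part, in its simplest form (a single smoothed level with a flat forecast), and omits confidence bounds entirely. The function names, the smoothing factor, and the sample request rates are assumptions for illustration.

```python
def smoothed_level(values, alpha):
    """Return the exponentially smoothed level after processing all values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = values[0]
    for value in values[1:]:
        # Blend the newest observation with the previous level.
        level = alpha * value + (1 - alpha) * level
    return level


def flat_forecast(values, alpha, steps):
    """Forecast `steps` future points, each equal to the final smoothed level."""
    return [smoothed_level(values, alpha)] * steps


# Hypothetical hourly request rates.
history = [1200.0, 1240.0, 1260.0]
print(flat_forecast(history, alpha=0.5, steps=3))
```

A higher `alpha` weights recent observations more heavily, so the forecast reacts faster to change but is also noisier.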

Discover Causal Relationships

curl "$INFRASAGE_URL/api/v1/ml/causal?service_id=checkout-service"

{
  "relationships": [
    {
      "cause_metric": "db_query_latency_ms",
      "effect_metric": "response_time_ms",
      "strength": 0.91,
      "lag_minutes": 2,
      "confidence": 0.88
    }
  ]
}
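Temporal cross-correlation, the technique listed under Capabilities for causal inference, can be sketched as follows: shift the suspected effect series back by each candidate lag, correlate it with the cause series, and keep the lag with the strongest correlation. This is a simplified illustration and not InfraSage's implementation. The series values, the function names, and the one-sample-per-minute spacing are assumptions, and a strong lagged correlation suggests a relationship without proving causation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag(cause, effect, max_lag):
    """Return (lag, correlation) for the lag that best aligns cause[t] with effect[t + lag]."""
    if len(cause) != len(effect):
        raise ValueError("series must have the same length")
    best = (0, pearson(cause, effect))
    for lag in range(1, max_lag + 1):
        r = pearson(cause[:-lag], effect[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Hypothetical per-minute samples: the effect repeats the cause two samples later.
db_query_latency_ms = [10, 12, 30, 14, 11, 40, 13, 12, 35, 11]
response_time_ms = [5, 5] + [2 * v for v in db_query_latency_ms[:-2]]

lag, strength = best_lag(db_query_latency_ms, response_time_ms, max_lag=3)
print(lag, round(strength, 2))
```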

Validate Causality

Test whether a suspected cause→effect relationship is statistically valid:

curl -X POST $INFRASAGE_URL/api/v1/ml/validate-causality \
  -H "Content-Type: application/json" \
  -d '{
    "service_id": "payment-api",
    "cause_metric": "cpu_usage_percent",
    "effect_metric": "error_rate",
    "lookback_hours": 72
  }'

Model Lifecycle

Training

Train a new model using recent ClickHouse data:

curl -X POST $INFRASAGE_URL/api/v1/ml/train \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ADMIN_JWT" \
  -d '{
    "service_id": "api-gateway",
    "training_window_days": 30,
    "model_type": "isolation_forest"
  }'

Shadow Deployment (A/B Testing)

New models deploy in shadow mode by default: they run in parallel with the production model, and their predictions are logged but not acted upon. This lets you validate a model's accuracy before promoting it.

# Check shadow model performance
curl "$INFRASAGE_URL/api/v1/ml/drift?service_id=api-gateway"

Promote a Model to Production

curl -X POST $INFRASAGE_URL/api/v1/ml/promote \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ADMIN_JWT" \
  -d '{
    "model_id": "model-abc123",
    "service_id": "api-gateway"
  }'

Model Drift Detection

InfraSage monitors model performance continuously. When the input feature distribution diverges from the training distribution (the KL divergence between them exceeds a threshold), InfraSage raises an alert and triggers retraining automatically.

curl "$INFRASAGE_URL/api/v1/ml/drift?service_id=api-gateway"

{
  "service_id": "api-gateway",
  "model_id": "model-abc123",
  "drift_score": 0.12,
  "drift_detected": false,
  "last_trained": "2026-03-15T00:00:00Z",
  "recommendation": "Model is healthy. Next scheduled review: 2026-05-15."
}
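As a rough illustration of a KL-divergence drift check, the sketch below compares a histogram of one feature at training time with a histogram of current inputs. The bin values, the 0.2 threshold, the direction of the divergence (current relative to training), and the function name are all assumptions; how InfraSage actually computes `drift_score` is not specified here.

```python
import math


def kl_divergence(p, q, epsilon=1e-9):
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    # Add a small constant and renormalize so empty bins do not cause
    # a division by zero or log(0).
    p_smooth = [x + epsilon for x in p]
    q_smooth = [x + epsilon for x in q]
    p_total, q_total = sum(p_smooth), sum(q_smooth)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p_smooth, q_smooth)
    )


# Hypothetical binned distributions of a single input feature.
training = [0.25, 0.25, 0.25, 0.25]
current = [0.05, 0.05, 0.20, 0.70]

DRIFT_THRESHOLD = 0.2  # assumed value, for illustration only

score = kl_divergence(current, training)
drift_detected = score > DRIFT_THRESHOLD
print(round(score, 3), drift_detected)
```

KL divergence is asymmetric, so swapping the arguments generally gives a different score. A real check would fix one direction and apply it consistently.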

Blast Radius Estimation

Before executing a runbook or declaring an incident, InfraSage estimates which services will be affected, both directly and transitively:

curl "$INFRASAGE_URL/api/v1/ml/blast-radius?service_id=database&metric=cpu_usage_percent"

{
  "origin_service": "database",
  "directly_affected": ["payment-api", "user-service"],
  "transitively_affected": ["checkout-service", "notification-service"],
  "estimated_severity": "high",
  "confidence": 0.87
}
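The split between directly and transitively affected services corresponds to hop distance in a dependency graph. The sketch below uses a breadth-first traversal to compute it. The graph is hypothetical, chosen to reproduce the service names in the response above, and the function name `blast_radius` is an assumption, not InfraSage's implementation.

```python
from collections import deque


def blast_radius(dependents, origin):
    """Split services reachable from origin into direct (one hop)
    and transitive (two or more hops) dependents."""
    depth = {origin: 0}
    queue = deque([origin])
    while queue:
        service = queue.popleft()
        for dependent in dependents.get(service, []):
            if dependent not in depth:
                # First visit in BFS order gives the shortest hop count.
                depth[dependent] = depth[service] + 1
                queue.append(dependent)
    direct = sorted(s for s, d in depth.items() if d == 1)
    transitive = sorted(s for s, d in depth.items() if d >= 2)
    return direct, transitive


# Hypothetical dependency graph: each key maps to the services that depend on it.
dependents = {
    "database": ["payment-api", "user-service"],
    "payment-api": ["checkout-service"],
    "user-service": ["notification-service", "checkout-service"],
}

direct, transitive = blast_radius(dependents, "database")
print(direct)
print(transitive)
```

Tracking visited services in `depth` also keeps the traversal from looping forever if the dependency graph contains a cycle.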
