Runbooks & Automation

InfraSage can automatically execute remediation actions when anomalies are detected. Runbooks define sequences of steps — with optional human approval gates and automatic rollback.


Runbook Structure

A runbook is a named sequence of actions triggered by matching anomaly conditions.

{
  "name": "scale-out-payment-api",
  "description": "Scale out payment-api pods when the CPU anomaly score exceeds 0.7",
  "trigger": {
    "service_id": "payment-api",
    "metric_name": "cpu_usage_percent",
    "condition": "anomaly_score > 0.7"
  },
  "steps": [
    {
      "type": "slack",
      "message": "Scaling payment-api from 3 to 5 pods due to CPU anomaly",
      "channel": "#alerts",
      "require_approval": false
    },
    {
      "type": "kubernetes",
      "action": "scale",
      "namespace": "production",
      "deployment": "payment-api",
      "replicas": 5,
      "require_approval": true,
      "approval_timeout_minutes": 10
    },
    {
      "type": "http",
      "method": "POST",
      "url": "https://api.pagerduty.com/incidents",
      "headers": { "Authorization": "Token token=$PAGERDUTY_API_TOKEN" },
      "body": { "type": "incident", "title": "payment-api scaled - monitoring" }
    }
  ],
  "rollback_on_failure": true,
  "dry_run": false
}
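To make the trigger semantics concrete, here is a minimal Python sketch of how a trigger could be matched against an incoming anomaly event. The condition grammar (`field operator number`) and the `trigger_matches` helper are assumptions for illustration; this page does not document how InfraSage actually parses `condition`.

```python
import operator
import re

# Operators understood by this sketch (an assumption, not InfraSage's documented grammar).
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le, "==": operator.eq}
_CONDITION = re.compile(r"^\s*(\w+)\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def trigger_matches(trigger: dict, anomaly: dict) -> bool:
    """Return True if the anomaly event satisfies the runbook trigger."""
    # service_id and metric_name must match exactly.
    for key in ("service_id", "metric_name"):
        if trigger.get(key) != anomaly.get(key):
            return False
    parsed = _CONDITION.match(trigger["condition"])
    if parsed is None:
        raise ValueError(f"unsupported condition: {trigger['condition']!r}")
    field, op, threshold = parsed.groups()
    if field not in anomaly:
        return False
    return _OPS[op](float(anomaly[field]), float(threshold))


trigger = {
    "service_id": "payment-api",
    "metric_name": "cpu_usage_percent",
    "condition": "anomaly_score > 0.7",
}
event = {"service_id": "payment-api", "metric_name": "cpu_usage_percent", "anomaly_score": 0.93}
print(trigger_matches(trigger, event))  # True
```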

Action Types

kubernetes

Execute Kubernetes API operations:

| Action | Description |
| --- | --- |
| scale | Set deployment replica count |
| restart | Delete and recreate pods |
| rollout-undo | Roll back to previous deployment revision |
| cordon | Mark node as unschedulable |
| drain | Drain node for maintenance |

{
  "type": "kubernetes",
  "action": "scale",
  "namespace": "production",
  "deployment": "payment-api",
  "replicas": 5
}
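For orientation, the sketch below maps each action to a roughly equivalent `kubectl` command line. It is an illustration only: InfraSage may call the Kubernetes API differently, `kubectl rollout restart` is a rolling restart rather than a literal delete-and-recreate, and the `node` field used for `cordon` and `drain` is hypothetical because this page shows no node-level example.

```python
def kubectl_equivalent(step: dict) -> list[str]:
    """Map a kubernetes step to a roughly equivalent kubectl argv (illustration only)."""
    action = step["action"]
    if action in ("cordon", "drain"):
        # "node" is a hypothetical field name; the docs show no node-level step.
        return ["kubectl", action, step["node"]]
    target = f"deployment/{step['deployment']}"
    namespace = ["--namespace", step["namespace"]]
    if action == "scale":
        return ["kubectl", "scale", target, f"--replicas={step['replicas']}", *namespace]
    if action == "restart":
        return ["kubectl", "rollout", "restart", target, *namespace]
    if action == "rollout-undo":
        return ["kubectl", "rollout", "undo", target, *namespace]
    raise ValueError(f"unknown kubernetes action: {action!r}")


step = {
    "type": "kubernetes",
    "action": "scale",
    "namespace": "production",
    "deployment": "payment-api",
    "replicas": 5,
}
print(" ".join(kubectl_equivalent(step)))
# kubectl scale deployment/payment-api --replicas=5 --namespace production
```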

http

Call any HTTP endpoint:

{
  "type": "http",
  "method": "POST",
  "url": "https://internal-api.mycompany.com/runbooks/restart-service",
  "headers": { "Authorization": "Bearer $SECRET" },
  "body": { "service": "payment-api" },
  "timeout_seconds": 30
}
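The `$SECRET` placeholder suggests that header values are expanded before the request is sent. The following sketch shows one way such a step could be prepared, assuming `$NAME` placeholders are resolved from a secrets mapping and the body is sent as JSON. The `prepare_http_step` helper and the fallback values for `method` and `timeout_seconds` are assumptions, not documented behavior.

```python
import json
from string import Template


def prepare_http_step(step: dict, secrets: dict) -> dict:
    """Expand $NAME placeholders in headers and serialize the JSON body."""
    headers = {
        # substitute() raises KeyError for a missing secret instead of
        # silently sending a literal "$SECRET".
        name: Template(value).substitute(secrets)
        for name, value in step.get("headers", {}).items()
    }
    data = None
    if step.get("body") is not None:
        data = json.dumps(step["body"]).encode("utf-8")
        headers.setdefault("Content-Type", "application/json")
    return {
        "method": step.get("method", "GET"),        # assumed fallback
        "url": step["url"],
        "headers": headers,
        "data": data,
        "timeout": step.get("timeout_seconds", 30),  # assumed fallback
    }


http_step = {
    "type": "http",
    "method": "POST",
    "url": "https://internal-api.mycompany.com/runbooks/restart-service",
    "headers": {"Authorization": "Bearer $SECRET"},
    "body": {"service": "payment-api"},
    "timeout_seconds": 30,
}
prepared = prepare_http_step(http_step, {"SECRET": "example-token"})
print(prepared["headers"]["Authorization"])  # Bearer example-token
```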

shell

Run a shell command on a designated host:

{
  "type": "shell",
  "command": "systemctl restart payment-api",
  "host": "prod-host-01",
  "timeout_seconds": 60
}

:::warning Shell actions
Shell actions require a configured SSH key or agent. They execute with the permissions of the configured service account. Use with care.
:::
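As a rough mental model, a shell step behaves like a non-interactive SSH invocation with a timeout. The sketch below assumes plain OpenSSH on the machine running the step; `ssh_argv` and `run_shell_step` are hypothetical helpers, and InfraSage's real transport may differ.

```python
import subprocess


def ssh_argv(step: dict) -> list[str]:
    """Build the ssh command line for a shell step."""
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which suits unattended runs that rely on a key or agent.
    return ["ssh", "-o", "BatchMode=yes", step["host"], step["command"]]


def run_shell_step(step: dict) -> subprocess.CompletedProcess:
    """Run the step; raises subprocess.TimeoutExpired after timeout_seconds."""
    return subprocess.run(
        ssh_argv(step),
        capture_output=True,
        text=True,
        timeout=step.get("timeout_seconds", 60),  # assumed fallback
        check=False,  # the caller inspects returncode to decide what happens next
    )


shell_step = {
    "type": "shell",
    "command": "systemctl restart payment-api",
    "host": "prod-host-01",
    "timeout_seconds": 60,
}
print(ssh_argv(shell_step))
```

Because the command runs with the service account's permissions, keeping that account's privileges narrow limits what a misfiring runbook can do.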

slack

Send a Slack message or request approval:

{
  "type": "slack",
  "channel": "#ops-alerts",
  "message": "Runbook triggered: scaling payment-api to 5 replicas",
  "require_approval": true,
  "approval_timeout_minutes": 15
}

Human-in-the-Loop Approval

When a step has require_approval: true:

  1. InfraSage sends an interactive Slack message with Approve / Reject buttons
  2. If approved within approval_timeout_minutes, the step executes
  3. If the request is rejected or times out, the runbook halts at that step
  4. The decision is recorded in the audit log with the approver's identity

Approval flow in Slack:

⚡ Runbook: scale-out-payment-api
Service: payment-api | Score: 0.93 | Step 2 of 3

Action: Scale deployment to 5 replicas
Namespace: production

[✅ Approve] [❌ Reject]

Timeout in: 10 minutes
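The approval rules above reduce to a small decision function. This is a sketch under stated assumptions: the status names, the shape of the audit record, and the `resolve_approval` helper are all hypothetical, and a decision that arrives after the deadline is treated as a timeout.

```python
from datetime import datetime, timedelta
from typing import Optional


def resolve_approval(
    requested_at: datetime,
    timeout_minutes: int,
    decision: Optional[str] = None,
    decided_at: Optional[datetime] = None,
    approver: Optional[str] = None,
) -> dict:
    """Return an audit record whose outcome is 'execute' or 'halt'."""
    deadline = requested_at + timedelta(minutes=timeout_minutes)
    if decision is None or decided_at is None or decided_at > deadline:
        # No click before the deadline counts as a timeout.
        status, approver = "timed_out", None
    elif decision == "approve":
        status = "approved"
    else:
        status = "rejected"
    return {
        "status": status,
        "approver": approver,
        "outcome": "execute" if status == "approved" else "halt",
    }


sent = datetime(2024, 1, 1, 12, 0)
print(resolve_approval(sent, 10, "approve", sent + timedelta(minutes=4), "alice")["outcome"])  # execute
print(resolve_approval(sent, 10, "reject", sent + timedelta(minutes=1), "bob")["outcome"])     # halt
print(resolve_approval(sent, 10)["outcome"])                                                   # halt
```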

Dry-Run Mode

Test a runbook without executing any side effects:

curl -X POST http://localhost:8080/api/v1/runbooks/execute \
  -H "Authorization: Bearer $OPERATOR_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "runbook_name": "scale-out-payment-api",
    "dry_run": true,
    "context": {
      "service_id": "payment-api",
      "anomaly_score": 0.85
    }
  }'

A dry run returns the steps that would have been executed without making any changes.
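Conceptually, dry-run mode walks the same steps as a real run but never invokes the executor. The sketch below illustrates that idea; the report fields and the `execute_runbook` helper are assumptions, since this page does not show the response format of the real endpoint.

```python
from typing import Callable


def execute_runbook(runbook: dict, dry_run: bool, run_step: Callable[[dict], str]) -> list[dict]:
    """Walk the steps in order; in dry-run mode, report them without calling run_step."""
    report = []
    for index, step in enumerate(runbook["steps"], start=1):
        entry = {"step": index, "type": step["type"], "action": step.get("action")}
        if dry_run:
            entry["status"] = "would_execute"
        else:
            entry["output"] = run_step(step)
            entry["status"] = "executed"
        report.append(entry)
    return report


calls = []  # records every step that really ran


def fake_executor(step: dict) -> str:
    calls.append(step["type"])
    return "ok"


demo_runbook = {"steps": [{"type": "slack"}, {"type": "kubernetes", "action": "scale"}]}
plan = execute_runbook(demo_runbook, dry_run=True, run_step=fake_executor)
print([entry["status"] for entry in plan])  # ['would_execute', 'would_execute']
print(calls)  # []
```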


Automatic Rollback

If a step fails, or if post-action metrics worsen, InfraSage can automatically reverse the runbook:

{
  "rollback_on_failure": true,
  "rollback_metric_check": {
    "metric_name": "cpu_usage_percent",
    "check_after_minutes": 5,
    "rollback_if_value_exceeds": 90.0
  }
}

Rollback steps are derived from the forward steps in reverse order:

  • scale to 5 → rollback is scale to 3 (original count)
  • restart → rollback is recorded as manual (can't un-restart)
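The derivation rule can be sketched in a few lines of Python. The bookkeeping here is hypothetical: `previous_replicas` stands for the replica count captured before the forward step ran, non-Kubernetes steps are simply skipped, and any Kubernetes action other than `scale` is marked manual (the page only documents that behavior for `restart`).

```python
def derive_rollback(executed: list[dict]) -> list[dict]:
    """Build rollback steps from the executed forward steps, last step first.

    Each entry holds the original "step" plus, for scale actions, the
    "previous_replicas" captured before it ran (a hypothetical field).
    """
    rollback = []
    for entry in reversed(executed):
        step = entry["step"]
        if step["type"] != "kubernetes":
            continue  # notifications and HTTP calls are skipped in this sketch
        if step["action"] == "scale":
            rollback.append({**step, "replicas": entry["previous_replicas"]})
        else:
            # e.g. restart: there is nothing to reverse automatically
            rollback.append({"type": "manual", "reason": f"cannot reverse {step['action']}"})
    return rollback


def metric_triggers_rollback(check: dict, observed_value: float) -> bool:
    """Apply rollback_if_value_exceeds to the value read after check_after_minutes."""
    return observed_value > check["rollback_if_value_exceeds"]


executed_steps = [
    {"step": {"type": "kubernetes", "action": "scale", "deployment": "payment-api", "replicas": 5},
     "previous_replicas": 3},
    {"step": {"type": "kubernetes", "action": "restart", "deployment": "payment-api"}},
]
for rollback_step in derive_rollback(executed_steps):
    print(rollback_step)
```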

Execution History

View the full audit trail of runbook executions:

curl http://localhost:8080/api/v1/runbooks/history \
  -H "Authorization: Bearer $YOUR_JWT" \
  -G --data-urlencode "service_id=payment-api" \
  --data-urlencode "limit=20"

Each entry includes:

  • Runbook name and version
  • Trigger anomaly ID
  • Steps executed and their outputs
  • Approver identity (if any)
  • Rollback status
  • Total execution time
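The same history query can be issued from a script. The sketch below uses only the Python standard library and mirrors the curl call above; it assumes the endpoint returns JSON, and the `history_url` and `fetch_history` helpers are illustrative names, not part of any InfraSage client.

```python
import json
import urllib.parse
import urllib.request


def history_url(base_url: str, service_id: str, limit: int = 20) -> str:
    """Build the history URL with URL-encoded query parameters (what curl -G does)."""
    query = urllib.parse.urlencode({"service_id": service_id, "limit": limit})
    return f"{base_url}/api/v1/runbooks/history?{query}"


def fetch_history(base_url: str, jwt: str, service_id: str, limit: int = 20):
    """Fetch and decode the execution history; the response is assumed to be JSON."""
    request = urllib.request.Request(
        history_url(base_url, service_id, limit),
        headers={"Authorization": f"Bearer {jwt}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


print(history_url("http://localhost:8080", "payment-api"))
# http://localhost:8080/api/v1/runbooks/history?service_id=payment-api&limit=20
```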

Configuration Reference

| Setting | Default | Description |
| --- | --- | --- |
| approval_timeout_minutes | 10 | How long to wait for human approval |
| rollback_on_failure | false | Auto-rollback if any step fails |
| dry_run | false | Log only, no side effects |
| max_concurrent_runbooks | 3 | Max parallel runbooks per tenant |
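To see how these defaults would interact with a runbook definition, the sketch below fills in any omitted settings. It assumes that `approval_timeout_minutes` applies per step (only where `require_approval` is true) and that explicit values always win; the `with_defaults` helper is hypothetical.

```python
DEFAULTS = {"approval_timeout_minutes": 10, "rollback_on_failure": False, "dry_run": False}


def with_defaults(runbook: dict) -> dict:
    """Return a copy of the runbook with omitted settings filled from DEFAULTS."""
    resolved = {
        "rollback_on_failure": DEFAULTS["rollback_on_failure"],
        "dry_run": DEFAULTS["dry_run"],
        **runbook,  # explicit runbook values override the defaults
    }
    resolved["steps"] = [
        {"approval_timeout_minutes": DEFAULTS["approval_timeout_minutes"], **step}
        if step.get("require_approval")
        else dict(step)
        for step in runbook.get("steps", [])
    ]
    return resolved


resolved = with_defaults({
    "name": "scale-out-payment-api",
    "rollback_on_failure": True,
    "steps": [
        {"type": "slack", "require_approval": True},
        {"type": "http"},
    ],
})
print(resolved["dry_run"], resolved["steps"][0]["approval_timeout_minutes"])  # False 10
```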