Runbooks & Automation
InfraSage can automatically execute remediation actions when anomalies are detected. Runbooks define sequences of steps — with optional human approval gates and automatic rollback.
Runbook Structure
A runbook is a named sequence of actions triggered by matching anomaly conditions.
{
"name": "scale-out-payment-api",
"description": "Scale out payment-api pods when the CPU anomaly score exceeds the threshold",
"trigger": {
"service_id": "payment-api",
"metric_name": "cpu_usage_percent",
"condition": "anomaly_score > 0.7"
},
"steps": [
{
"type": "slack",
"message": "Scaling payment-api from 3 to 5 pods due to CPU anomaly",
"channel": "#alerts",
"require_approval": false
},
{
"type": "kubernetes",
"action": "scale",
"namespace": "production",
"deployment": "payment-api",
"replicas": 5,
"require_approval": true,
"approval_timeout_minutes": 10
},
{
"type": "http",
"method": "POST",
"url": "https://api.pagerduty.com/incidents",
"headers": { "Authorization": "Token token=$PAGERDUTY_API_TOKEN" },
"body": { "type": "incident", "title": "payment-api scaled — monitoring" }
}
],
"rollback_on_failure": true,
"dry_run": false
}
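To make the trigger semantics concrete, here is a minimal Python sketch of how a trigger block like the one above could be matched against an incoming anomaly. It is not InfraSage's actual evaluator: the function name `trigger_matches`, the shape of the `anomaly` dict, and the small condition grammar (one field, one comparison operator, one number) are all assumptions made for illustration.

```python
import operator
import re

# Comparison operators this sketch understands (an assumption, not InfraSage's grammar).
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}
# Matches conditions of the form "<field> <op> <number>", e.g. "anomaly_score > 0.7".
_CONDITION = re.compile(r"^\s*(\w+)\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def trigger_matches(trigger: dict, anomaly: dict) -> bool:
    """Return True if the anomaly satisfies every field of the trigger."""
    if trigger["service_id"] != anomaly.get("service_id"):
        return False
    if trigger["metric_name"] != anomaly.get("metric_name"):
        return False
    match = _CONDITION.match(trigger["condition"])
    if match is None:
        raise ValueError(f"unsupported condition: {trigger['condition']!r}")
    field, op, threshold = match.groups()
    if field not in anomaly:
        return False
    return _OPS[op](float(anomaly[field]), float(threshold))


trigger = {
    "service_id": "payment-api",
    "metric_name": "cpu_usage_percent",
    "condition": "anomaly_score > 0.7",
}
# Hypothetical anomaly payload; the real event schema may differ.
anomaly = {
    "service_id": "payment-api",
    "metric_name": "cpu_usage_percent",
    "anomaly_score": 0.93,
}
print(trigger_matches(trigger, anomaly))  # True
```

Parsing the condition with a strict pattern, rather than evaluating it as code, keeps untrusted runbook text from executing anything.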
Action Types
kubernetes
Execute Kubernetes API operations:
| Action | Description |
|---|---|
| scale | Set deployment replica count |
| restart | Delete and recreate pods |
| rollout-undo | Roll back to previous deployment revision |
| cordon | Mark node as unschedulable |
| drain | Drain node for maintenance |
{
"type": "kubernetes",
"action": "scale",
"namespace": "production",
"deployment": "payment-api",
"replicas": 5
}
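For readers more familiar with kubectl, the sketch below maps each action to a roughly equivalent kubectl command line. It only builds the argument list and runs nothing. InfraSage calls the Kubernetes API directly, so this is an approximation rather than what the product executes. The `node` field for `cordon` and `drain` is a hypothetical name, and `kubectl rollout restart` is only the closest stock equivalent of `restart`: it replaces pods gradually instead of deleting them all at once.

```python
def kubectl_equivalent(step: dict) -> list[str]:
    """Return a kubectl argv that approximates a kubernetes step (nothing is executed)."""
    action = step["action"]

    # Node-level actions; "node" is a hypothetical field name for this sketch.
    if action in ("cordon", "drain"):
        argv = ["kubectl", action, step["node"]]
        if action == "drain":
            # DaemonSet-managed pods cannot be evicted, so drain needs this flag in practice.
            argv.append("--ignore-daemonsets")
        return argv

    # Deployment-level actions.
    target = f"deployment/{step['deployment']}"
    namespace = ["--namespace", step["namespace"]]
    if action == "scale":
        return ["kubectl", "scale", target, f"--replicas={step['replicas']}", *namespace]
    if action == "restart":
        return ["kubectl", "rollout", "restart", target, *namespace]
    if action == "rollout-undo":
        return ["kubectl", "rollout", "undo", target, *namespace]
    raise ValueError(f"unknown kubernetes action: {action!r}")


scale_step = {
    "type": "kubernetes",
    "action": "scale",
    "namespace": "production",
    "deployment": "payment-api",
    "replicas": 5,
}
print(" ".join(kubectl_equivalent(scale_step)))
```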
http
Call any HTTP endpoint:
{
"type": "http",
"method": "POST",
"url": "https://internal-api.mycompany.com/runbooks/restart-service",
"headers": { "Authorization": "Bearer $SECRET" },
"body": { "service": "payment-api" },
"timeout_seconds": 30
}
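The `$SECRET` placeholder suggests that header values are expanded from the environment before the request is sent. How InfraSage actually resolves secrets is not described here, so the following is only a sketch under that assumption: it expands `$NAME` placeholders with Python's `string.Template` and builds, but does not send, the request. The function name `build_request` and the JSON default for `Content-Type` are illustrative choices.

```python
import json
import string
import urllib.request


def build_request(step: dict, env: dict) -> urllib.request.Request:
    """Build an HTTP request from an http step, expanding $NAME placeholders in headers."""
    # substitute() raises KeyError for a missing variable, so a misconfigured
    # secret fails loudly instead of sending a literal "$SECRET".
    headers = {
        name: string.Template(value).substitute(env)
        for name, value in step.get("headers", {}).items()
    }
    body = None
    if "body" in step:
        body = json.dumps(step["body"]).encode("utf-8")
        headers.setdefault("Content-Type", "application/json")
    return urllib.request.Request(
        step["url"], data=body, headers=headers, method=step["method"]
    )


http_step = {
    "type": "http",
    "method": "POST",
    "url": "https://internal-api.mycompany.com/runbooks/restart-service",
    "headers": {"Authorization": "Bearer $SECRET"},
    "body": {"service": "payment-api"},
    "timeout_seconds": 30,
}
request = build_request(http_step, {"SECRET": "s3cr3t"})
# To actually send it, timeout_seconds would be passed along:
# urllib.request.urlopen(request, timeout=http_step["timeout_seconds"])
print(request.get_method(), request.full_url)
```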
shell
Run a shell command on a designated host:
{
"type": "shell",
"command": "systemctl restart payment-api",
"host": "prod-host-01",
"timeout_seconds": 60
}
:::warning Shell actions
Shell actions require a configured SSH key or agent. They execute with the permissions of the configured service account. Use with care.
:::
slack
Send a Slack message or request approval:
{
"type": "slack",
"channel": "#ops-alerts",
"message": "Runbook triggered: scaling payment-api to 5 replicas",
"require_approval": true,
"approval_timeout_minutes": 15
}
Human-in-the-Loop Approval
When a step has require_approval: true:
- InfraSage sends an interactive Slack message with Approve / Reject buttons
- If approved within approval_timeout_minutes, the step executes
- If rejected or timed out, the runbook halts at that step
- The decision is recorded in the audit log with the approver's identity
Approval flow in Slack:
⚡ Runbook: scale-out-payment-api
Service: payment-api | Score: 0.93 | Step 2 of 3
Action: Scale deployment to 5 replicas
Namespace: production
[✅ Approve] [❌ Reject]
Timeout in: 10 minutes
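The approval rules above reduce to a small decision function. The sketch below is one way to express them. The `Decision` type, the outcome labels, and the returned dict are hypothetical and do not reflect InfraSage's internal model or audit log schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Decision:
    """A button click on the Slack message (hypothetical shape)."""
    approved: bool
    approver: str
    decided_at: datetime


def resolve_approval(
    requested_at: datetime, timeout_minutes: int, decision: Optional[Decision]
) -> dict:
    """Decide whether a gated step may run, and what to record for auditing."""
    deadline = requested_at + timedelta(minutes=timeout_minutes)
    # No click at all, or a click after the deadline, both count as a timeout.
    if decision is None or decision.decided_at > deadline:
        return {"outcome": "timed_out", "execute": False, "approver": None}
    if not decision.approved:
        return {"outcome": "rejected", "execute": False, "approver": decision.approver}
    return {"outcome": "approved", "execute": True, "approver": decision.approver}


requested = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
on_time = Decision(True, "alice", requested + timedelta(minutes=4))
too_late = Decision(True, "alice", requested + timedelta(minutes=11))
print(resolve_approval(requested, 10, on_time)["outcome"])
print(resolve_approval(requested, 10, too_late)["outcome"])
```

Only the `approved` outcome lets the step execute; both other outcomes halt the runbook at that step.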
Dry-Run Mode
Test a runbook without executing any side effects:
curl -X POST http://localhost:8080/api/v1/runbooks/execute \
-H "Authorization: Bearer $OPERATOR_JWT" \
-H "Content-Type: application/json" \
-d '{
"runbook_name": "scale-out-payment-api",
"dry_run": true,
"context": {
"service_id": "payment-api",
"anomaly_score": 0.85
}
}'
A dry run returns the steps that would have been executed, without making any changes.
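Conceptually, dry-run mode walks the same steps but replaces each side effect with a report entry. The sketch below illustrates that idea only. The `run_steps` function, the injected `execute` callback, and the report fields are hypothetical and are not the actual response format of the endpoint above.

```python
from typing import Callable


def run_steps(steps: list, dry_run: bool, execute: Callable[[dict], None]) -> list:
    """Run steps in order, or only report them when dry_run is set."""
    report = []
    for index, step in enumerate(steps, start=1):
        if dry_run:
            # No side effect: just record what would have happened.
            status = "would_execute"
        else:
            execute(step)
            status = "executed"
        report.append({"step": index, "type": step["type"], "status": status})
    return report


executed = []  # stands in for real side effects in this sketch
steps = [
    {"type": "slack", "channel": "#alerts"},
    {"type": "kubernetes", "action": "scale", "replicas": 5},
]
dry_report = run_steps(steps, dry_run=True, execute=executed.append)
print(dry_report)
print(len(executed))  # 0, because nothing ran
```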
Automatic Rollback
If a step fails, or if post-action metrics worsen, InfraSage can automatically reverse the runbook:
{
"rollback_on_failure": true,
"rollback_metric_check": {
"metric_name": "cpu_usage_percent",
"check_after_minutes": 5,
"rollback_if_value_exceeds": 90.0
}
}
Rollback steps are derived from the forward steps in reverse order:
- scale to 5 → rollback is scale to 3 (original count)
- restart → rollback is recorded as manual (can't un-restart)
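The reverse-order derivation and the metric check can be sketched as follows. This is an illustration, not InfraSage's implementation. The function names, the `original_replicas` lookup, and the shape of a manual entry are hypothetical. Treating every Kubernetes action other than `scale` as manual is an assumption, since only `scale` and `restart` are specified in this section.

```python
def derive_rollback(executed_steps: list, original_replicas: dict) -> list:
    """Build rollback steps by walking the executed steps in reverse order."""
    rollback = []
    for step in reversed(executed_steps):
        if step.get("type") != "kubernetes":
            continue  # rollback of other step types is not specified in this section
        if step["action"] == "scale":
            # Undo a scale by restoring the replica count recorded before the change.
            key = (step["namespace"], step["deployment"])
            rollback.append({**step, "replicas": original_replicas[key]})
        else:
            # Assumption: anything that is not a scale is flagged for manual follow-up.
            rollback.append({"manual": True, "action": step["action"]})
    return rollback


def should_rollback(step_failed: bool, observed_value: float, check: dict) -> bool:
    """Roll back on step failure, or when the post-action metric exceeds the limit."""
    return step_failed or observed_value > check["rollback_if_value_exceeds"]


forward = [
    {"type": "kubernetes", "action": "scale", "namespace": "production",
     "deployment": "payment-api", "replicas": 5},
    {"type": "kubernetes", "action": "restart", "namespace": "production",
     "deployment": "payment-api"},
]
check = {"metric_name": "cpu_usage_percent", "check_after_minutes": 5,
         "rollback_if_value_exceeds": 90.0}

if should_rollback(step_failed=False, observed_value=95.0, check=check):
    plan = derive_rollback(forward, {("production", "payment-api"): 3})
    print(plan)
```

Recording the original replica count before the forward step runs is what makes the `scale` rollback possible, because the runbook itself only states the target count.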
Execution History
View the full audit trail of runbook executions:
curl http://localhost:8080/api/v1/runbooks/history \
-H "Authorization: Bearer $YOUR_JWT" \
-G --data-urlencode "service_id=payment-api" \
--data-urlencode "limit=20"
Each entry includes:
- Runbook name and version
- Trigger anomaly ID
- Steps executed and their outputs
- Approver identity (if any)
- Rollback status
- Total execution time
Configuration Reference
| Setting | Default | Description |
|---|---|---|
| approval_timeout_minutes | 10 | How long to wait for human approval |
| rollback_on_failure | false | Auto-rollback if any step fails |
| dry_run | false | Log only, no side effects |
| max_concurrent_runbooks | 3 | Max parallel runbooks per tenant |
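As an illustration of how these defaults interact with a runbook definition, the sketch below fills in any setting a runbook omits. The `with_defaults` helper is hypothetical. It assumes `approval_timeout_minutes` applies per step and only to steps that require approval, as in the examples on this page. `max_concurrent_runbooks` is left out because it is a per-tenant limit, not a runbook field.

```python
# Defaults copied from the table above.
DEFAULTS = {
    "approval_timeout_minutes": 10,
    "rollback_on_failure": False,
    "dry_run": False,
}


def with_defaults(runbook: dict) -> dict:
    """Return a copy of the runbook with omitted settings filled from DEFAULTS."""
    resolved = {
        "rollback_on_failure": DEFAULTS["rollback_on_failure"],
        "dry_run": DEFAULTS["dry_run"],
        **runbook,  # explicit values in the runbook win over defaults
    }
    steps = []
    for step in runbook.get("steps", []):
        step = dict(step)  # copy so the caller's runbook is not mutated
        if step.get("require_approval"):
            step.setdefault(
                "approval_timeout_minutes", DEFAULTS["approval_timeout_minutes"]
            )
        steps.append(step)
    resolved["steps"] = steps
    return resolved


minimal = {
    "name": "scale-out-payment-api",
    "steps": [
        {"type": "kubernetes", "action": "scale", "replicas": 5, "require_approval": True},
        {"type": "slack", "channel": "#alerts", "message": "done"},
    ],
}
resolved = with_defaults(minimal)
print(resolved["rollback_on_failure"], resolved["steps"][0]["approval_timeout_minutes"])
```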