Runbooks & Automation

InfraSage can automatically execute remediation actions when anomalies are detected. A runbook defines those actions as an ordered sequence of steps, with optional human approval gates and automatic rollback.

Runbook Structure

A runbook is a named sequence of actions triggered by matching anomaly conditions.

{
  "name": "scale-out-payment-api",
  "description": "Scale out payment-api pods when a CPU anomaly is detected",
  "trigger": {
    "service_id": "payment-api",
    "metric_name": "cpu_usage_percent",
    "condition": "anomaly_score > 0.7"
  },
  "steps": [
    {
      "type": "slack",
      "message": "Scaling payment-api from 3 to 5 pods due to CPU anomaly",
      "channel": "#alerts",
      "require_approval": false
    },
    {
      "type": "kubernetes",
      "action": "scale",
      "namespace": "production",
      "deployment": "payment-api",
      "replicas": 5,
      "require_approval": true,
      "approval_timeout_minutes": 10
    },
    {
      "type": "http",
      "method": "POST",
      "url": "https://api.pagerduty.com/incidents",
      "headers": { "Authorization": "Token token=$PAGERDUTY_API_TOKEN" },
      "body": { "type": "incident", "title": "payment-api scaled — monitoring" }
    }
  ],
  "rollback_on_failure": true,
  "dry_run": false
}
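To make the trigger semantics concrete, here is a minimal Python sketch of how a trigger block like the one above could be matched against an incoming anomaly. The `trigger_matches` helper, the shape of the `anomaly` dict, and the restriction to simple `<field> <operator> <number>` conditions are assumptions for illustration; InfraSage's actual condition grammar is not described here.

```python
import operator
import re

# Hypothetical evaluator for simple conditions such as "anomaly_score > 0.7".
# The real InfraSage condition syntax may support more than this.
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}
_CONDITION = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def trigger_matches(trigger: dict, anomaly: dict) -> bool:
    """Return True if the anomaly satisfies every field of the trigger."""
    # service_id and metric_name must match exactly.
    for key in ("service_id", "metric_name"):
        if trigger.get(key) != anomaly.get(key):
            return False
    parsed = _CONDITION.match(trigger["condition"])
    if parsed is None:
        raise ValueError(f"unsupported condition: {trigger['condition']!r}")
    field, op, threshold = parsed.groups()
    if field not in anomaly:
        return False
    return _OPS[op](float(anomaly[field]), float(threshold))


trigger = {
    "service_id": "payment-api",
    "metric_name": "cpu_usage_percent",
    "condition": "anomaly_score > 0.7",
}
anomaly = {
    "service_id": "payment-api",
    "metric_name": "cpu_usage_percent",
    "anomaly_score": 0.93,
}
print(trigger_matches(trigger, anomaly))
```

Rejecting unparseable conditions with an error, rather than silently treating them as non-matching, makes a typo in a runbook visible the first time it is evaluated.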

Action Types

`kubernetes`

Execute Kubernetes API operations:

Action	Description
`scale`	Set deployment replica count
`restart`	Delete and recreate pods
`rollout-undo`	Roll back to previous deployment revision
`cordon`	Mark node as unschedulable
`drain`	Drain node for maintenance

{
  "type": "kubernetes",
  "action": "scale",
  "namespace": "production",
  "deployment": "payment-api",
  "replicas": 5
}
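For orientation, the sketch below maps each action in the table to a roughly equivalent `kubectl` command line. It is an illustration only: InfraSage calls the Kubernetes API directly and may implement these actions differently (for example, `kubectl rollout restart` replaces pods through a rolling update). The `kubectl_args` helper and the `node` field used for `cordon` and `drain` are assumptions, since the section does not show the step schema for node-level actions.

```python
def kubectl_args(step: dict) -> list[str]:
    """Build an approximate kubectl argv for a kubernetes runbook step."""
    action = step["action"]
    if action in ("scale", "restart", "rollout-undo"):
        # Deployment-level actions need a namespace and a deployment name.
        target = f"deployment/{step['deployment']}"
        namespace = ["--namespace", step["namespace"]]
        if action == "scale":
            return ["kubectl", "scale", target, f"--replicas={step['replicas']}", *namespace]
        if action == "restart":
            return ["kubectl", "rollout", "restart", target, *namespace]
        return ["kubectl", "rollout", "undo", target, *namespace]
    if action == "cordon":
        # "node" is a hypothetical field name for node-level actions.
        return ["kubectl", "cordon", step["node"]]
    if action == "drain":
        return ["kubectl", "drain", step["node"], "--ignore-daemonsets"]
    raise ValueError(f"unknown kubernetes action: {action!r}")


scale_step = {
    "type": "kubernetes",
    "action": "scale",
    "namespace": "production",
    "deployment": "payment-api",
    "replicas": 5,
}
print(" ".join(kubectl_args(scale_step)))
```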

`http`

Call any HTTP endpoint:

{
  "type": "http",
  "method": "POST",
  "url": "https://internal-api.mycompany.com/runbooks/restart-service",
  "headers": { "Authorization": "Bearer $SECRET" },
  "body": { "service": "payment-api" },
  "timeout_seconds": 30
}
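The `$SECRET` placeholder in the header suggests that secrets are substituted at execution time rather than stored in the runbook. The section does not say where the values come from, so the sketch below assumes they are read from a mapping such as the process environment. `resolve_placeholders` is a hypothetical helper, not an InfraSage function.

```python
import os
from string import Template


def resolve_placeholders(headers: dict, values: dict) -> dict:
    """Replace $NAME placeholders in header values; raise KeyError if one is missing."""
    return {name: Template(value).substitute(values) for name, value in headers.items()}


headers = {"Authorization": "Bearer $SECRET"}

# In a real deployment the values would more likely come from os.environ or a
# secret store; a literal dict keeps this example self-contained.
resolved = resolve_placeholders(headers, {"SECRET": "example-token"})
print(resolved["Authorization"])

# Resolving against the environment would look like this (commented out because
# SECRET may not be set where the example is run):
# resolved = resolve_placeholders(headers, dict(os.environ))
```

Using `substitute` rather than `safe_substitute` means an unset secret fails loudly instead of sending a literal `$SECRET` to the remote endpoint.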

`shell`

Run a shell command on a designated host:

{
  "type": "shell",
  "command": "systemctl restart payment-api",
  "host": "prod-host-01",
  "timeout_seconds": 60
}

:::warning Shell actions
Shell actions require a configured SSH key or agent. They execute with the permissions of the configured service account. Use with care.
:::

`slack`

Send a Slack message or request approval:

{
  "type": "slack",
  "channel": "#ops-alerts",
  "message": "Runbook triggered: scaling payment-api to 5 replicas",
  "require_approval": true,
  "approval_timeout_minutes": 15
}

Human-in-the-Loop Approval

When a step has `require_approval: true`:

InfraSage sends an interactive Slack message with Approve / Reject buttons
If approved within `approval_timeout_minutes`, the step executes
If rejected or timed out, the runbook halts at that step
The decision is recorded in the audit log with the approver's identity

Approval flow in Slack:

⚡ Runbook: scale-out-payment-api
Service: payment-api | Score: 0.93 | Step 2 of 3

Action: Scale deployment to 5 replicas
Namespace: production

[✅ Approve]  [❌ Reject]

Timeout in: 10 minutes
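The approval rules above reduce to a small decision function. The Python sketch below models them under stated assumptions: the `Decision` and `Outcome` types and the `approval_outcome` helper are hypothetical names, and an approval that arrives after the deadline is treated the same as a timeout.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Outcome(Enum):
    WAIT = "wait"        # no decision yet, still inside the timeout window
    EXECUTE = "execute"  # approved in time, run the step
    HALT = "halt"        # rejected or timed out, stop the runbook at this step


@dataclass(frozen=True)
class Decision:
    verdict: str  # "approved" or "rejected"
    at: datetime  # when the button was pressed


def approval_outcome(
    requested_at: datetime,
    timeout_minutes: int,
    now: datetime,
    decision: Optional[Decision] = None,
) -> Outcome:
    deadline = requested_at + timedelta(minutes=timeout_minutes)
    if decision is None:
        return Outcome.HALT if now > deadline else Outcome.WAIT
    if decision.verdict == "approved" and decision.at <= deadline:
        return Outcome.EXECUTE
    # Rejections, and approvals that arrive after the deadline, halt the runbook.
    return Outcome.HALT


requested = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
approved = Decision("approved", requested + timedelta(minutes=4))
print(approval_outcome(requested, 10, requested + timedelta(minutes=5), approved))
```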

Dry-Run Mode

Test a runbook without executing any side effects:

curl -X POST "$INFRASAGE_URL/api/v1/runbooks/execute" \
  -H "Authorization: Bearer $OPERATOR_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "runbook_name": "scale-out-payment-api",
    "dry_run": true,
    "context": {
      "service_id": "payment-api",
      "anomaly_score": 0.85
    }
  }'

A dry run returns the steps that would have been executed, without making any changes.
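Conceptually, a dry run walks the same step list as a real execution but never invokes the action. The Python sketch below illustrates that idea. The `run_steps` function, the `execute` callback, and the report fields (`status`, `would_execute`) are assumptions for illustration; the actual response format of the execute endpoint is not documented in this section.

```python
from typing import Callable


def run_steps(steps: list[dict], execute: Callable[[dict], str], dry_run: bool) -> list[dict]:
    """Run each step in order, or only report it when dry_run is True."""
    report = []
    for index, step in enumerate(steps, start=1):
        if dry_run:
            # No side effects: record what would have happened and move on.
            report.append({"step": index, "type": step["type"], "status": "would_execute"})
        else:
            output = execute(step)
            report.append({"step": index, "type": step["type"], "status": "executed", "output": output})
    return report


executed_steps = []


def fake_execute(step: dict) -> str:
    """Stand-in for a real action runner; records the call so it can be inspected."""
    executed_steps.append(step)
    return "ok"


steps = [
    {"type": "slack", "channel": "#alerts"},
    {"type": "kubernetes", "action": "scale", "replicas": 5},
]
dry_report = run_steps(steps, fake_execute, dry_run=True)
print(dry_report)
print(len(executed_steps))  # fake_execute was never called during the dry run
```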

Automatic Rollback

If a step fails, or if post-action metrics worsen, InfraSage can automatically reverse the runbook:

{
  "rollback_on_failure": true,
  "rollback_metric_check": {
    "metric_name": "cpu_usage_percent",
    "check_after_minutes": 5,
    "rollback_if_value_exceeds": 90.0
  }
}
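The two rollback conditions in this configuration can be read as a single predicate. The sketch below is one possible interpretation, with two assumptions: either condition alone is enough to trigger a rollback, and "exceeds" means strictly greater than. `should_rollback` is a hypothetical helper name.

```python
from typing import Optional


def should_rollback(config: dict, step_failed: bool, metric_value: Optional[float] = None) -> bool:
    """Decide whether to roll back, given a step result and a post-action metric sample."""
    if step_failed and config.get("rollback_on_failure", False):
        return True
    check = config.get("rollback_metric_check")
    if check is not None and metric_value is not None:
        # metric_value is assumed to be sampled check_after_minutes after the action.
        return metric_value > check["rollback_if_value_exceeds"]
    return False


config = {
    "rollback_on_failure": True,
    "rollback_metric_check": {
        "metric_name": "cpu_usage_percent",
        "check_after_minutes": 5,
        "rollback_if_value_exceeds": 90.0,
    },
}
print(should_rollback(config, step_failed=False, metric_value=94.2))
```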

Rollback steps are derived from the forward steps in reverse order:

scale to 5 → rollback is scale to 3 (original count)
restart → rollback is recorded as manual (a restart cannot be undone)
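The derivation of rollback steps can be sketched in a few lines of Python. This assumes each executed step was recorded together with the state it replaced (here a hypothetical `previous_replicas` field). The `derive_rollback` helper and the `manual` step type are illustrative names, and marking every other step type as manual is a simplification of this sketch rather than documented behavior.

```python
def derive_rollback(executed: list[dict]) -> list[dict]:
    """Build rollback steps by walking the executed steps in reverse order."""
    rollback = []
    for record in reversed(executed):
        step = record["step"]
        if step["type"] == "kubernetes" and step["action"] == "scale":
            # Scale back to the replica count captured before the forward step ran.
            rollback.append({**step, "replicas": record["previous_replicas"]})
        elif step["type"] == "kubernetes" and step["action"] == "restart":
            rollback.append({"type": "manual", "note": "restart cannot be undone"})
        else:
            # Assumption of this sketch: anything else needs a human decision.
            rollback.append({"type": "manual", "note": f"no automatic rollback for {step['type']}"})
    return rollback


executed = [
    {"step": {"type": "kubernetes", "action": "restart", "namespace": "production", "deployment": "payment-api"}},
    {
        "step": {"type": "kubernetes", "action": "scale", "namespace": "production", "deployment": "payment-api", "replicas": 5},
        "previous_replicas": 3,
    },
]
for rollback_step in derive_rollback(executed):
    print(rollback_step)
```

Because the list is reversed, the scale step (executed last) is rolled back first, followed by the manual entry for the restart.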

Execution History

View the full audit trail of runbook executions:

curl $INFRASAGE_URL/api/v1/runbooks/history \
  -H "Authorization: Bearer $YOUR_JWT" \
  -G --data-urlencode "service_id=payment-api" \
     --data-urlencode "limit=20"

Each entry includes:

Runbook name and version
Trigger anomaly ID
Steps executed and their outputs
Approver identity (if any)
Rollback status
Total execution time
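For scripts that poll the history endpoint, the same request as the curl call above can be built with the Python standard library. This sketch only constructs the request and does not send it. The `history_request` helper and the example base URL are assumptions, and the response schema is not shown because this section does not specify the JSON field names.

```python
from urllib.parse import urlencode
from urllib.request import Request


def history_request(base_url: str, jwt: str, service_id: str, limit: int = 20) -> Request:
    """Build a GET request for the runbook execution history endpoint."""
    query = urlencode({"service_id": service_id, "limit": limit})
    return Request(
        f"{base_url}/api/v1/runbooks/history?{query}",
        headers={"Authorization": f"Bearer {jwt}"},
    )


# Placeholder values; in practice these would come from INFRASAGE_URL and a real JWT.
request = history_request("https://infrasage.example.com", "example-jwt", "payment-api")
print(request.get_method(), request.full_url)

# To send it: urllib.request.urlopen(request), then decode the JSON response body.
```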

Configuration Reference

Setting	Default	Description
`approval_timeout_minutes`	`10`	How long to wait for human approval
`rollback_on_failure`	`false`	Auto-rollback if any step fails
`dry_run`	`false`	Log only, no side effects
`max_concurrent_runbooks`	`3`	Max parallel runbooks per tenant
