Recipe: Payment API Monitoring

This end-to-end walkthrough sets up full-stack observability for a payment processing service — from metric ingestion through anomaly detection, root cause analysis, Slack notification, and automated runbook execution.

Time to complete: ~30 minutes
Prerequisites: InfraSage running (self-hosted or cloud), a Kubernetes cluster with a payment service deployed

What You'll Build

Payment Service
     │
     │ metrics + logs (every 15s)
     ▼
Ingestion Gateway ──► Kafka ──► ClickHouse
                                     │
                              Watchdog (60s poll)
                                     │
                          Anomaly detected?
                                     │
                            Yes ──► RCA Engine
                                     │
                              Claude analysis
                                     │
                     ┌───────────────┼───────────────┐
                     ▼               ▼               ▼
                Slack alert     Jira ticket    Runbook executes
               (#incidents)   (auto-created)  (scale replicas)

Step 1: Create a Tenant and API Key

# Create a tenant for your payment team
curl -X POST http://infrasage:8080/api/v1/tenants \
  -H "Authorization: Bearer $ADMIN_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "tenant_id": "payments-team",
    "name": "Payments Engineering",
    "plan": "pro"
  }'

# Create an ingestion key
curl -X POST http://infrasage:8080/api/v1/keys \
  -H "Authorization: Bearer $ADMIN_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "tenant_id": "payments-team",
    "name": "payment-service-ingest",
    "scope": "ingestion"
  }'

Save the returned key as INFRASAGE_API_KEY.

Step 2: Instrument Your Payment Service

Run a sidecar, or instrument the service directly. Here's a minimal Go example that reports two of the key payment metrics, latency and success:

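// metrics/reporter.go
package metrics

import (
    "bytes"
    "encoding/json"
    "net/http"
    "time"
)

// Reporter sends payment metrics to the InfraSage ingestion gateway.
type Reporter struct {
    gatewayURL string
    apiKey     string
    serviceID  string
    client     *http.Client
}

// NewReporter returns a Reporter whose requests time out after 5 seconds,
// so a slow gateway cannot stall the caller indefinitely.
func NewReporter(gatewayURL, apiKey, serviceID string) *Reporter {
    return &Reporter{
        gatewayURL: gatewayURL,
        apiKey:     apiKey,
        serviceID:  serviceID,
        client:     &http.Client{Timeout: 5 * time.Second},
    }
}

// RecordPayment reports the latency and outcome of one payment attempt.
// Delivery is best-effort: failures are dropped silently. The call blocks
// until the gateway responds, so run it in a goroutine on hot paths.
func (r *Reporter) RecordPayment(latencyMs float64, success bool, provider string) {
    status, successValue := "success", 1.0
    if !success {
        status, successValue = "failure", 0.0
    }
    now := time.Now().UnixMilli() // one timestamp shared by both events

    events := []map[string]any{
        {
            "type":        "metric",
            "service_id":  r.serviceID,
            "metric_name": "payment_latency_ms",
            "value":       latencyMs,
            "timestamp":   now,
            "tags":        map[string]string{"provider": provider, "status": status},
        },
        {
            "type":        "metric",
            "service_id":  r.serviceID,
            "metric_name": "payment_success_rate",
            "value":       successValue, // 1.0 per success, 0.0 per failure
            "timestamp":   now,
            "tags":        map[string]string{"provider": provider},
        },
    }

    body, err := json.Marshal(map[string]any{"events": events})
    if err != nil {
        return
    }
    req, err := http.NewRequest(http.MethodPost, r.gatewayURL+"/api/v1/telemetry/batch", bytes.NewReader(body))
    if err != nil {
        return
    }
    req.Header.Set("X-API-Key", r.apiKey)
    req.Header.Set("Content-Type", "application/json")

    resp, err := r.client.Do(req)
    if err != nil {
        return
    }
    resp.Body.Close() // release the connection; the response is not inspected
}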

Key metrics to report:

Metric	Type	What it signals
`payment_latency_ms`	metric	Slowdowns at the provider or in internal services
`payment_success_rate`	metric	Provider degradation or internal errors
`payment_queue_depth`	metric	Backpressure / consumer lag
`payment_error_count`	metric	Error spikes
`checkout_duration_ms`	metric	End-to-end user-facing latency
Auth/charge log lines	log	Error context for RCA

Step 3: Configure Anomaly Detection

The defaults work, but payment services benefit from tighter thresholds:

# payment-service anomaly config
curl -X PUT http://infrasage:8080/api/v1/tenants/payments-team/services/payment-service/config \
  -H "Authorization: Bearer $JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "watchdog_z_score_threshold": 2.5,
    "cooldown_seconds": 120,
    "adaptive_threshold_enabled": true,
    "metrics": {
      "payment_success_rate": {
        "z_score_threshold": 2.0,
        "direction": "decrease_only"
      },
      "payment_latency_ms": {
        "z_score_threshold": 2.5,
        "direction": "increase_only"
      }
    }
  }'

Setting direction to decrease_only on payment_success_rate means InfraSage alerts only when the rate drops, not when it improves. Likewise, increase_only on payment_latency_ms ignores latency that gets better.
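
To make the interaction between a Z-score threshold and a direction filter concrete, here is a minimal sketch. It assumes the usual definition z = (value - mean) / stddev and an inclusive comparison; how InfraSage evaluates these settings internally is not documented here, and `shouldAlert` is a hypothetical name.

```go
package main

import "fmt"

// shouldAlert applies a Z-score threshold with an optional direction filter.
// The inclusive comparison (>=, <=) is an assumption of this sketch.
func shouldAlert(z, threshold float64, direction string) bool {
	switch direction {
	case "decrease_only":
		return z <= -threshold
	case "increase_only":
		return z >= threshold
	default: // no direction configured: deviations either way count
		return z >= threshold || z <= -threshold
	}
}

func main() {
	fmt.Println(shouldAlert(-3.8, 2.0, "decrease_only")) // success rate dropped sharply
	fmt.Println(shouldAlert(3.8, 2.0, "decrease_only"))  // success rate improved
	fmt.Println(shouldAlert(2.7, 2.5, "increase_only"))  // latency rose
}
```

Under these assumptions, only the first and third calls report an alert: an improvement in success rate is filtered out regardless of its magnitude.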

Step 4: Configure Slack Alerting

# Set Slack webhook
curl -X PUT http://infrasage:8080/api/v1/tenants/payments-team/integrations/slack \
  -H "Authorization: Bearer $JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "webhook_url": "https://hooks.slack.com/services/...",
    "channel": "#payment-incidents",
    "severity_filter": ["high", "critical"],
    "mention_on_critical": "@payments-oncall"
  }'

When an anomaly is detected on payment-service, you'll get a Slack message like:

🔴 [CRITICAL] payment-service — payment_success_rate
Anomaly detected: success rate dropped to 78% (Z-score: -3.8, baseline: 99.2%)
RCA in progress... · payments-team · eu-west-1

Step 5: Set Up a Runbook for Auto-Scaling

If payment latency spikes and RCA identifies pod saturation as the likely cause, automatically scale up:

curl -X POST http://infrasage:8080/api/v1/tenants/payments-team/runbooks \
  -H "Authorization: Bearer $JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Scale payment-service on latency spike",
    "trigger": {
      "service_id": "payment-service",
      "metric_name": "payment_latency_ms",
      "condition": "anomaly_detected",
      "rca_cause_contains": "pod saturation"
    },
    "steps": [
      {
        "type": "kubernetes",
        "action": "scale_deployment",
        "namespace": "payments",
        "deployment": "payment-service",
        "replicas_delta": 2,
        "max_replicas": 10
      },
      {
        "type": "slack",
        "message": "Auto-scaled payment-service to {{new_replicas}} replicas due to latency anomaly"
      }
    ],
    "approval_required": false,
    "dry_run": false
  }'

For destructive runbook steps (e.g., pod restarts, config changes), set "approval_required": true to get a Slack approval button before execution.
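
The replicas_delta and max_replicas fields in the runbook suggest a capped increment. The sketch below shows that arithmetic under the assumption that the cap is applied after adding the delta and that a deployment already above the cap is left alone; the actual InfraSage behavior may differ, and `nextReplicas` is a hypothetical name.

```go
package main

import "fmt"

// nextReplicas returns the replica count after applying delta, capped at maxReplicas.
// If the deployment is already at or above the cap, the count is left unchanged.
func nextReplicas(current, delta, maxReplicas int) int {
	if current >= maxReplicas {
		return current
	}
	if current+delta > maxReplicas {
		return maxReplicas
	}
	return current + delta
}

func main() {
	// Values mirror the runbook: replicas_delta 2, max_replicas 10.
	for _, current := range []int{3, 9, 10} {
		fmt.Printf("%d -> %d\n", current, nextReplicas(current, 2, 10))
	}
}
```

With a cap in place, repeated triggers during a long incident cannot scale the deployment without bound.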

Step 6: Verify the Pipeline

Send a synthetic spike to test the full pipeline:

# Send a normal baseline for 5 minutes (20 samples, 15 s apart); $RANDOM requires bash
for i in $(seq 1 20); do
  curl -s -X POST http://infrasage:8080/api/v1/telemetry \
    -H "X-API-Key: $INFRASAGE_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"type\":\"metric\",\"service_id\":\"payment-service\",\"metric_name\":\"payment_latency_ms\",\"value\":$((RANDOM % 30 + 130)),\"timestamp\":$(date +%s000)}"
  sleep 15
done

# Send a spike
curl -s -X POST http://infrasage:8080/api/v1/telemetry \
  -H "X-API-Key: $INFRASAGE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"type":"metric","service_id":"payment-service","metric_name":"payment_latency_ms","value":980,"timestamp":'$(date +%s000)'}'

Within ~60 seconds (one Watchdog cycle), you should see:

1. Anomaly created in the Admin UI
2. RCA running (check /api/v1/anomalies for status: analyzing)
3. Slack notification in #payment-incidents
4. RCA complete with root cause summary
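
To see why a single 980 ms sample should stand out against the synthetic baseline, the sketch below computes its Z-score over 20 values in the same 130 to 159 ms range. It uses the population standard deviation over the whole window; the window and estimator InfraSage actually uses are not specified here, and `zScore` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// zScore returns how many standard deviations value lies from the mean of baseline.
// It uses the population standard deviation and returns 0 for an empty or constant baseline.
func zScore(baseline []float64, value float64) float64 {
	if len(baseline) == 0 {
		return 0
	}
	var sum float64
	for _, v := range baseline {
		sum += v
	}
	mean := sum / float64(len(baseline))

	var sq float64
	for _, v := range baseline {
		d := v - mean
		sq += d * d
	}
	std := math.Sqrt(sq / float64(len(baseline)))
	if std == 0 {
		return 0
	}
	return (value - mean) / std
}

func main() {
	// 20 deterministic samples spread across 130-159 ms, standing in for the random loop.
	baseline := make([]float64, 20)
	for i := range baseline {
		baseline[i] = 130 + float64((i*7)%30)
	}
	z := zScore(baseline, 980)
	fmt.Printf("z = %.1f, exceeds 2.5: %v\n", z, z > 2.5)
}
```

Because the baseline spread is only a few milliseconds, the spike lands far beyond the configured 2.5 threshold, which is what makes it a reliable smoke test.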

What Good Looks Like

After a week of baseline data, InfraSage's ML Engine builds a seasonal model for your payment metrics. You should see:

- Adaptive thresholds that relax during low-traffic hours (nights/weekends) and tighten during peak hours
- Forecasts showing expected latency ranges for the next 6 hours
- Degradation trends surfacing slow creep in latency before it becomes an incident
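
As an illustration of what a degradation trend can mean, the sketch below fits a least-squares line to evenly spaced latency samples and reports the slope. This is a generic technique chosen for illustration, not a description of InfraSage's ML Engine, and `slope` is a hypothetical helper.

```go
package main

import "fmt"

// slope returns the least-squares slope of ys against their indices 0..n-1,
// i.e. the average change per sample interval. It returns 0 for fewer than two points.
func slope(ys []float64) float64 {
	if len(ys) < 2 {
		return 0
	}
	n := float64(len(ys))
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range ys {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	return (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
}

func main() {
	// Illustrative hourly latency values (ms) that creep upward without any single spike.
	hourly := []float64{140, 141, 143, 144, 146, 147}
	fmt.Printf("trend: %.2f ms per hour\n", slope(hourly))
}
```

A persistently positive slope over many hours flags creep that a per-sample Z-score check would miss, since no individual point is far from its neighbors.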

Check the ML forecast endpoint:

curl "http://infrasage:8080/api/v1/ml/forecast?service_id=payment-service&metric=payment_latency_ms&horizon_minutes=360" \
  -H "X-API-Key: $INFRASAGE_API_KEY"
