
Recipe: Payment API Monitoring

This end-to-end walkthrough sets up full-stack observability for a payment processing service — from metric ingestion through anomaly detection, root cause analysis, Slack notification, and automated runbook execution.

Time to complete: ~30 minutes
Prerequisites: InfraSage running (self-hosted or cloud), a Kubernetes cluster with a payment service deployed


What You'll Build

Payment Service
       │  metrics + logs (every 15s)
       ▼
Ingestion Gateway ──► Kafka ──► ClickHouse
                                     │
                                     ▼
                            Watchdog (60s poll)
                                     │
                             Anomaly detected?
                                     │
                        Yes ──► RCA Engine
                                     │
                              Claude analysis
                                     │
             ┌───────────────────────┼───────────────────────┐
             ▼                       ▼                       ▼
        Slack alert             Jira ticket          Runbook executes
   (#payment-incidents)       (auto-created)         (scale replicas)

Step 1: Create a Tenant and API Key

# Create a tenant for your payment team
curl -X POST http://infrasage:8080/api/v1/tenants \
-H "Authorization: Bearer $ADMIN_JWT" \
-H "Content-Type: application/json" \
-d '{
"tenant_id": "payments-team",
"name": "Payments Engineering",
"plan": "pro"
}'

# Create an ingestion key
curl -X POST http://infrasage:8080/api/v1/keys \
-H "Authorization: Bearer $ADMIN_JWT" \
-H "Content-Type: application/json" \
-d '{
"tenant_id": "payments-team",
"name": "payment-service-ingest",
"scope": "ingestion"
}'

Save the returned key as INFRASAGE_API_KEY.


Step 2: Instrument Your Payment Service

Add a sidecar or instrument directly. Here's a minimal Go example that reports the key payment metrics:

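// metrics/reporter.go
package metrics

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"time"
)

// Reporter sends payment metrics to the InfraSage ingestion gateway.
type Reporter struct {
	gatewayURL string
	apiKey     string
	serviceID  string
	client     *http.Client
}

func NewReporter(gatewayURL, apiKey, serviceID string) *Reporter {
	return &Reporter{
		gatewayURL: gatewayURL,
		apiKey:     apiKey,
		serviceID:  serviceID,
		// A timeout keeps a slow gateway from piling up stuck requests.
		client: &http.Client{Timeout: 5 * time.Second},
	}
}

// RecordPayment reports the latency and outcome of one payment attempt.
func (r *Reporter) RecordPayment(latencyMs float64, success bool, provider string) {
	status, successValue := "success", 1.0
	if !success {
		status, successValue = "failure", 0.0
	}
	now := time.Now().UnixMilli()

	events := []map[string]any{
		{
			"type":        "metric",
			"service_id":  r.serviceID,
			"metric_name": "payment_latency_ms",
			"value":       latencyMs,
			"timestamp":   now,
			"tags":        map[string]string{"provider": provider, "status": status},
		},
		{
			"type":        "metric",
			"service_id":  r.serviceID,
			"metric_name": "payment_success_rate",
			"value":       successValue,
			"timestamp":   now,
			"tags":        map[string]string{"provider": provider},
		},
	}

	body, err := json.Marshal(map[string]any{"events": events})
	if err != nil {
		log.Printf("metrics: marshal failed: %v", err)
		return
	}

	// Fire-and-forget: send in the background so the payment path never blocks.
	go func() {
		req, err := http.NewRequest(http.MethodPost, r.gatewayURL+"/api/v1/telemetry/batch", bytes.NewReader(body))
		if err != nil {
			log.Printf("metrics: build request failed: %v", err)
			return
		}
		req.Header.Set("X-API-Key", r.apiKey)
		req.Header.Set("Content-Type", "application/json")

		resp, err := r.client.Do(req)
		if err != nil {
			log.Printf("metrics: send failed: %v", err)
			return
		}
		// Drain and close the body so the connection can be reused.
		defer resp.Body.Close()
		io.Copy(io.Discard, resp.Body)
	}()
}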
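One way to wire the reporter into a request path is a small wrapper that times the provider call and reports the outcome. The sketch below is only an illustration: `PaymentRecorder`, `Charge`, `memoryRecorder`, and `errDeclined` are hypothetical names, and the interface simply mirrors the `RecordPayment` signature of the `Reporter` above so a stand-in can be swapped in for tests.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// PaymentRecorder mirrors the RecordPayment method of the Reporter above.
// (Hypothetical interface, introduced so the wrapper is easy to test.)
type PaymentRecorder interface {
	RecordPayment(latencyMs float64, success bool, provider string)
}

// errDeclined is a placeholder error for the demo below.
var errDeclined = errors.New("card declined")

// Charge runs one provider call, measures its wall-clock latency, and
// reports latency and outcome through the recorder.
func Charge(rec PaymentRecorder, provider string, call func() error) error {
	start := time.Now()
	err := call()
	latencyMs := float64(time.Since(start).Microseconds()) / 1000.0
	rec.RecordPayment(latencyMs, err == nil, provider)
	return err
}

// memoryRecorder is a stand-in that remembers the last call instead of
// talking to the gateway.
type memoryRecorder struct {
	calls    int
	success  bool
	provider string
}

func (m *memoryRecorder) RecordPayment(latencyMs float64, success bool, provider string) {
	m.calls++
	m.success = success
	m.provider = provider
}

func main() {
	rec := &memoryRecorder{}
	_ = Charge(rec, "provider-a", func() error { return nil })
	fmt.Printf("calls=%d success=%t provider=%s\n", rec.calls, rec.success, rec.provider)
	_ = Charge(rec, "provider-b", func() error { return errDeclined })
	fmt.Printf("calls=%d success=%t provider=%s\n", rec.calls, rec.success, rec.provider)
}
```

Because `*Reporter` has the same method, it satisfies `PaymentRecorder` without changes, so production code can pass the real reporter while tests pass the in-memory one.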

Key metrics to report:

Metric                  Type     What it signals
payment_latency_ms      metric   Slowdowns at provider or internal
payment_success_rate    metric   Provider degradation or internal errors
payment_queue_depth     metric   Backpressure / consumer lag
payment_error_count     metric   Error spikes
checkout_duration_ms    metric   End-to-end user-facing latency
Auth/charge log lines   log      Error context for RCA
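The reporter example only covers the first two rows. As a rough sketch, the remaining metrics could be sent with the same event shape. `Event`, `NewMetric`, and `BatchBody` are hypothetical helpers; the JSON field names are copied from the reporter example, and it is an assumption that the other metrics use the identical schema. The log event shape is not shown in this recipe, so it is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event copies the field names used by the reporter example above.
type Event struct {
	Type       string            `json:"type"`
	ServiceID  string            `json:"service_id"`
	MetricName string            `json:"metric_name"`
	Value      float64           `json:"value"`
	Timestamp  int64             `json:"timestamp"` // Unix milliseconds
	Tags       map[string]string `json:"tags,omitempty"`
}

// NewMetric builds a metric event; timestampMs is Unix time in milliseconds.
func NewMetric(serviceID, name string, value float64, timestampMs int64, tags map[string]string) Event {
	return Event{
		Type:       "metric",
		ServiceID:  serviceID,
		MetricName: name,
		Value:      value,
		Timestamp:  timestampMs,
		Tags:       tags,
	}
}

// BatchBody wraps events in the {"events": [...]} envelope that the
// reporter example posts to /api/v1/telemetry/batch.
func BatchBody(events []Event) ([]byte, error) {
	return json.Marshal(map[string]any{"events": events})
}

func main() {
	now := time.Now().UnixMilli()
	events := []Event{
		NewMetric("payment-service", "payment_queue_depth", 42, now, nil),
		NewMetric("payment-service", "payment_error_count", 3, now, map[string]string{"provider": "provider-a"}),
		NewMetric("payment-service", "checkout_duration_ms", 812.5, now, nil),
	}
	body, err := BatchBody(events)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Sending several metrics in one batch keeps the request count per payment constant as you add metrics.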

Step 3: Configure Anomaly Detection

The defaults work, but payment services benefit from tighter thresholds:

# payment-service anomaly config
curl -X PUT http://infrasage:8080/api/v1/tenants/payments-team/services/payment-service/config \
-H "Authorization: Bearer $JWT" \
-H "Content-Type: application/json" \
-d '{
"watchdog_z_score_threshold": 2.5,
"cooldown_seconds": 120,
"adaptive_threshold_enabled": true,
"metrics": {
"payment_success_rate": {
"z_score_threshold": 2.0,
"direction": "decrease_only"
},
"payment_latency_ms": {
"z_score_threshold": 2.5,
"direction": "increase_only"
}
}
}'

Setting "direction": "decrease_only" on payment_success_rate means InfraSage alerts only when the rate drops, not when it improves. Likewise, "increase_only" on payment_latency_ms alerts only when latency rises.
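To make the threshold and direction settings concrete, here is a minimal sketch of a Z-score check with a direction filter. It is not InfraSage's implementation: the population standard deviation, the `Both` direction value, and the names `ZScore` and `IsAnomaly` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// Direction mirrors the "direction" values in the config above.
// Both is a hypothetical default for metrics with no direction set.
type Direction string

const (
	Both         Direction = "both"
	IncreaseOnly Direction = "increase_only"
	DecreaseOnly Direction = "decrease_only"
)

// ZScore returns (value - mean) / stddev over the baseline, using the
// population standard deviation. ok is false when the baseline has fewer
// than two points or no variance.
func ZScore(baseline []float64, value float64) (z float64, ok bool) {
	if len(baseline) < 2 {
		return 0, false
	}
	var sum float64
	for _, v := range baseline {
		sum += v
	}
	mean := sum / float64(len(baseline))

	var sq float64
	for _, v := range baseline {
		d := v - mean
		sq += d * d
	}
	std := math.Sqrt(sq / float64(len(baseline)))
	if std == 0 {
		return 0, false
	}
	return (value - mean) / std, true
}

// IsAnomaly applies the threshold only in the configured direction.
func IsAnomaly(z, threshold float64, dir Direction) bool {
	switch dir {
	case IncreaseOnly:
		return z >= threshold
	case DecreaseOnly:
		return z <= -threshold
	default:
		return math.Abs(z) >= threshold
	}
}

func main() {
	// Illustrative success-rate baseline followed by a sharp drop.
	baseline := []float64{0.991, 0.993, 0.992, 0.990, 0.994}
	z, ok := ZScore(baseline, 0.78)
	fmt.Println("z-score computed:", ok)
	fmt.Println("alert on drop:", IsAnomaly(z, 2.0, DecreaseOnly))
	fmt.Println("alert on rise:", IsAnomaly(z, 2.0, IncreaseOnly))
}
```

With `decrease_only`, a large positive Z-score (success rate improving) never trips the check, which is the behaviour the config describes.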


Step 4: Configure Slack Alerting

# Set Slack webhook
curl -X PUT http://infrasage:8080/api/v1/tenants/payments-team/integrations/slack \
-H "Authorization: Bearer $JWT" \
-H "Content-Type: application/json" \
-d '{
"webhook_url": "https://hooks.slack.com/services/...",
"channel": "#payment-incidents",
"severity_filter": ["high", "critical"],
"mention_on_critical": "@payments-oncall"
}'

When an anomaly is detected on payment-service, you'll get a Slack message like:

🔴 [CRITICAL] payment-service — payment_success_rate
Anomaly detected: success rate dropped to 78% (Z-score: -3.8, baseline: 99.2%)
RCA in progress... · payments-team · eu-west-1
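The effect of severity_filter and mention_on_critical can be sketched as a small filter in front of the message builder. This is an illustration rather than InfraSage's delivery code: `SlackConfig`, `Anomaly`, and `SlackText` are hypothetical names, and the message layout is simplified. The only external fact relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// SlackConfig holds the two settings from the integration config above.
type SlackConfig struct {
	SeverityFilter    []string
	MentionOnCritical string
}

// Anomaly is a hypothetical, trimmed-down view of a detected anomaly.
type Anomaly struct {
	Severity   string
	ServiceID  string
	MetricName string
	Summary    string
}

// SlackText returns the message text and whether it should be sent.
// Anomalies whose severity is not in the filter are dropped.
func SlackText(cfg SlackConfig, a Anomaly) (string, bool) {
	allowed := false
	for _, s := range cfg.SeverityFilter {
		if s == a.Severity {
			allowed = true
			break
		}
	}
	if !allowed {
		return "", false
	}
	text := fmt.Sprintf("[%s] %s: %s\n%s",
		strings.ToUpper(a.Severity), a.ServiceID, a.MetricName, a.Summary)
	// The mention string is passed through verbatim for critical alerts.
	if a.Severity == "critical" && cfg.MentionOnCritical != "" {
		text = cfg.MentionOnCritical + " " + text
	}
	return text, true
}

func main() {
	cfg := SlackConfig{
		SeverityFilter:    []string{"high", "critical"},
		MentionOnCritical: "@payments-oncall",
	}
	a := Anomaly{"critical", "payment-service", "payment_success_rate", "success rate dropped to 78%"}
	if text, ok := SlackText(cfg, a); ok {
		// Body shape accepted by a Slack incoming webhook.
		body, err := json.Marshal(map[string]string{"text": text})
		if err != nil {
			panic(err)
		}
		fmt.Println(string(body))
	}
}
```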

Step 5: Set Up a Runbook for Auto-Scaling

If payment latency spikes and RCA identifies pod saturation as the likely cause, automatically scale up:

curl -X POST http://infrasage:8080/api/v1/tenants/payments-team/runbooks \
-H "Authorization: Bearer $JWT" \
-H "Content-Type: application/json" \
-d '{
"name": "Scale payment-service on latency spike",
"trigger": {
"service_id": "payment-service",
"metric_name": "payment_latency_ms",
"condition": "anomaly_detected",
"rca_cause_contains": "pod saturation"
},
"steps": [
{
"type": "kubernetes",
"action": "scale_deployment",
"namespace": "payments",
"deployment": "payment-service",
"replicas_delta": 2,
"max_replicas": 10
},
{
"type": "slack",
"message": "Auto-scaled payment-service to {{new_replicas}} replicas due to latency anomaly"
}
],
"approval_required": false,
"dry_run": false
}'

For destructive runbook steps (e.g., pod restarts, config changes), set "approval_required": true to get a Slack approval button before execution.
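The scaling step's arithmetic is worth spelling out, since replicas_delta and max_replicas interact. The sketch below shows one plausible reading: add the delta, cap at the maximum, and never scale down. `CauseMatches` and `NextReplicas` are hypothetical helpers, and the case-insensitive match for rca_cause_contains is an assumption, not documented behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// CauseMatches reports whether the RCA cause text contains the trigger
// substring. Case-insensitive matching is an assumption.
func CauseMatches(rcaCause, contains string) bool {
	return strings.Contains(strings.ToLower(rcaCause), strings.ToLower(contains))
}

// NextReplicas adds delta to the current replica count and caps the
// result at maxReplicas. If the deployment is already at or above the
// cap, the count is left unchanged rather than scaled down.
func NextReplicas(current, delta, maxReplicas int) int {
	if current >= maxReplicas {
		return current
	}
	next := current + delta
	if next > maxReplicas {
		next = maxReplicas
	}
	return next
}

func main() {
	cause := "Pod saturation on payment-service (CPU throttling)"
	if CauseMatches(cause, "pod saturation") {
		fmt.Println("new replicas:", NextReplicas(3, 2, 10))
	}
}
```

With the values from the runbook above, repeated triggers move a 3-replica deployment to 5, 7, 9, and then 10, where it stays.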


Step 6: Verify the Pipeline

Send a synthetic spike to test the full pipeline:

# Send a normal baseline for 5 minutes (20 samples, 15 seconds apart)
for i in $(seq 1 20); do
  curl -s -X POST http://infrasage:8080/api/v1/telemetry \
    -H "X-API-Key: $INFRASAGE_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"type\":\"metric\",\"service_id\":\"payment-service\",\"metric_name\":\"payment_latency_ms\",\"value\":$((RANDOM % 30 + 130)),\"timestamp\":$(date +%s000)}"
  sleep 15
done

# Send a spike
curl -s -X POST http://infrasage:8080/api/v1/telemetry \
-H "X-API-Key: $INFRASAGE_API_KEY" \
-H "Content-Type: application/json" \
-d '{"type":"metric","service_id":"payment-service","metric_name":"payment_latency_ms","value":980,"timestamp":'$(date +%s000)'}'

Within ~60 seconds (one Watchdog cycle), you should see:

  1. Anomaly created in the Admin UI
  2. RCA running (check /api/v1/anomalies for status: analyzing)
  3. Slack notification in #payment-incidents
  4. RCA complete with root cause summary
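If you want to script this check instead of watching the UI, a small polling loop is enough. The sketch below keeps the HTTP call out of the helper: `fetch` is a caller-supplied function that would query /api/v1/anomalies and return the status field. The response schema is not shown in this recipe, so that part is left to you, and "complete" is a hypothetical final status used only in the demo. `WaitForRCA` and `ErrStillAnalyzing` are illustrative names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrStillAnalyzing is returned when the RCA never leaves "analyzing".
var ErrStillAnalyzing = errors.New("RCA still analyzing after all attempts")

// WaitForRCA calls fetch until the returned status is no longer
// "analyzing", sleeping for interval between attempts.
func WaitForRCA(fetch func() (string, error), attempts int, interval time.Duration) (string, error) {
	for i := 0; i < attempts; i++ {
		status, err := fetch()
		if err != nil {
			return "", err
		}
		if status != "analyzing" {
			return status, nil
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return "", ErrStillAnalyzing
}

func main() {
	// Fake fetch: reports "analyzing" twice, then a final status.
	calls := 0
	fetch := func() (string, error) {
		calls++
		if calls < 3 {
			return "analyzing", nil
		}
		return "complete", nil
	}
	status, err := WaitForRCA(fetch, 5, 10*time.Millisecond)
	fmt.Println(status, err)
}
```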

What Good Looks Like

After a week of baseline data, InfraSage's ML Engine builds a seasonal model for your payment metrics. You should see:

  • Adaptive thresholds that relax during low-traffic hours (nights/weekends) and tighten during peak hours
  • Forecasts showing expected latency ranges for the next 6 hours
  • Degradation trends surfacing slow creep in latency before it becomes an incident

Check the ML forecast endpoint:

curl "http://infrasage:8080/api/v1/ml/forecast?service_id=payment-service&metric=payment_latency_ms&horizon_minutes=360" \
-H "X-API-Key: $INFRASAGE_API_KEY"
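When calling the forecast endpoint from Go, building the query string with net/url avoids quoting mistakes. This sketch only constructs the URL shown in the curl command above; `ForecastURL` is a hypothetical helper, and the response format is not covered in this recipe, so no parsing is attempted.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// ForecastURL builds the forecast endpoint URL for a service and metric.
// horizonMinutes is the forecast window, e.g. 360 for six hours.
func ForecastURL(base, serviceID, metric string, horizonMinutes int) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v1/ml/forecast"

	q := url.Values{}
	q.Set("service_id", serviceID)
	q.Set("metric", metric)
	q.Set("horizon_minutes", strconv.Itoa(horizonMinutes))
	u.RawQuery = q.Encode() // Encode sorts parameters by key

	return u.String(), nil
}

func main() {
	u, err := ForecastURL("http://infrasage:8080", "payment-service", "payment_latency_ms", 6*60)
	if err != nil {
		panic(err)
	}
	fmt.Println(u)
}
```

The parameter order differs from the curl example because `Encode` sorts keys alphabetically; the query itself is equivalent.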

Related: Anomaly Detection · Root Cause Analysis · Runbooks · Slack Integration