Recipe: Payment API Monitoring
This end-to-end walkthrough sets up full-stack observability for a payment processing service — from metric ingestion through anomaly detection, root cause analysis, Slack notification, and automated runbook execution.
Time to complete: ~30 minutes
Prerequisites: InfraSage running (self-hosted or cloud), a Kubernetes cluster with a payment service deployed
What You'll Build
Payment Service
       │
       │  metrics + logs (every 15s)
       ▼
Ingestion Gateway ──► Kafka ──► ClickHouse
                                    │
                           Watchdog (60s poll)
                                    │
                            Anomaly detected?
                                    │
                                   Yes ──► RCA Engine
                                               │
                                        Claude analysis
                                               │
                               ┌───────────────┼───────────────┐
                               ▼               ▼               ▼
                          Slack alert     Jira ticket  Runbook executes
                         (#incidents)   (auto-created) (scale replicas)
Step 1: Create a Tenant and API Key
# Create a tenant for your payment team
curl -X POST http://infrasage:8080/api/v1/tenants \
-H "Authorization: Bearer $ADMIN_JWT" \
-H "Content-Type: application/json" \
-d '{
"tenant_id": "payments-team",
"name": "Payments Engineering",
"plan": "pro"
}'
# Create an ingestion key
curl -X POST http://infrasage:8080/api/v1/keys \
-H "Authorization: Bearer $ADMIN_JWT" \
-H "Content-Type: application/json" \
-d '{
"tenant_id": "payments-team",
"name": "payment-service-ingest",
"scope": "ingestion"
}'
Save the returned key as INFRASAGE_API_KEY.
Step 2: Instrument Your Payment Service
Add a sidecar or instrument directly. Here's a minimal Go example that reports the key payment metrics:
// metrics/reporter.go
package metrics

import (
	"bytes"
	"encoding/json"
	"net/http"
	"time"
)

// client bounds each telemetry request; the 5s timeout is an arbitrary choice.
var client = &http.Client{Timeout: 5 * time.Second}

// Reporter sends payment metrics to the InfraSage ingestion gateway.
type Reporter struct {
	gatewayURL string
	apiKey     string
	serviceID  string
}

func NewReporter(gatewayURL, apiKey, serviceID string) *Reporter {
	return &Reporter{gatewayURL: gatewayURL, apiKey: apiKey, serviceID: serviceID}
}

// RecordPayment reports latency and outcome for a single payment attempt.
func (r *Reporter) RecordPayment(latencyMs float64, success bool, provider string) {
	status, successValue := "failure", 0.0
	if success {
		status, successValue = "success", 1.0
	}
	now := time.Now().UnixMilli() // one timestamp shared by both events

	events := []map[string]any{
		{
			"type":        "metric",
			"service_id":  r.serviceID,
			"metric_name": "payment_latency_ms",
			"value":       latencyMs,
			"timestamp":   now,
			"tags":        map[string]string{"provider": provider, "status": status},
		},
		{
			"type":        "metric",
			"service_id":  r.serviceID,
			"metric_name": "payment_success_rate",
			"value":       successValue,
			"timestamp":   now,
			"tags":        map[string]string{"provider": provider},
		},
	}

	body, err := json.Marshal(map[string]any{"events": events})
	if err != nil {
		return
	}
	req, err := http.NewRequest(http.MethodPost, r.gatewayURL+"/api/v1/telemetry/batch", bytes.NewReader(body))
	if err != nil {
		return
	}
	req.Header.Set("X-API-Key", r.apiKey)
	req.Header.Set("Content-Type", "application/json")

	// Fire-and-forget: send in the background so telemetry never blocks or fails a payment.
	go func() {
		resp, err := client.Do(req)
		if err != nil {
			return // drop the sample on network errors
		}
		resp.Body.Close() // close the body so the connection can be reused
	}()
}
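One way to call RecordPayment from the payment path is to time the provider call and report the outcome once it returns. The sketch below is illustrative only: `instrument`, `paymentRecorder`, `printRecorder`, `errDeclined`, and the provider name `example-provider` are hypothetical names introduced here, and a print-only recorder stands in for the Reporter so the example runs without a gateway.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// paymentRecorder is the one method this sketch needs.
// The Reporter from reporter.go has a method with this signature.
type paymentRecorder interface {
	RecordPayment(latencyMs float64, success bool, provider string)
}

// errDeclined is a hypothetical provider error used for the failure path.
var errDeclined = errors.New("card declined")

// instrument runs charge, measures its wall-clock latency, and reports the outcome.
// charge is a stand-in for the real provider call.
func instrument(rec paymentRecorder, provider string, charge func() error) error {
	start := time.Now()
	err := charge()
	latencyMs := float64(time.Since(start)) / float64(time.Millisecond)
	rec.RecordPayment(latencyMs, err == nil, provider)
	return err
}

// printRecorder prints instead of sending, so the sketch runs offline.
type printRecorder struct {
	calls []bool // success flag of each recorded payment
}

func (p *printRecorder) RecordPayment(latencyMs float64, success bool, provider string) {
	p.calls = append(p.calls, success)
	fmt.Printf("provider=%s success=%t latency_ms=%.2f\n", provider, success, latencyMs)
}

func main() {
	rec := &printRecorder{}
	_ = instrument(rec, "example-provider", func() error {
		time.Sleep(5 * time.Millisecond) // simulated provider round trip
		return nil
	})
	_ = instrument(rec, "example-provider", func() error { return errDeclined })
}
```

Passing the recorder as an interface keeps the payment code testable without a running gateway; in production the Reporter would be passed in instead of `printRecorder`.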
Key metrics to report:
| Metric | Type | What it signals |
|---|---|---|
| payment_latency_ms | metric | Slowdowns at provider or internal |
| payment_success_rate | metric | Provider degradation or internal errors |
| payment_queue_depth | metric | Backpressure / consumer lag |
| payment_error_count | metric | Error spikes |
| checkout_duration_ms | metric | End-to-end user-facing latency |
| Auth/charge log lines | log | Error context for RCA |
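The Go reporter in Step 2 only emits the first two metrics in the table. A small helper can build events for the remaining ones. This is a sketch, assuming the other metrics use the same event fields and the same `{"events": [...]}` envelope as the batch request in Step 2; `metricEvent`, `newMetricEvent`, and `batchBody` are hypothetical names, and the log event shape is not covered because the recipe does not show it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// metricEvent mirrors the event fields used in reporter.go.
type metricEvent struct {
	Type       string            `json:"type"`
	ServiceID  string            `json:"service_id"`
	MetricName string            `json:"metric_name"`
	Value      float64           `json:"value"`
	Timestamp  int64             `json:"timestamp"` // Unix milliseconds
	Tags       map[string]string `json:"tags,omitempty"`
}

// newMetricEvent builds one metric event stamped with the current time.
func newMetricEvent(serviceID, name string, value float64, tags map[string]string) metricEvent {
	return metricEvent{
		Type:       "metric",
		ServiceID:  serviceID,
		MetricName: name,
		Value:      value,
		Timestamp:  time.Now().UnixMilli(),
		Tags:       tags,
	}
}

// batchBody wraps events in the envelope posted to /api/v1/telemetry/batch.
func batchBody(events ...metricEvent) ([]byte, error) {
	return json.Marshal(map[string]any{"events": events})
}

func main() {
	body, err := batchBody(
		newMetricEvent("payment-service", "payment_queue_depth", 42, nil),
		newMetricEvent("payment-service", "payment_error_count", 3, map[string]string{"provider": "example-provider"}),
	)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(body))
}
```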
Step 3: Configure Anomaly Detection
The defaults work, but payment services benefit from tighter thresholds:
# payment-service anomaly config
curl -X PUT http://infrasage:8080/api/v1/tenants/payments-team/services/payment-service/config \
-H "Authorization: Bearer $JWT" \
-H "Content-Type: application/json" \
-d '{
"watchdog_z_score_threshold": 2.5,
"cooldown_seconds": 120,
"adaptive_threshold_enabled": true,
"metrics": {
"payment_success_rate": {
"z_score_threshold": 2.0,
"direction": "decrease_only"
},
"payment_latency_ms": {
"z_score_threshold": 2.5,
"direction": "increase_only"
}
}
}'
direction: decrease_only on payment_success_rate means InfraSage only alerts if the rate drops, not if it improves.
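To make the interaction between the Z-score threshold and the direction filter concrete, here is a minimal sketch. It is not InfraSage's implementation: it assumes a plain Z-score against a known baseline mean and standard deviation, and it assumes the threshold comparison is inclusive. `zScore` and `shouldAlert` are hypothetical names, and the baseline numbers in `main` are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// zScore returns how many standard deviations value lies from the baseline mean.
// A zero stddev is treated as "no signal" (an assumption of this sketch).
func zScore(value, mean, stddev float64) float64 {
	if stddev == 0 {
		return 0
	}
	return (value - mean) / stddev
}

// shouldAlert applies the threshold with a direction filter:
// "decrease_only" fires only on drops, "increase_only" only on rises,
// and any other value fires in both directions.
func shouldAlert(z, threshold float64, direction string) bool {
	switch direction {
	case "decrease_only":
		return z <= -threshold
	case "increase_only":
		return z >= threshold
	default:
		return math.Abs(z) >= threshold
	}
}

func main() {
	// Illustrative baseline: success rate 99.2% with a stddev of 0.5 points.
	drop := zScore(97.9, 99.2, 0.5)
	rise := zScore(99.9, 99.2, 0.5)
	fmt.Printf("drop: z=%.2f alert=%t\n", drop, shouldAlert(drop, 2.0, "decrease_only"))
	fmt.Printf("rise: z=%.2f alert=%t\n", rise, shouldAlert(rise, 2.0, "decrease_only"))
}
```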
Step 4: Configure Slack Alerting
# Set Slack webhook
curl -X PUT http://infrasage:8080/api/v1/tenants/payments-team/integrations/slack \
-H "Authorization: Bearer $JWT" \
-H "Content-Type: application/json" \
-d '{
"webhook_url": "https://hooks.slack.com/services/...",
"channel": "#payment-incidents",
"severity_filter": ["high", "critical"],
"mention_on_critical": "@payments-oncall"
}'
When an anomaly is detected on payment-service, you'll get a Slack message like:
🔴 [CRITICAL] payment-service — payment_success_rate
Anomaly detected: success rate dropped to 78% (Z-score: -3.8, baseline: 99.2%)
RCA in progress... · payments-team · eu-west-1
Step 5: Set Up a Runbook for Auto-Scaling
If payment latency spikes and RCA identifies pod saturation as the likely cause, automatically scale up:
curl -X POST http://infrasage:8080/api/v1/tenants/payments-team/runbooks \
-H "Authorization: Bearer $JWT" \
-H "Content-Type: application/json" \
-d '{
"name": "Scale payment-service on latency spike",
"trigger": {
"service_id": "payment-service",
"metric_name": "payment_latency_ms",
"condition": "anomaly_detected",
"rca_cause_contains": "pod saturation"
},
"steps": [
{
"type": "kubernetes",
"action": "scale_deployment",
"namespace": "payments",
"deployment": "payment-service",
"replicas_delta": 2,
"max_replicas": 10
},
{
"type": "slack",
"message": "Auto-scaled payment-service to {{new_replicas}} replicas due to latency anomaly"
}
],
"approval_required": false,
"dry_run": false
}'
For destructive runbook steps (e.g., pod restarts, config changes), set "approval_required": true to get a Slack approval button before execution.
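The `replicas_delta` and `max_replicas` fields in the scaling step suggest a bounded increment. The sketch below shows one plausible reading, assuming the step adds the delta to the current replica count, caps the result at the maximum, and never scales down a deployment that is already above the cap. `nextReplicas` is a hypothetical helper, not part of InfraSage.

```go
package main

import "fmt"

// nextReplicas returns the replica count after applying delta, capped at maxReplicas.
// If the deployment is already at or above the cap, it is left unchanged.
func nextReplicas(current, delta, maxReplicas int) int {
	if current >= maxReplicas {
		return current
	}
	next := current + delta
	if next > maxReplicas {
		next = maxReplicas
	}
	return next
}

func main() {
	// Same values as the runbook above: replicas_delta=2, max_replicas=10.
	for _, current := range []int{3, 9, 10} {
		fmt.Printf("current=%d -> next=%d\n", current, nextReplicas(current, 2, 10))
	}
}
```

Under this reading, repeated triggers during a long incident can add at most `max_replicas - current` pods in total, which keeps an unattended runbook from scaling without bound.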
Step 6: Verify the Pipeline
Send a synthetic spike to test the full pipeline:
# Send normal baseline for 5 minutes
for i in $(seq 1 20); do
curl -s -X POST http://infrasage:8080/api/v1/telemetry \
-H "X-API-Key: $INFRASAGE_API_KEY" \
-H "Content-Type: application/json" \
-d "{\"type\":\"metric\",\"service_id\":\"payment-service\",\"metric_name\":\"payment_latency_ms\",\"value\":$((RANDOM % 30 + 130)),\"timestamp\":$(date +%s000)}"
sleep 15
done
# Send a spike
curl -s -X POST http://infrasage:8080/api/v1/telemetry \
-H "X-API-Key: $INFRASAGE_API_KEY" \
-H "Content-Type: application/json" \
-d '{"type":"metric","service_id":"payment-service","metric_name":"payment_latency_ms","value":980,"timestamp":'$(date +%s000)'}'
Within ~60 seconds (one Watchdog cycle), you should see:
- Anomaly created in the Admin UI
- RCA running (check /api/v1/anomalies for status: analyzing)
- Slack notification in #payment-incidents
- RCA complete with root cause summary
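If no anomaly appears, it can help to sanity-check that the synthetic spike is large enough relative to the baseline. The sketch below assumes a plain Z-score over the baseline samples using the population standard deviation. The Watchdog's actual statistics may differ, especially with adaptive thresholds enabled. `meanStddev`, `baseline`, and `spikeZ` are hypothetical helpers, and `baseline` uses a deterministic sequence in place of the shell loop's random values.

```go
package main

import (
	"fmt"
	"math"
)

// meanStddev returns the mean and population standard deviation of xs.
func meanStddev(xs []float64) (mean, stddev float64) {
	if len(xs) == 0 {
		return 0, 0
	}
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	var sumSq float64
	for _, x := range xs {
		d := x - mean
		sumSq += d * d
	}
	return mean, math.Sqrt(sumSq / float64(len(xs)))
}

// baseline returns n deterministic samples in [130, 160), the same range
// the shell loop draws from with RANDOM % 30 + 130.
func baseline(n int) []float64 {
	out := make([]float64, n)
	for i := range out {
		out[i] = float64(130 + (i*7)%30)
	}
	return out
}

// spikeZ returns the Z-score of spike against the samples.
// A zero spread is treated as "no signal" (an assumption of this sketch).
func spikeZ(samples []float64, spike float64) float64 {
	mean, sd := meanStddev(samples)
	if sd == 0 {
		return 0
	}
	return (spike - mean) / sd
}

func main() {
	z := spikeZ(baseline(20), 980)
	fmt.Printf("z-score of the 980 ms spike: %.1f (configured threshold: 2.5)\n", z)
}
```

With a baseline confined to a 30 ms band, a 980 ms sample sits far beyond the 2.5 threshold from Step 3, so a missing anomaly more likely points to ingestion or configuration than to the spike size.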
What Good Looks Like
After a week of baseline data, InfraSage's ML Engine builds a seasonal model for your payment metrics. You should see:
- Adaptive thresholds that relax during low-traffic hours (nights/weekends) and tighten during peak hours
- Forecasts showing expected latency ranges for the next 6 hours
- Degradation trends surfacing slow creep in latency before it becomes an incident
Check the ML forecast endpoint:
curl "http://infrasage:8080/api/v1/ml/forecast?service_id=payment-service&metric=payment_latency_ms&horizon_minutes=360" \
-H "X-API-Key: $INFRASAGE_API_KEY"
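The same forecast request can be built from Go. This sketch uses only the endpoint, query parameters, and header shown in the curl command; `newForecastRequest` is a hypothetical helper and `example-key` is a placeholder. The response format is not described in this recipe, so the example stops at constructing the request.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// newForecastRequest builds a GET request for the ML forecast endpoint.
// Query values are escaped by url.Values, so IDs with special characters are safe.
func newForecastRequest(baseURL, apiKey, serviceID, metric string, horizonMinutes int) (*http.Request, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return nil, err
	}
	u.Path = "/api/v1/ml/forecast"

	q := url.Values{}
	q.Set("service_id", serviceID)
	q.Set("metric", metric)
	q.Set("horizon_minutes", strconv.Itoa(horizonMinutes))
	u.RawQuery = q.Encode() // Encode sorts parameters by key

	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-API-Key", apiKey)
	return req, nil
}

func main() {
	req, err := newForecastRequest("http://infrasage:8080", "example-key", "payment-service", "payment_latency_ms", 360)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	// To send it: resp, err := http.DefaultClient.Do(req), then close resp.Body.
}
```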
Related: Anomaly Detection · Root Cause Analysis · Runbooks · Slack Integration