Pre-Onboarding Checklist
This guide covers the work your org should complete before deploying InfraSage. Skipping this phase leads to noisy anomaly detection, poor RCA quality, and unnecessary rework after go-live.
Plan for 1–3 days depending on how much existing instrumentation you have.
Why This Matters
InfraSage's anomaly detection and RCA quality are direct functions of data quality. If your services use inconsistent metric names, missing trace context, or unstructured logs, InfraSage still ingests the data — but the Watchdog can't build accurate baselines, RCA can't build a causal graph, and the signal-to-noise ratio will be low.
Fixing these things before you connect InfraSage is much cheaper than fixing them after.
Phase 1: Audit What You Have
1.1 Inventory your services
Create a list of every service you intend to monitor. For each one, record:
| Service | Language | Framework | Current instrumentation | Emits metrics? | Emits traces? | Emits structured logs? |
|---|---|---|---|---|---|---|
| payment-api | Go | gin | Prometheus /metrics | ✅ | ❌ | ✅ |
| auth-service | Python | FastAPI | none | ❌ | ❌ | ❌ |
| worker | Node.js | — | Datadog agent | ✅ | ✅ | ✅ |
This table reveals your biggest gaps before you start any changes.
1.2 Check existing metric names for consistency
Inconsistent naming is the single most common issue. Run this across your Prometheus or existing metrics store:
# List all metric names currently being scraped
curl -s http://your-prometheus/api/v1/label/__name__/values | jq '.data[]' | sort
Look for:
- Duplicate concepts under different names (request_duration_ms vs latency_ms vs http_req_duration)
- Missing unit suffixes (request_count vs request_count_total, latency vs latency_ms)
- Mixed naming conventions (camelCase, kebab-case, snake_case coexisting)
Fix these before ingesting into InfraSage. The ML Engine treats latency_ms and latency as different metrics — it won't correlate them.
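The checks above can be partly automated once you have the metric names in a list. The following is a minimal sketch, not an InfraSage tool: the helper names, the suffix list, and the allowlist of unitless names are all assumptions to adapt to your own conventions.

```python
import re

# Unit suffixes accepted by this sketch (an assumption; extend as needed).
UNIT_SUFFIXES = ("_ms", "_seconds", "_bytes", "_total", "_percent", "_ratio")
# Dimensionless names that legitimately carry no unit suffix (also an assumption).
ALLOWED_UNITLESS = {"queue_depth"}

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def classify_metric_name(name: str) -> list[str]:
    """Return the naming problems found in one metric name."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("not snake_case")
    if name not in ALLOWED_UNITLESS and not name.endswith(UNIT_SUFFIXES):
        problems.append("missing unit suffix")
    return problems


def audit(names: list[str]) -> dict[str, list[str]]:
    """Map each problematic metric name to its problems; clean names are omitted."""
    return {n: p for n in names if (p := classify_metric_name(n))}


if __name__ == "__main__":
    sample = ["http_requests_total", "requestCount", "latency", "queue-depth"]
    for name, problems in audit(sample).items():
        print(f"{name}: {', '.join(problems)}")
```

This only catches convention and suffix problems. Duplicate concepts under different names still need a human to review the list.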
1.3 Audit log structure
Pick 5–10 representative log lines from each service. Ask:
- Are logs JSON-structured, or raw strings?
- Is there a consistent severity/level field with standard values (info, warn, error)?
- Is there a service or service_id field in every line?
- Do error logs include the error type and a stack trace?
- Do logs contain PII (email, user ID, card number, IP addresses)?
# Sample recent logs from a Kubernetes pod
kubectl logs deployment/payment-api --tail=20 -n production
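For more than a handful of lines, the first three questions can be checked mechanically. This is a rough sketch under assumptions: `audit_log_line` is a hypothetical helper, and the accepted level values are the ones this guide standardizes on later. PII and stack-trace checks still need a manual read.

```python
import json

# Levels this guide treats as standard (see section 2.3).
STANDARD_LEVELS = {"debug", "info", "warn", "error", "critical", "fatal"}


def audit_log_line(line: str) -> list[str]:
    """Return the structural problems found in a single log line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not JSON"]
    if not isinstance(record, dict):
        return ["not a JSON object"]

    problems = []
    level = record.get("severity", record.get("level"))
    if level is None:
        problems.append("no severity/level field")
    elif str(level).lower() not in STANDARD_LEVELS:
        problems.append(f"non-standard level: {level}")
    if "service" not in record and "service_id" not in record:
        problems.append("no service/service_id field")
    return problems


if __name__ == "__main__":
    samples = [
        'payment failed: connection timeout',
        '{"severity": "error", "service_id": "payment-api", "message": "charge failed"}',
        '{"level": "WARNING", "message": "slow query"}',
    ]
    for line in samples:
        print(audit_log_line(line) or "ok", "<-", line)
```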
1.4 Check trace context propagation
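If your services make HTTP calls to each other, check whether they pass trace context. curl does not add a traceparent header on its own, so send one explicitly and look for its trace ID downstream:
# Send a request with a known W3C traceparent to payment-api, then check that
# fraud-service logged the same trace ID. PORT and PATH are placeholders for a
# payment-api route that calls fraud-service.
kubectl exec -n production deploy/payment-api -- \
curl -s -H 'traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01' 'http://localhost:PORT/PATH'
kubectl logs -n production deploy/fraud-service --tail=50 | grep 4bf92f3577b34da6a3ce929d0e0e4736
If the trace ID never reaches the downstream service, the trace is broken. RCA cannot follow failure paths across service boundaries without this.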
Phase 2: Standardize Before You Connect
2.1 Define your service ID scheme
service_id is the primary key for all telemetry in InfraSage. Pick a convention now and enforce it everywhere. You cannot change it without resetting anomaly baselines.
Recommended patterns:
# Simple flat (good for small orgs)
payment-api
auth-service
api-gateway
# Team-prefixed (good for multi-team orgs)
payments/checkout-api
payments/fraud-engine
identity/auth-service
platform/api-gateway
Rules:
- Use snake_case or kebab-case consistently, not both
- Do not use pod names, container IDs, or deployment hashes (these change per deploy)
- The value must match service.name in your OTEL resource attributes
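The first two rules above can be enforced in CI. This is a sketch under assumptions: `check_service_id` is a hypothetical helper, and the pod-name pattern is only a heuristic that can misfire on legitimate names with similarly shaped suffixes.

```python
import re

KEBAB = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")
SNAKE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Heuristic only: Deployment-managed pod names usually end in "-<hash>-<5 chars>".
POD_SUFFIX = re.compile(r"-[a-z0-9]{8,10}-[a-z0-9]{5}$")


def check_service_id(service_id: str, style: str = "kebab") -> list[str]:
    """Return rule violations; each '/'-separated segment is checked against one style."""
    pattern = KEBAB if style == "kebab" else SNAKE
    problems = []
    if POD_SUFFIX.search(service_id):
        problems.append("looks like a pod name")
    for segment in service_id.split("/"):
        if not pattern.match(segment):
            problems.append(f"segment {segment!r} does not match the {style} style")
    return problems


if __name__ == "__main__":
    for sid in ["payments/checkout-api", "auth_service", "payment-api-7d9f8b6c5d-x2k4q"]:
        print(sid, "->", check_service_id(sid) or "ok")
```

Picking one `style` for the whole org and passing it everywhere is what makes the "not both" rule checkable.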
2.2 Standardize metric names
Before connecting InfraSage, rename metrics to a consistent convention. Use snake_case with unit suffix:
# Rename these before connecting
http_req_duration → http_request_duration_ms
request_count → http_requests_total
mem_usage → memory_used_bytes
cpu_pct → cpu_usage_percent
For Prometheus exporters, metric renames can be done in the OTEL Collector's metricstransform processor (see Phase 3) without touching service code.
Standard metric names InfraSage's ML models recognize across services:
| Metric | Expected name | Unit |
|---|---|---|
| HTTP request latency | http_request_duration_ms | milliseconds |
| HTTP request rate | http_requests_total | count (rate) |
| HTTP error rate | http_errors_total | count (rate) |
| CPU usage | cpu_usage_percent | 0–100 |
| Memory usage | memory_used_bytes | bytes |
| GC pause | gc_pause_ms | milliseconds |
| DB query latency | db_query_duration_ms | milliseconds |
| Queue depth | queue_depth | count |
| Cache hit ratio | cache_hit_ratio | 0–1 |
Using these exact names enables cross-service ML baseline comparison out of the box.
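To see how far your current names are from the standard set, you can plan the renames in a few lines. This sketch reuses only the renames and standard names listed above; `plan_renames` is a hypothetical helper, and it does not handle unit conversion (for example seconds to milliseconds).

```python
# Renames taken from the list above; extend with your own.
RENAMES = {
    "http_req_duration": "http_request_duration_ms",
    "request_count": "http_requests_total",
    "mem_usage": "memory_used_bytes",
    "cpu_pct": "cpu_usage_percent",
}

# Standard names from the table above.
STANDARD_NAMES = {
    "http_request_duration_ms", "http_requests_total", "http_errors_total",
    "cpu_usage_percent", "memory_used_bytes", "gc_pause_ms",
    "db_query_duration_ms", "queue_depth", "cache_hit_ratio",
}


def plan_renames(current_names):
    """Split names into (renames to apply, names with no known standard equivalent)."""
    to_rename, unmapped = {}, []
    for name in current_names:
        if name in STANDARD_NAMES:
            continue  # already standard, nothing to do
        if name in RENAMES:
            to_rename[name] = RENAMES[name]
        else:
            unmapped.append(name)
    return to_rename, unmapped


if __name__ == "__main__":
    renames, unmapped = plan_renames(["cpu_pct", "queue_depth", "latency"])
    print("rename:", renames)
    print("needs a decision:", unmapped)
```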
2.3 Standardize log severity levels
InfraSage maps log severity to anomaly signals. Use these exact values:
| Level | When to use |
|---|---|
| debug | Developer-only; strip before production ingestion |
| info | Normal operational events |
| warn | Degraded but not failing |
| error | Operation failed, requires attention |
| critical or fatal | Service-level failure |
OTEL maps these to SeverityNumber automatically. For services emitting plain text logs, configure the OTEL Collector's severity parser (see Phase 3).
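As an illustration of that mapping, the sketch below normalizes free-form level strings to the values above and to the base SeverityNumber of each range in the OpenTelemetry logs data model. The alias list and the choice to treat critical like fatal (following the table's grouping) are assumptions, and `normalize_severity` is a hypothetical helper.

```python
# Base SeverityNumber of each range in the OpenTelemetry logs data model.
# "critical" shares the FATAL base here because the table above groups them (an assumption).
SEVERITY_NUMBER = {
    "debug": 5, "info": 9, "warn": 13, "error": 17, "critical": 21, "fatal": 21,
}

# Spellings often seen in plain-text logs; adjust to what your services emit.
ALIASES = {"warning": "warn", "err": "error", "crit": "critical"}


def normalize_severity(raw: str) -> tuple[str, int]:
    """Return (standard level, SeverityNumber) for a raw level string."""
    level = raw.strip().lower()
    level = ALIASES.get(level, level)
    if level not in SEVERITY_NUMBER:
        raise ValueError(f"unknown severity: {raw!r}")
    return level, SEVERITY_NUMBER[level]


if __name__ == "__main__":
    for raw in ["INFO", " Warning ", "critical"]:
        print(repr(raw), "->", normalize_severity(raw))
```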
2.4 Switch to structured logging
Services emitting unstructured logs lose most of their RCA value. Switch to JSON logging before onboarding.
Go — zerolog:
import "github.com/rs/zerolog/log"
// Before
log.Printf("payment failed: %v", err)
// After
log.Error().
Str("service_id", "payment-api").
Str("trace_id", span.SpanContext().TraceID().String()).
Str("provider", "stripe").
Float64("amount_cents", amount).
Err(err).
Msg("payment failed")
Python — structlog:
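import logging
import structlog
# Render events as JSON; structlog's default output is a human-readable console format
structlog.configure(processors=[
    structlog.processors.add_log_level,
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.JSONRenderer(),
])
log = structlog.get_logger()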
# Before
logging.error(f"payment failed: {err}")
# After
log.error("payment_failed",
service_id="payment-api",
trace_id=trace_id,
provider="stripe",
amount_cents=amount,
error=str(err))
Node.js — pino:
const logger = require('pino')()
// Before
console.error(`payment failed: ${err.message}`)
// After
logger.error({
service_id: 'payment-api',
trace_id: span.spanContext().traceId,
provider: 'stripe',
amount_cents: amount,
err
}, 'payment failed')
Minimum fields every log line should include:
{
"timestamp": "2026-05-01T09:12:33Z",
"severity": "error",
"service_id": "payment-api",
"message": "payment charge failed",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
trace_id is what links a log line to the RCA causal graph. Without it, the log is useful context but can't be automatically correlated to the triggering trace.
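If a service cannot adopt a structured logging library yet, the standard library is enough to emit the minimum fields. This is a sketch under assumptions: `MinimalJsonFormatter` is a hypothetical class, and it expects the caller to pass `trace_id` through `extra=`.

```python
import json
import logging
from datetime import datetime, timezone

# Map Python level names to the severity values used in this guide.
SEVERITY = {"DEBUG": "debug", "INFO": "info", "WARNING": "warn",
            "ERROR": "error", "CRITICAL": "critical"}


class MinimalJsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the minimum fields listed above."""

    def __init__(self, service_id: str):
        super().__init__()
        self.service_id = service_id

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "severity": SEVERITY.get(record.levelname, record.levelname.lower()),
            "service_id": self.service_id,
            "message": record.getMessage(),
            # Empty when the caller did not supply a trace ID via extra=
            "trace_id": getattr(record, "trace_id", ""),
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(MinimalJsonFormatter("payment-api"))
    logger = logging.getLogger("payment")
    logger.addHandler(handler)
    logger.error("payment charge failed",
                 extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```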
2.5 Add trace context propagation
Every HTTP client in every service should forward W3C TraceContext headers. This is what enables RCA to follow a failure from the symptom service back to the root cause.
Go — using otelhttp:
import (
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)
// Wrap your HTTP client's transport
client := http.Client{
Transport: otelhttp.NewTransport(http.DefaultTransport),
}
// This now automatically injects traceparent into outbound requests
resp, err := client.Get("http://fraud-service/check")
Python — using opentelemetry-instrumentation-requests:
from opentelemetry.instrumentation.requests import RequestsInstrumentor
RequestsInstrumentor().instrument()
# All requests.get/post calls now propagate trace context automatically
import requests
resp = requests.get("http://fraud-service/check")
Node.js — using opentelemetry-instrumentation-http:
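const { registerInstrumentations } = require('@opentelemetry/instrumentation')
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http')
// Register in your SDK setup; all http/https calls then propagate trace context automatically
registerInstrumentations({ instrumentations: [new HttpInstrumentation()] })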
Verify propagation is working:
# Make a cross-service request and check the downstream service's logs
# Both should show the same trace_id
kubectl logs deploy/payment-api --tail=5 | jq '.trace_id'
kubectl logs deploy/fraud-service --tail=5 | jq '.trace_id'
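When you inspect headers by hand, it helps to know what a valid traceparent looks like. The sketch below validates the version 00 format from the W3C Trace Context specification and extracts the trace ID. `parse_traceparent` is a hypothetical helper, and the sketch rejects headers from future versions that append extra fields.

```python
import re

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff is forbidden, and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest flag bit is the "sampled" flag
    return trace_id, parent_id, sampled


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(parse_traceparent(header))
```

The trace ID returned here is the value that should appear as trace_id in both services' logs.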
2.6 Strip PII from logs before ingestion
Identify log fields that contain personal data and strip or hash them at the collector level — before they reach InfraSage's ClickHouse storage.
Common PII fields to strip:
| Field | Action |
|---|---|
| user_email, email | Drop or hash |
| user_id, customer_id | Hash (preserve for correlation without exposing identity) |
| ip_address, client_ip | Drop or truncate to /24 subnet |
| card_number, pan | Drop — never log |
| auth_token, api_key, password | Drop — never log |
| ssn, dob, aadhaar_number | Drop |
The OTEL Collector redaction processor handles this automatically (see Phase 3). InfraSage also provides server-side field exclusions as a second layer, but stripping at the collector is cleaner.
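To make the actions in the table concrete, here is a sketch of the same policy applied to one log record in application code. It is not how the collector or InfraSage implements redaction. `scrub` and `truncate_ip` are hypothetical helpers, and the keyed hash (HMAC-SHA256) is a choice made for this sketch because plain hashes of short IDs can be reversed by enumeration.

```python
import hashlib
import hmac
import ipaddress

DROP_FIELDS = {"user_email", "email", "card_number", "pan", "auth_token",
               "api_key", "password", "ssn", "dob", "aadhaar_number"}
HASH_FIELDS = {"user_id", "customer_id"}
IP_FIELDS = {"ip_address", "client_ip"}


def truncate_ip(value: str) -> str:
    """Truncate an IPv4 address to its /24 network; return "" for anything else."""
    try:
        address = ipaddress.ip_address(value)
    except ValueError:
        return ""
    if address.version != 4:
        return ""
    return str(ipaddress.ip_network(f"{address}/24", strict=False))


def scrub(record: dict, key: bytes) -> dict:
    """Return a copy of the record with PII dropped, hashed, or truncated."""
    clean = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue
        if field in HASH_FIELDS:
            # Same key + same ID gives the same digest, so correlation still works.
            clean[field] = hmac.new(key, str(value).encode(), hashlib.sha256).hexdigest()
        elif field in IP_FIELDS:
            clean[field] = truncate_ip(str(value))
        else:
            clean[field] = value
    return clean


if __name__ == "__main__":
    record = {"user_id": 42, "email": "a@example.com",
              "client_ip": "203.0.113.57", "message": "charge failed"}
    print(scrub(record, key=b"rotate-me"))
```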
Phase 3: Set Up the OTEL Collector
The OTEL Collector is the recommended ingestion path. It decouples your services from InfraSage, handles buffering and batching, and lets you transform data without changing service code.
3.1 Deploy the collector
Deploy the collector in each cluster either as a DaemonSet (one pod per node) or as a Deployment (centralized). The manifest below uses a DaemonSet:
# otel-collector-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.100.0
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - containerPort: 4317 # OTLP gRPC
            - containerPort: 4318 # OTLP HTTP
            - containerPort: 8888 # collector metrics
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "pods", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: observability
3.2 Full collector config for InfraSage
This config covers the standard pre-onboarding pipeline: receive from services, apply transforms, scrape Kubernetes, and forward to InfraSage.
# otel-collector-config.yaml
receivers:
  # Receive OTLP from your instrumented services
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Scrape Prometheus endpoints (for services using Prometheus client libraries)
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              target_label: __metrics_path__
              regex: (.+)
            - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name]
              action: replace
              target_label: service_id
              separator: /
            # Override with app label if present
            - source_labels: [__meta_kubernetes_pod_label_app]
              target_label: service_id
              regex: (.+)
  # Kubernetes cluster events and node metrics
  k8s_cluster:
    collection_interval: 30s
    node_conditions_to_report: [Ready, MemoryPressure, DiskPressure]
    allocatable_types_to_report: [cpu, memory]
processors:
  # Batch before sending to reduce HTTP overhead
  batch:
    send_batch_size: 5000
    timeout: 10s
  # Add Kubernetes resource attributes to all telemetry
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.pod.name
        - k8s.node.name
        - k8s.container.name
      labels:
        - tag_name: service_id
          key: app
          from: pod
        - tag_name: version
          key: app.kubernetes.io/version
          from: pod
  # Rename metrics to InfraSage standard naming convention
  metricstransform:
    transforms:
      # Common renames; add yours here
      - include: http_request_duration_seconds
        action: update
        new_name: http_request_duration_ms
        operations:
          - action: experimental_scale_value
            experimental_scale: 1000 # convert seconds → ms
      - include: process_resident_memory_bytes
        action: update
        new_name: memory_used_bytes
      # NOTE: process_cpu_seconds_total is a cumulative counter of CPU seconds, not a
      # percentage. Renaming alone does not convert it; verify how InfraSage interprets
      # this series before relying on it as cpu_usage_percent.
      - include: process_cpu_seconds_total
        action: update
        new_name: cpu_usage_percent
  # Strip PII from log bodies and attributes
  redaction:
    allow_all_keys: true
    blocked_values:
      # Credit card numbers (PCI)
      - "\\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\\b"
      # Email addresses
      - "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b"
    summary: debug
  # Drop specific attribute keys that contain PII
  attributes/drop_pii:
    actions:
      - key: user.email
        action: delete
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: hash # values stay comparable for debugging; the hash is irreversible
  # Drop debug-level logs before sending to InfraSage
  filter/drop_debug_logs:
    logs:
      exclude:
        match_type: strict
        # strict matching is case-sensitive, so list both spellings
        severity_texts: ["DEBUG", "debug", "TRACE", "trace"]
  # Resource detection adds cloud provider, region, cluster info
  resourcedetection:
    detectors: [env, k8snode, eks, gcp, aks]
    timeout: 5s
exporters:
  # Forward everything to InfraSage
  otlphttp/infrasage:
    endpoint: http://infrasage-gateway.infrasage.svc.cluster.local:8080
    headers:
      X-API-Key: "${env:INFRASAGE_API_KEY}"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
  # Keep a local Prometheus endpoint for Grafana (optional; run both in parallel)
  prometheus:
    endpoint: "0.0.0.0:8889"
service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus, k8s_cluster]
      processors: [resourcedetection, k8sattributes, metricstransform, batch]
      exporters: [otlphttp/infrasage, prometheus]
    traces:
      receivers: [otlp]
      processors: [resourcedetection, k8sattributes, batch]
      exporters: [otlphttp/infrasage]
    logs:
      receivers: [otlp]
      processors: [resourcedetection, k8sattributes, redaction, attributes/drop_pii, filter/drop_debug_logs, batch]
      exporters: [otlphttp/infrasage]
3.3 Store the API key as a Secret
kubectl create secret generic otel-collector-secret \
--from-literal=INFRASAGE_API_KEY=isage_your_key_here \
-n observability
Reference it in the DaemonSet:
env:
  - name: INFRASAGE_API_KEY
    valueFrom:
      secretKeyRef:
        name: otel-collector-secret
        key: INFRASAGE_API_KEY
3.4 Validate the collector is running
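# Check collector health. This needs the health_check extension (not enabled in the
# config in 3.2) and a Service named otel-collector in front of the collector pods.
# Run each port-forward in a separate terminal; it blocks until interrupted.
kubectl port-forward -n observability svc/otel-collector 13133:13133
curl http://localhost:13133/
# Check collector's own metrics
kubectl port-forward -n observability svc/otel-collector 8888:8888
curl http://localhost:8888/metrics | grep otelcol_exporter_sent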
You should see otelcol_exporter_sent_spans, otelcol_exporter_sent_metric_points, and otelcol_exporter_sent_log_records incrementing.
Phase 4: Kubernetes Preparation
4.1 RBAC for InfraSage
InfraSage's AIops Engine needs read access to your cluster to gather RCA evidence (pod state, deployment history, events). Create this before deploying:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: infrasage-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "endpoints", "nodes", "events", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: infrasage-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: infrasage-reader
subjects:
  - kind: ServiceAccount
    name: infrasage-aiops
    namespace: infrasage
For runbook execution (optional), InfraSage also needs write access to specific resources:
# Only apply if you're enabling runbook automation
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: infrasage-operator
rules:
  - apiGroups: ["apps"]
    resources: ["deployments/scale", "deployments"]
    verbs: ["get", "patch", "update"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["delete"] # for pod restart runbook actions
4.2 Network policy
Ensure the OTEL Collector and your services can reach the InfraSage Ingestion Gateway:
# Test connectivity from a pod before deploying InfraSage
kubectl run network-test --image=curlimages/curl -it --rm -- \
curl -v http://infrasage-gateway.infrasage.svc.cluster.local:8080/health
4.3 Namespace and resource quotas
InfraSage components need headroom. Verify the infrasage namespace has sufficient quota:
kubectl describe resourcequota -n infrasage
Recommended minimums for the infrasage namespace:
- CPU: 4 cores request, 8 cores limit
- Memory: 8Gi request, 16Gi limit
- PVC storage: 100Gi (for ClickHouse)
Phase 5: Pre-Flight Checks
Run these before deploying InfraSage.
5.1 Verify OTEL data is flowing
Start the collector pointing at a test endpoint (not InfraSage yet) and confirm telemetry is arriving:
# Temporary debug exporter — logs all data to stdout
exporters:
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200
service:
  pipelines:
    metrics:
      exporters: [debug]
Confirm you see your expected service_id values, metric names (in the right format), and no obvious PII in log bodies.
5.2 Cardinality check
High-cardinality metric series will cause ClickHouse storage growth and slow queries. Check your label cardinality before connecting:
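# In Prometheus: list the 20 metrics with the most series
# (-G with --data-urlencode keeps the braces, quotes, and spaces in the query intact)
curl -s -G 'http://your-prometheus/api/v1/query' \
--data-urlencode 'query=count by (__name__) ({__name__=~".+"})' | \
jq '.data.result | sort_by(.value[1] | tonumber) | reverse | .[:20]'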
Any metric with >1,000 series warrants review. Common culprits: per-user-ID labels, request IDs, trace IDs embedded as labels. Fix these before ingesting.
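If you would rather post-process the query result than read it by eye, the sketch below flags metrics above the 1,000-series review threshold. It assumes the standard Prometheus instant-query response shape for a count-by-name query; `high_cardinality` is a hypothetical helper.

```python
SERIES_LIMIT = 1000  # review threshold from the text above


def high_cardinality(response: dict, limit: int = SERIES_LIMIT) -> list[tuple[str, int]]:
    """Return (metric name, series count) pairs above the limit, largest first."""
    flagged = []
    for sample in response["data"]["result"]:
        name = sample["metric"].get("__name__", "<unnamed>")
        # Prometheus encodes sample values as strings: [timestamp, "value"]
        count = int(float(sample["value"][1]))
        if count > limit:
            flagged.append((name, count))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    example = {"status": "success", "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "http_requests_total"}, "value": [1714554753, "1500"]},
        {"metric": {"__name__": "queue_depth"}, "value": [1714554753, "12"]},
    ]}}
    for name, count in high_cardinality(example):
        print(f"{name}: {count} series")
```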
5.3 Document your service dependency graph
Write down (or generate from existing traces) the upstream/downstream dependencies for your top 10 services. InfraSage builds this automatically from traces, but having it written down lets you validate RCA results in the first week.
payment-api
→ fraud-service (sync, gRPC)
→ ledger-service (sync, HTTP)
→ notification-service (async, Kafka)
fraud-service
→ ml-scoring-service (sync, gRPC)
→ redis-cache (sync)
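Keeping the graph in a machine-readable form makes it easy to ask which services could be the root cause behind a symptom. This sketch encodes only the example edges above; `downstream` is a hypothetical helper, not an InfraSage API.

```python
# Edges copied from the example above: service -> direct downstream dependencies.
DEPENDENCIES = {
    "payment-api": ["fraud-service", "ledger-service", "notification-service"],
    "fraud-service": ["ml-scoring-service", "redis-cache"],
}


def downstream(service, graph=DEPENDENCIES):
    """Return every service reachable from `service` (candidate root-cause locations)."""
    seen, stack = set(), [service]
    while stack:
        for dep in graph.get(stack.pop(), []):
            if dep not in seen:  # the seen-set also protects against cycles
                seen.add(dep)
                stack.append(dep)
    return seen


if __name__ == "__main__":
    print(sorted(downstream("payment-api")))
```

When an RCA result names a service outside this set for a payment-api incident, either the documented graph or the trace data is incomplete.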
5.4 Define service tiers
Decide which services are tier-1 (customer-facing, revenue-critical) before you configure anomaly detection sensitivity. Tier-1 services should use tighter thresholds (Z-score 2.5) and never have cooldowns suppressing alerts.
Tier 1 (tightest thresholds, immediate alerts):
payment-api, api-gateway, auth-service
Tier 2 (standard thresholds):
fraud-service, ledger-service, user-service
Tier 3 (relaxed thresholds, batch/async):
notification-service, reporting-service, data-pipeline
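To build intuition for what a Z-score threshold means per tier, the sketch below scores one new value against a baseline window. It is not InfraSage's detection algorithm. Only the tier-1 value of 2.5 comes from the text above; the tier-2 and tier-3 values and the `is_anomalous` helper are placeholders.

```python
import statistics

# Tier 1 uses the 2.5 from the text above; tiers 2 and 3 are placeholder values.
Z_THRESHOLDS = {1: 2.5, 2: 3.0, 3: 4.0}


def is_anomalous(value, baseline, tier):
    """Return True if `value` deviates from the baseline by more than the tier's Z-score."""
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)  # sample standard deviation; needs >= 2 points
    if stdev == 0:
        return value != mean  # flat baseline: any change counts as a deviation
    return abs(value - mean) / stdev > Z_THRESHOLDS[tier]


if __name__ == "__main__":
    latency_ms = [100, 102, 98, 101, 99]
    for tier in (1, 2, 3):
        print(f"tier {tier}: anomalous={is_anomalous(104.5, latency_ms, tier)}")
```

The same reading can alert on a tier-1 service and stay quiet on a tier-2 one, which is why the tier list needs to be settled before sensitivity is configured.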
Readiness Checklist
Complete this before deploying InfraSage.
Data quality
- All services have a stable service_id defined and agreed upon
- Metric names follow snake_case with unit suffix
- No duplicate metric names for the same measurement across services
- All services emit structured (JSON) logs
- Every log line includes severity, service_id, and trace_id fields
- PII fields are identified and will be stripped at the collector
Tracing
- OTEL SDK is initialized in every service (or instrumentation library is registered)
- W3C traceparent header is forwarded on all outbound HTTP/gRPC calls
- Span status is set to error on exceptions/failures (not just logged)
- Trace sampling rate is configured (recommend 100% for low-volume, tail-based for high-volume)
OTEL Collector
- Collector DaemonSet (or Deployment) is running in the cluster
- Collector is receiving from all instrumented services (check otelcol_receiver_accepted_* metrics)
- k8sattributes processor is adding k8s.namespace.name, k8s.deployment.name
- Metric renames are applied for any non-standard names
- PII redaction processor is active on the logs pipeline
- Debug logs are filtered before export
- batch processor is configured (reduces HTTP overhead ~10x)
Kubernetes
- infrasage-reader ClusterRole and ClusterRoleBinding exist
- Network policy allows collector → InfraSage gateway traffic
- infrasage namespace has sufficient resource quota
Org
- Service tier classification documented (tier 1/2/3)
- Service dependency graph documented for top 10 services
- On-call contacts identified per service
- Anomaly detection sensitivity targets decided per tier
Next step: Deploy InfraSage →