Pre-Onboarding Checklist

This guide covers the work your org should complete before deploying InfraSage. Skipping this phase leads to noisy anomaly detection, poor RCA quality, and unnecessary rework after go-live.

Plan for 1–3 days depending on how much existing instrumentation you have.


Why This Matters

InfraSage's anomaly detection and RCA quality are direct functions of data quality. If your services use inconsistent metric names, missing trace context, or unstructured logs, InfraSage still ingests the data — but the Watchdog can't build accurate baselines, RCA can't build a causal graph, and the signal-to-noise ratio will be low.

Fixing these things before you connect InfraSage is much cheaper than fixing them after.


Phase 1: Audit What You Have

1.1 Inventory your services

Create a list of every service you intend to monitor. For each one, record:

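| Service | Language | Framework | Current instrumentation | Emits metrics? | Emits traces? | Emits structured logs? |
|---|---|---|---|---|---|---|
| payment-api | Go | gin | Prometheus /metrics | | | |
| auth-service | Python | FastAPI | none | | | |
| worker | Node.js | | Datadog agent | | | |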

This table reveals your biggest gaps before you start any changes.

1.2 Check existing metric names for consistency

Inconsistent naming is the single most common issue. Run this across your Prometheus or existing metrics store:

# List all metric names currently being scraped
curl -s http://your-prometheus/api/v1/label/__name__/values | jq '.data[]' | sort

Look for:

  • Duplicate concepts under different names (request_duration_ms vs latency_ms vs http_req_duration)
  • Missing unit suffixes (request_count vs request_count_total, latency vs latency_ms)
  • Mixed naming conventions (camelCase, kebab-case, snake_case coexisting)

Fix these before ingesting into InfraSage. The ML Engine treats latency_ms and latency as different metrics — it won't correlate them.
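
The checks above can be partly automated. The following is a minimal sketch, assuming the metric names have been exported to a list; the suffix list and the unitless allowlist are illustrative assumptions, not an InfraSage requirement.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Hypothetical list of accepted unit suffixes; extend it to match your convention.
UNIT_SUFFIXES = ("_ms", "_seconds", "_bytes", "_total", "_percent", "_ratio")
# Unitless gauges that are acceptable without a suffix (assumption).
UNITLESS_OK = {"queue_depth"}


def audit_metric_names(names):
    """Return {metric_name: [problems]} for names that break the convention."""
    findings = {}
    for name in names:
        problems = []
        if not SNAKE_CASE.match(name):
            problems.append("not snake_case")
        if name not in UNITLESS_OK and not name.endswith(UNIT_SUFFIXES):
            problems.append("missing unit suffix")
        if problems:
            findings[name] = problems
    return findings
```

Feeding it the output of the curl command above gives a first-pass list of names to rename. It does not detect duplicate concepts under different names; that still needs a manual review.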

1.3 Audit log structure

Pick 5–10 representative log lines from each service. Ask:

  • Are logs JSON-structured, or raw strings?
  • Is there a consistent severity / level field with standard values (info, warn, error)?
  • Is there a service or service_id field in every line?
  • Do error logs include the error type and a stack trace?
  • Do logs contain PII (email, user ID, card number, IP addresses)?

# Sample recent logs from a Kubernetes pod
kubectl logs deployment/payment-api --tail=20 -n production
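
Part of this audit can be scripted. The sketch below checks a raw log line for JSON structure and for the severity and service fields asked about above; the accepted key names are assumptions drawn from those questions, so adjust them to your own schema.

```python
import json

# Field names are assumptions based on the audit questions; adjust to your schema.
SEVERITY_KEYS = ("severity", "level")
SERVICE_KEYS = ("service", "service_id")


def audit_log_line(line):
    """Return a list of problems found in one raw log line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not JSON"]
    if not isinstance(record, dict):
        return ["not a JSON object"]
    problems = []
    if not any(key in record for key in SEVERITY_KEYS):
        problems.append("no severity/level field")
    if not any(key in record for key in SERVICE_KEYS):
        problems.append("no service/service_id field")
    return problems
```

One way to use it is to loop over sys.stdin and pipe the kubectl logs output into the script. PII and stack-trace checks are deliberately left out because they need human judgment.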

1.4 Check trace context propagation

If your services make HTTP calls to each other, check whether they pass trace context:

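# Make a request from inside the pod and look for a W3C traceparent header in the exchange.
# Caveat: curl -v prints only the headers curl itself sends and receives. It does not show
# what the service's own HTTP client sends, so also inspect the request headers that
# fraud-service receives (for example in its access logs).
kubectl exec -n production deploy/payment-api -- \
  curl -v http://fraud-service/check 2>&1 | grep -i traceparent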

If traceparent is missing, the trace is broken. RCA cannot follow failure paths across service boundaries without this.
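
When a header is present, it is also worth confirming that it is well formed. This is a minimal sketch of a validator for version 00 of the W3C Trace Context traceparent format (version, trace ID, parent ID, flags); the function name is hypothetical.

```python
import re

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value):
    """Return (trace_id, parent_id) for a valid header value, else None."""
    match = TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, _flags = match.groups()
    if version == "ff":  # version ff is forbidden by the spec
        return None
    if trace_id == "0" * 32 or parent_id == "0" * 16:  # all-zero IDs are invalid
        return None
    return trace_id, parent_id
```

If two services in the same call chain report different trace IDs for one request, propagation is broken somewhere between them.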


Phase 2: Standardize Before You Connect

2.1 Define your service ID scheme

service_id is the primary key for all telemetry in InfraSage. Pick a convention now and enforce it everywhere. You cannot change it without resetting anomaly baselines.

Recommended patterns:

# Simple flat (good for small orgs)
payment-api
auth-service
api-gateway

# Team-prefixed (good for multi-team orgs)
payments/checkout-api
payments/fraud-engine
identity/auth-service
platform/api-gateway

Rules:

  • Use snake_case or kebab-case consistently, not both
  • Do not use pod names, container IDs, or deployment hashes (these change per deploy)
  • The value must match service.name in your OTEL resource attributes
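
These rules can be enforced in CI. The sketch below assumes you picked kebab-case with an optional team prefix, as in the recommended patterns; the pod-name check is a heuristic and can produce false positives for names that happen to end in an 8-10 character segment followed by a 5 character segment.

```python
import re

KEBAB = r"[a-z0-9]+(?:-[a-z0-9]+)*"
# Optional single "team/" prefix followed by the service name, both kebab-case.
SERVICE_ID = re.compile(rf"^(?:{KEBAB}/)?{KEBAB}$")
# Heuristic for Deployment-managed pod names: <name>-<replicaset hash>-<5 chars>.
POD_SUFFIX = re.compile(r"-[a-z0-9]{8,10}-[a-z0-9]{5}$")


def check_service_id(service_id):
    """Return a list of rule violations for a candidate service_id."""
    problems = []
    if not SERVICE_ID.match(service_id):
        problems.append("not kebab-case (optionally team-prefixed)")
    if POD_SUFFIX.search(service_id):
        problems.append("looks like a pod name")
    return problems
```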

2.2 Standardize metric names

Before connecting InfraSage, rename metrics to a consistent convention. Use snake_case with unit suffix:

# Rename these before connecting
http_req_duration → http_request_duration_ms
request_count → http_requests_total
mem_usage → memory_used_bytes
cpu_pct → cpu_usage_percent

For Prometheus exporters, metric renames can be done in the OTEL Collector's metricstransform processor (see Phase 3) without touching service code.

Standard metric names InfraSage's ML models recognize across services:

| Metric | Expected name | Unit |
|---|---|---|
| HTTP request latency | http_request_duration_ms | milliseconds |
| HTTP request rate | http_requests_total | count (rate) |
| HTTP error rate | http_errors_total | count (rate) |
| CPU usage | cpu_usage_percent | 0–100 |
| Memory usage | memory_used_bytes | bytes |
| GC pause | gc_pause_ms | milliseconds |
| DB query latency | db_query_duration_ms | milliseconds |
| Queue depth | queue_depth | count |
| Cache hit ratio | cache_hit_ratio | 0–1 |

Using these exact names enables cross-service ML baseline comparison out of the box.

2.3 Standardize log severity levels

InfraSage maps log severity to anomaly signals. Use these exact values:

| Level | When to use |
|---|---|
| debug | Developer-only; strip before production ingestion |
| info | Normal operational events |
| warn | Degraded but not failing |
| error | Operation failed, requires attention |
| critical or fatal | Service-level failure |

OTEL maps these to SeverityNumber automatically. For services emitting plain text logs, configure the OTEL Collector's severity parser (see Phase 3).
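
To see which spellings your services currently emit and how they would map, a small normalizer helps. This sketch uses the lowest value of each SeverityNumber range from the OpenTelemetry logs data model; the alias table is a hypothetical starting point, not something InfraSage defines.

```python
# Lowest SeverityNumber of each range in the OpenTelemetry logs data model.
SEVERITY_NUMBER = {
    "debug": 5, "info": 9, "warn": 13, "error": 17, "critical": 21, "fatal": 21,
}

# Hypothetical alias table; extend it with the spellings your services emit.
ALIASES = {"warning": "warn", "err": "error", "crit": "critical", "information": "info"}


def normalize_severity(raw):
    """Map a raw level string to (standard_level, severity_number)."""
    level = raw.strip().lower()
    level = ALIASES.get(level, level)
    if level not in SEVERITY_NUMBER:
        raise ValueError(f"unrecognized severity: {raw!r}")
    return level, SEVERITY_NUMBER[level]
```

Running it over a sample of log lines surfaces unrecognized values (the ValueError cases) that need a parser rule or a code change.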

2.4 Switch to structured logging

Services emitting unstructured logs lose most of their RCA value. Switch to JSON logging before onboarding.

Go — zerolog:

import "github.com/rs/zerolog/log"

// Before
log.Printf("payment failed: %v", err)

// After
log.Error().
	Str("service_id", "payment-api").
	Str("trace_id", span.SpanContext().TraceID().String()).
	Str("provider", "stripe").
	Float64("amount_cents", amount).
	Err(err).
	Msg("payment failed")

Python — structlog:

import structlog
log = structlog.get_logger()

# Before
logging.error(f"payment failed: {err}")

# After
log.error(
    "payment_failed",
    service_id="payment-api",
    trace_id=trace_id,
    provider="stripe",
    amount_cents=amount,
    error=str(err),
)

Node.js — pino:

const logger = require('pino')()

// Before
console.error(`payment failed: ${err.message}`)

// After
logger.error({
  service_id: 'payment-api',
  trace_id: span.spanContext().traceId,
  provider: 'stripe',
  amount_cents: amount,
  err
}, 'payment failed')

Minimum fields every log line should include:

{
  "timestamp": "2026-05-01T09:12:33Z",
  "severity": "error",
  "service_id": "payment-api",
  "message": "payment charge failed",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}

trace_id is what links a log line to the RCA causal graph. Without it, the log is useful context but can't be automatically correlated to the triggering trace.
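
For Python services that stay on the standard library logger rather than structlog, the same minimum fields can be produced with a custom formatter. This is a sketch under that assumption; the class name and the convention of passing trace_id through `extra` are illustrative.

```python
import json
import logging
from datetime import datetime, timezone

# Map stdlib level names onto the severity values used in this guide.
LEVELS = {"DEBUG": "debug", "INFO": "info", "WARNING": "warn",
          "ERROR": "error", "CRITICAL": "critical"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the minimum fields."""

    def __init__(self, service_id):
        super().__init__()
        self.service_id = service_id

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "severity": LEVELS.get(record.levelname, record.levelname.lower()),
            "service_id": self.service_id,
            "message": record.getMessage(),
            # Callers pass the ID with extra={"trace_id": ...}; None if absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)
```

Attach it with `handler.setFormatter(JsonFormatter("payment-api"))` and log with `logger.error("payment charge failed", extra={"trace_id": trace_id})`.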

2.5 Add trace context propagation

Every HTTP client in every service should forward W3C TraceContext headers. This is what enables RCA to follow a failure from the symptom service back to the root cause.

Go — using otelhttp:

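import (
	"net/http"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// Wrap your HTTP client
client := http.Client{
	Transport: otelhttp.NewTransport(http.DefaultTransport),
}

// Build the request with the incoming request's context (ctx, e.g. r.Context())
// so the injected traceparent continues the current trace. client.Get(url) uses
// a background context and would start a new, unlinked trace.
req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://fraud-service/check", nil)
if err != nil {
	return err
}
resp, err := client.Do(req)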

Python — using opentelemetry-instrumentation-requests:

from opentelemetry.instrumentation.requests import RequestsInstrumentor
RequestsInstrumentor().instrument()

# All requests.get/post calls now propagate trace context automatically
import requests
resp = requests.get("http://fraud-service/check")

Node.js — using opentelemetry-instrumentation-http:

const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http')
// Register in your SDK setup — all http/https calls propagate automatically

Verify propagation is working:

# Make a cross-service request and check the downstream service's logs
# Both should show the same trace_id
kubectl logs deploy/payment-api --tail=5 | jq '.trace_id'
kubectl logs deploy/fraud-service --tail=5 | jq '.trace_id'

2.6 Strip PII from logs before ingestion

Identify log fields that contain personal data and strip or hash them at the collector level — before they reach InfraSage's ClickHouse storage.

Common PII fields to strip:

| Field | Action |
|---|---|
| user_email, email | Drop or hash |
| user_id, customer_id | Hash (preserve for correlation without exposing identity) |
| ip_address, client_ip | Drop or truncate to /24 subnet |
| card_number, pan | Drop (never log) |
| auth_token, api_key, password | Drop (never log) |
| ssn, dob, aadhaar_number | Drop |

The OTEL Collector redaction processor handles this automatically (see Phase 3). InfraSage also provides server-side field exclusions as a second layer, but stripping at the collector is cleaner.
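
To make the three actions in the table concrete, here is a sketch of what drop, hash, and truncate mean for a single log record. It is an illustration in application code, not the collector's implementation; the salted SHA-256 scheme and the 16-character digest length are assumptions.

```python
import hashlib
import ipaddress

DROP_FIELDS = {"user_email", "email", "card_number", "pan", "auth_token",
               "api_key", "password", "ssn", "dob", "aadhaar_number"}
HASH_FIELDS = {"user_id", "customer_id"}
IP_FIELDS = {"ip_address", "client_ip"}


def scrub(record, salt):
    """Return a copy of a log record (dict) with PII dropped, hashed, or truncated."""
    clean = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            continue
        if key in HASH_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            clean[key] = digest[:16]  # shortened digest; stable for correlation
        elif key in IP_FIELDS:
            # Zero the host bits: 203.0.113.57 -> 203.0.113.0 (IPv4 only in this sketch)
            network = ipaddress.ip_network(f"{value}/24", strict=False)
            clean[key] = str(network.network_address)
        else:
            clean[key] = value
    return clean
```

The same input always hashes to the same output for a given salt, which is what keeps hashed user IDs usable for correlation across log lines.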


Phase 3: Set Up the OTEL Collector

The OTEL Collector is the recommended ingestion path. It decouples your services from InfraSage, handles buffering and batching, and lets you transform data without changing service code.

3.1 Deploy the collector

Deploy one collector per cluster as a DaemonSet (one pod per node) or as a Deployment (centralized):

# otel-collector-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.100.0
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - containerPort: 4317 # OTLP gRPC
            - containerPort: 4318 # OTLP HTTP
            - containerPort: 8888 # collector metrics
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "pods", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: observability

3.2 Full collector config for InfraSage

This config covers the standard pre-onboarding pipeline: receive from services, apply transforms, scrape Kubernetes, and forward to InfraSage.

# otel-collector-config.yaml
receivers:
  # Receive OTLP from your instrumented services
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Scrape Prometheus endpoints (for services using Prometheus client libraries)
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              target_label: __metrics_path__
              regex: (.+)
            - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name]
              action: replace
              target_label: service_id
              separator: /
            # Override with app label if present
            - source_labels: [__meta_kubernetes_pod_label_app]
              target_label: service_id
              regex: (.+)

  # Kubernetes cluster events and node metrics
  k8s_cluster:
    collection_interval: 30s
    node_conditions_to_report: [Ready, MemoryPressure, DiskPressure]
    allocatable_types_to_report: [cpu, memory]

processors:
  # Batch before sending; reduces HTTP overhead significantly
  batch:
    send_batch_size: 5000
    timeout: 10s

  # Add Kubernetes resource attributes to all telemetry
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.pod.name
        - k8s.node.name
        - k8s.container.name
      labels:
        - tag_name: service_id
          key: app
          from: pod
        - tag_name: version
          key: app.kubernetes.io/version
          from: pod

  # Rename metrics to InfraSage standard naming convention
  metricstransform:
    transforms:
      # Common renames; add yours here
      - include: http_request_duration_seconds
        action: update
        new_name: http_request_duration_ms
        operations:
          - action: experimental_scale_value
            experimental_scale: 1000 # convert seconds → ms
      - include: process_resident_memory_bytes
        action: update
        new_name: memory_used_bytes
      # Caution: process_cpu_seconds_total is a cumulative counter of CPU seconds.
      # Renaming it does not turn it into a 0-100 percentage; the percentage has
      # to be derived from the counter's rate of change.
      - include: process_cpu_seconds_total
        action: update
        new_name: cpu_usage_percent

  # Strip PII from log bodies and attributes
  redaction:
    allow_all_keys: true
    blocked_values:
      # Credit card numbers (PCI)
      - "\\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\\b"
      # Email addresses
      - "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b"
    summary: debug

  # Drop specific attribute keys that contain PII
  attributes/drop_pii:
    actions:
      - key: user.email
        action: delete
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: hash # keep for debugging but irreversible

  # Drop debug-level logs before sending to InfraSage
  filter/drop_debug_logs:
    logs:
      exclude:
        match_type: strict
        severity_texts: ["DEBUG", "TRACE"]

  # Resource detection: adds cloud provider, region, cluster info
  # Detector names follow the contrib processor's naming (k8snode, gcp);
  # verify them against the collector version you deploy.
  resourcedetection:
    detectors: [env, k8snode, eks, gcp, aks]
    timeout: 5s

exporters:
  # Forward everything to InfraSage
  otlphttp/infrasage:
    endpoint: http://infrasage-gateway.infrasage.svc.cluster.local:8080
    headers:
      X-API-Key: "${env:INFRASAGE_API_KEY}"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000

  # Keep a local Prometheus endpoint for Grafana (optional; run both in parallel)
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus, k8s_cluster]
      processors: [resourcedetection, k8sattributes, metricstransform, batch]
      exporters: [otlphttp/infrasage, prometheus]
    traces:
      receivers: [otlp]
      processors: [resourcedetection, k8sattributes, batch]
      exporters: [otlphttp/infrasage]
    logs:
      receivers: [otlp]
      processors: [resourcedetection, k8sattributes, redaction, attributes/drop_pii, filter/drop_debug_logs, batch]
      exporters: [otlphttp/infrasage]
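
One detail in the config above deserves a worked example: it renames process_cpu_seconds_total to cpu_usage_percent, but the source metric is a cumulative counter of CPU seconds, so a 0–100 value has to be derived from the change between two samples. The function below sketches that arithmetic; where you perform the conversion (recording rule, collector, or service code) is left open, and the function name is hypothetical.

```python
def cpu_usage_percent(prev_seconds, curr_seconds, interval_seconds, cores=1):
    """CPU usage over one scrape interval, as a percentage of `cores` CPUs."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_seconds - prev_seconds
    if delta < 0:
        # The counter went backwards, which means the process restarted;
        # the current value is then the usage since the restart.
        delta = curr_seconds
    return 100.0 * delta / (interval_seconds * cores)
```

For example, a counter that moves from 120.0 to 127.5 CPU seconds over a 15 second scrape interval used 7.5 of 15 available seconds, which is 50 percent of one core.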

3.3 Store the API key as a Secret

kubectl create secret generic otel-collector-secret \
--from-literal=INFRASAGE_API_KEY=isage_your_key_here \
-n observability

Reference it in the DaemonSet:

env:
  - name: INFRASAGE_API_KEY
    valueFrom:
      secretKeyRef:
        name: otel-collector-secret
        key: INFRASAGE_API_KEY

3.4 Validate the collector is running

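# Check collector health
# Assumes a Service named otel-collector exists and the health_check extension is
# enabled in the collector config; neither is defined in the manifests above.
# Each port-forward blocks, so run it in a separate terminal.
kubectl port-forward -n observability svc/otel-collector 13133:13133
curl http://localhost:13133/

# Check collector's own metrics
kubectl port-forward -n observability svc/otel-collector 8888:8888
curl http://localhost:8888/metrics | grep otelcol_exporter_sent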

You should see otelcol_exporter_sent_spans, otelcol_exporter_sent_metric_points, and otelcol_exporter_sent_log_records incrementing.
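
To check that the counters are incrementing rather than just present, capture the metrics output twice and compare totals. This is a minimal sketch of the parsing step, assuming the Prometheus text format without trailing timestamps; the function name is hypothetical.

```python
def exporter_sent_totals(metrics_text):
    """Sum otelcol_exporter_sent_* samples per metric name from Prometheus text."""
    totals = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line.startswith("otelcol_exporter_sent_"):
            continue  # skips comments (# HELP / # TYPE) and unrelated metrics
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the label set, keep the metric name
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals
```

Call it on two snapshots taken a minute apart; any total that has not grown points at a pipeline that is not exporting.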


Phase 4: Kubernetes Preparation

4.1 RBAC for InfraSage

InfraSage's AIops Engine needs read access to your cluster to gather RCA evidence (pod state, deployment history, events). Create this before deploying:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: infrasage-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "endpoints", "nodes", "events", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: infrasage-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: infrasage-reader
subjects:
  - kind: ServiceAccount
    name: infrasage-aiops
    namespace: infrasage

For runbook execution (optional), InfraSage also needs write access to specific resources:

# Only apply if you're enabling runbook automation
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: infrasage-operator
rules:
  - apiGroups: ["apps"]
    resources: ["deployments/scale", "deployments"]
    verbs: ["get", "patch", "update"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["delete"] # for pod restart runbook actions

4.2 Network policy

Ensure the OTEL Collector and your services can reach the InfraSage Ingestion Gateway:

# Test connectivity from a pod before deploying InfraSage
kubectl run network-test --image=curlimages/curl -it --rm -- \
curl -v http://infrasage-gateway.infrasage.svc.cluster.local:8080/health

4.3 Namespace and resource quotas

InfraSage components need headroom. Verify the infrasage namespace has sufficient quota:

kubectl describe resourcequota -n infrasage

Recommended minimums for the infrasage namespace:

  • CPU: 4 cores request, 8 cores limit
  • Memory: 8Gi request, 16Gi limit
  • PVC storage: 100Gi (for ClickHouse)

Phase 5: Pre-Flight Checks

Run these before deploying InfraSage.

5.1 Verify OTEL data is flowing

Start the collector pointing at a test endpoint (not InfraSage yet) and confirm telemetry is arriving:

# Temporary debug exporter: logs all data to stdout
exporters:
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200

service:
  pipelines:
    metrics:
      exporters: [debug]

Confirm you see your expected service_id values, metric names (in the right format), and no obvious PII in log bodies.

5.2 Cardinality check

High-cardinality metric series will cause ClickHouse storage growth and slow queries. Check your label cardinality before connecting:

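# In Prometheus: list the 20 metrics with the most series
# -G with --data-urlencode keeps curl from misreading the braces and spaces in the query
curl -s -G 'http://your-prometheus/api/v1/query' \
  --data-urlencode 'query=count by (__name__) ({__name__=~".+"})' | \
  jq '.data.result | sort_by(.value[1] | tonumber) | reverse | .[:20]'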

Any metric with >1,000 series warrants review. Common culprits: per-user-ID labels, request IDs, trace IDs embedded as labels. Fix these before ingesting.
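
The 1,000-series rule can be applied directly to the query response. This sketch assumes the response has already been parsed from JSON into a dict in the standard Prometheus instant-query shape; the function name is hypothetical.

```python
def high_cardinality(response, limit=1000):
    """Return [(metric_name, series_count)] above `limit`, largest first."""
    offenders = []
    for item in response["data"]["result"]:
        name = item["metric"].get("__name__", "<unnamed>")
        # Prometheus encodes sample values as strings: [timestamp, "value"].
        count = int(float(item["value"][1]))
        if count > limit:
            offenders.append((name, count))
    return sorted(offenders, key=lambda pair: pair[1], reverse=True)
```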

5.3 Document your service dependency graph

Write down (or generate from existing traces) the upstream/downstream dependencies for your top 10 services. InfraSage builds this automatically from traces, but having it written down lets you validate RCA results in the first week.

payment-api
  → fraud-service (sync, gRPC)
  → ledger-service (sync, HTTP)
  → notification-service (async, Kafka)

fraud-service
  → ml-scoring-service (sync, gRPC)
  → redis-cache (sync)
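
Keeping the graph in a machine-readable form makes it easier to validate RCA results later. This is a minimal sketch that encodes the example above as an adjacency map and answers two questions: what a service depends on, and which services sit upstream of a given dependency. The structure and function names are illustrative, not an InfraSage format.

```python
DEPENDENCIES = {
    "payment-api": ["fraud-service", "ledger-service", "notification-service"],
    "fraud-service": ["ml-scoring-service", "redis-cache"],
}


def downstream(service, graph):
    """Return every service reachable from `service`, directly or transitively."""
    seen = set()
    stack = list(graph.get(service, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def upstream(service, graph):
    """Return every service that depends on `service`, directly or transitively."""
    return {name for name in graph if service in downstream(name, graph)}
```

If an RCA result blames redis-cache for a payment-api symptom, `upstream("redis-cache", DEPENDENCIES)` confirms that such a failure path exists in your documented graph.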

5.4 Define service tiers

Decide which services are tier-1 (customer-facing, revenue-critical) before you configure anomaly detection sensitivity. Tier-1 services should use tighter thresholds (Z-score 2.5) and never have cooldowns suppressing alerts.

Tier 1 (tightest thresholds, immediate alerts):
payment-api, api-gateway, auth-service

Tier 2 (standard thresholds):
fraud-service, ledger-service, user-service

Tier 3 (relaxed thresholds, batch/async):
notification-service, reporting-service, data-pipeline
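
To build intuition for what a Z-score threshold of 2.5 means in practice, the sketch below flags a value when it lies more than the tier's threshold in standard deviations from the baseline mean. It is a textbook Z-score check, not InfraSage's detection logic, and only the tier-1 value comes from this guide; the tier-2 and tier-3 thresholds are placeholders.

```python
from statistics import mean, stdev

# Only the tier-1 value (2.5) comes from the text; tiers 2 and 3 are placeholders.
Z_THRESHOLDS = {1: 2.5, 2: 3.0, 3: 3.5}


def is_anomalous(baseline, value, tier):
    """Flag `value` if its Z-score against the baseline exceeds the tier threshold."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    sigma = stdev(baseline)
    if sigma == 0:
        # A flat baseline has no spread; any different value is a deviation.
        return value != baseline[0]
    z_score = abs(value - mean(baseline)) / sigma
    return z_score > Z_THRESHOLDS[tier]
```

The same latency sample can trip a tier-1 service and pass on a tier-2 service, which is why the tier assignment needs to be settled before tuning sensitivity.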

Readiness Checklist

Complete this before deploying InfraSage.

Data quality

  • All services have a stable service_id defined and agreed upon
  • Metric names follow snake_case with unit suffix
  • No duplicate metric names for the same measurement across services
  • All services emit structured (JSON) logs
  • Every log line includes severity, service_id, and trace_id fields
  • PII fields are identified and will be stripped at the collector

Tracing

  • OTEL SDK is initialized in every service (or instrumentation library is registered)
  • W3C traceparent header is forwarded on all outbound HTTP/gRPC calls
  • Span status is set to error on exceptions/failures (not just logged)
  • Trace sampling rate is configured (recommend 100% for low-volume, tail-based for high-volume)

OTEL Collector

  • Collector DaemonSet (or Deployment) is running in the cluster
  • Collector is receiving from all instrumented services (check otelcol_receiver_accepted_* metrics)
  • k8sattributes processor is adding k8s.namespace.name, k8s.deployment.name
  • Metric renames are applied for any non-standard names
  • PII redaction processor is active on the logs pipeline
  • Debug logs are filtered before export
  • batch processor is configured (reduces HTTP overhead ~10x)

Kubernetes

  • infrasage-reader ClusterRole and ClusterRoleBinding exist
  • Network policy allows collector → InfraSage gateway traffic
  • infrasage namespace has sufficient resource quota

Org

  • Service tier classification documented (tier 1/2/3)
  • Service dependency graph documented for top 10 services
  • On-call contacts identified per service
  • Anomaly detection sensitivity targets decided per tier

Next step: Deploy InfraSage →