Pre-Onboarding Checklist

This guide covers the work your org should complete before deploying InfraSage. Skipping this phase leads to noisy anomaly detection, poor RCA quality, and unnecessary rework after go-live.

Plan for 1–3 days depending on how much existing instrumentation you have.

Why This Matters

InfraSage's anomaly detection and RCA quality are direct functions of data quality. If your services use inconsistent metric names, missing trace context, or unstructured logs, InfraSage still ingests the data — but the Watchdog can't build accurate baselines, RCA can't build a causal graph, and the signal-to-noise ratio will be low.

Fixing these things before you connect InfraSage is much cheaper than fixing them after.

Phase 1: Audit What You Have

1.1 Inventory your services

Create a list of every service you intend to monitor. For each one, record:

Service	Language	Framework	Current instrumentation	Emits metrics?	Emits traces?	Emits structured logs?
payment-api	Go	gin	Prometheus `/metrics`	✅	❌	✅
auth-service	Python	FastAPI	none	❌	❌	❌
worker	Node.js	—	Datadog agent	✅	✅	✅

This table reveals your biggest gaps before you start any changes.

1.2 Check existing metric names for consistency

Inconsistent naming is the single most common issue. Run this across your Prometheus or existing metrics store:

# List all metric names currently being scraped
curl -s http://your-prometheus/api/v1/label/__name__/values | jq -r '.data[]' | sort

Look for:

Duplicate concepts under different names (request_duration_ms vs latency_ms vs http_req_duration)
Missing unit suffixes (request_count vs request_count_total, latency vs latency_ms)
Mixed naming conventions (camelCase, kebab-case, snake_case coexisting)

Fix these before ingesting into InfraSage. The ML Engine treats latency_ms and latency as different metrics — it won't correlate them.
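The checks above can be roughly automated against the list of names returned by Prometheus. The following is a minimal sketch, not an InfraSage tool: the suffix list, the `UNITLESS_OK` allowlist, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical unit suffixes for this sketch; extend to match your own convention.
UNIT_SUFFIXES = ("_ms", "_seconds", "_bytes", "_total", "_percent", "_ratio")
# Names that are acceptable without a unit suffix (assumption: adjust as needed).
UNITLESS_OK = {"queue_depth"}


def naming_style(name: str) -> str:
    """Classify a metric name as snake_case, kebab-case, camelCase, or mixed."""
    if re.fullmatch(r"[a-z][a-z0-9]*(_[a-z0-9]+)*", name):
        return "snake_case"
    if re.fullmatch(r"[a-z][a-z0-9]*(-[a-z0-9]+)+", name):
        return "kebab-case"
    if re.fullmatch(r"[a-z][a-z0-9]*([A-Z][a-z0-9]*)+", name):
        return "camelCase"
    return "mixed"


def audit_metric_names(names):
    """Return (name, problem) pairs for names that need attention."""
    problems = []
    for name in names:
        style = naming_style(name)
        if style != "snake_case":
            problems.append((name, f"not snake_case ({style})"))
        elif name not in UNITLESS_OK and not name.endswith(UNIT_SUFFIXES):
            problems.append((name, "missing unit suffix"))
    return problems
```

Feeding it the sorted output of the curl command gives a first-pass worklist. Duplicate concepts under different names still need a human review, since no pattern can tell that latency_ms and http_req_duration measure the same thing.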

1.3 Audit log structure

Pick 5–10 representative log lines from each service. Ask:

Are logs JSON-structured, or raw strings?
Is there a consistent severity / level field with standard values (info, warn, error)?
Is there a service or service_id field in every line?
Do error logs include the error type and a stack trace?
Do logs contain PII (email, user ID, card number, IP addresses)?

# Sample recent logs from a Kubernetes pod
kubectl logs deployment/payment-api --tail=20 -n production
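For more than a handful of lines, the first three questions can be checked mechanically. This is a sketch under assumptions: the field names `severity` and `service_id` follow this guide, while the function name and the exact set of accepted levels are illustrative.

```python
import json

REQUIRED_FIELDS = ("severity", "service_id")  # fields named in the questions above
STANDARD_LEVELS = {"debug", "info", "warn", "error", "critical", "fatal"}


def audit_log_line(line: str) -> list:
    """Return a list of problems found in one log line (empty means it passes)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not JSON"]
    if not isinstance(record, dict):
        return ["not a JSON object"]
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if field not in record]
    level = record.get("severity")
    if level is not None and str(level).lower() not in STANDARD_LEVELS:
        problems.append(f"non-standard severity: {level}")
    return problems
```

Stack traces and PII still need a manual look; the sketch only covers structure and field presence.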

1.4 Check trace context propagation

If your services make HTTP calls to each other, check whether they pass trace context:

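# Call the downstream service from inside the pod and grep the verbose output for traceparent.
# Note: curl does not add traceparent itself, so a match only appears if the
# downstream service returns the header. For a definitive check, compare the
# trace_id values in both services' logs (see 2.5).
kubectl exec -n production deploy/payment-api -- \
  curl -v http://fraud-service/check 2>&1 | grep -i traceparent

If traceparent is missing from requests arriving at the downstream service, the trace is broken at that hop. RCA cannot follow failure paths across service boundaries without it.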
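When you do capture a traceparent value (for example from a request log), it helps to confirm it is well-formed before blaming propagation. The sketch below parses the version 00 layout from the W3C Trace Context specification (`version-traceid-parentid-flags`, lowercase hex); the function name is illustrative.

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value: str):
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(value.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)
    return trace_id, parent_id, sampled
```

Two services on the same request path should report the same trace_id with different parent IDs.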

Phase 2: Standardize Before You Connect

2.1 Define your service ID scheme

service_id is the primary key for all telemetry in InfraSage. Pick a convention now and enforce it everywhere. You cannot change it without resetting anomaly baselines.

Recommended patterns:

# Simple flat (good for small orgs)
payment-api
auth-service
api-gateway

# Team-prefixed (good for multi-team orgs)
payments/checkout-api
payments/fraud-engine
identity/auth-service
platform/api-gateway

Rules:

Use snake_case or kebab-case consistently, not both
Do not use pod names, container IDs, or deployment hashes (these change per deploy)
The value must match service.name in your OTEL resource attributes
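These rules are easy to enforce in CI. A minimal sketch follows, assuming kebab-case as the chosen style and an optional team prefix separated by a slash; the pod-suffix pattern is a heuristic and the function name is hypothetical.

```python
import re

KEBAB = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")
SNAKE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Heuristic for a ReplicaSet/pod hash suffix such as "-7d9f8b6c5d-x2k4q".
POD_SUFFIX = re.compile(r"-[a-z0-9]{8,10}-[a-z0-9]{5}$")


def check_service_id(service_id: str, style: str = "kebab") -> list:
    """Return a list of rule violations for one service_id (empty means valid)."""
    problems = []
    if POD_SUFFIX.search(service_id):
        problems.append("looks like a pod name")
    pattern = KEBAB if style == "kebab" else SNAKE
    # A team prefix such as "payments/checkout-api" is checked segment by segment.
    for segment in service_id.split("/"):
        if not pattern.match(segment):
            problems.append(f"segment {segment!r} is not {style}-style")
    return problems
```

The pod-name heuristic can misfire on legitimate names that happen to end in an 8 to 10 character word followed by a 5 character word, so treat its output as a prompt for review rather than a hard failure.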

2.2 Standardize metric names

Before connecting InfraSage, rename metrics to a consistent convention. Use snake_case with unit suffix:

# Rename these before connecting
http_req_duration      → http_request_duration_ms
request_count          → http_requests_total
mem_usage              → memory_used_bytes
cpu_pct                → cpu_usage_percent

For Prometheus exporters, metric renames can be done in the OTEL Collector's metricstransform processor (see Phase 3) without touching service code.

Standard metric names InfraSage's ML models recognize across services:

Metric	Expected name	Unit
HTTP request latency	`http_request_duration_ms`	milliseconds
HTTP request rate	`http_requests_total`	count (rate)
HTTP error rate	`http_errors_total`	count (rate)
CPU usage	`cpu_usage_percent`	0–100
Memory usage	`memory_used_bytes`	bytes
GC pause	`gc_pause_ms`	milliseconds
DB query latency	`db_query_duration_ms`	milliseconds
Queue depth	`queue_depth`	count
Cache hit ratio	`cache_hit_ratio`	0–1

Using these exact names enables cross-service ML baseline comparison out of the box.
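Before editing any collector config, it can help to sort your current metric names into those that already match, those with a known rename, and those that still need a decision. The sketch below uses only the renames and standard names listed in this section; the function name and output shape are assumptions.

```python
# Renames taken from the list above; add your own entries.
RENAMES = {
    "http_req_duration": "http_request_duration_ms",
    "request_count": "http_requests_total",
    "mem_usage": "memory_used_bytes",
    "cpu_pct": "cpu_usage_percent",
}

# Standard names from the table above.
STANDARD_NAMES = {
    "http_request_duration_ms", "http_requests_total", "http_errors_total",
    "cpu_usage_percent", "memory_used_bytes", "gc_pause_ms",
    "db_query_duration_ms", "queue_depth", "cache_hit_ratio",
}


def plan_renames(current_names):
    """Split names into already-standard, renameable, and unmapped groups."""
    plan = {"ok": [], "rename": {}, "unmapped": []}
    for name in sorted(set(current_names)):
        if name in STANDARD_NAMES:
            plan["ok"].append(name)
        elif name in RENAMES:
            plan["rename"][name] = RENAMES[name]
        else:
            plan["unmapped"].append(name)
    return plan
```

A rename changes only the label on the data. If the unit also changes (seconds to milliseconds, fraction to percent), the values need scaling as well.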

2.3 Standardize log severity levels

InfraSage maps log severity to anomaly signals. Use these exact values:

Level	When to use
`debug`	Developer-only; strip before production ingestion
`info`	Normal operational events
`warn`	Degraded but not failing
`error`	Operation failed, requires attention
`critical` or `fatal`	Service-level failure

OTEL maps these to SeverityNumber automatically. For services emitting plain text logs, configure the OTEL Collector's severity parser (see Phase 3).
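For reference, the mapping from these level names to OpenTelemetry SeverityNumber values can be sketched as below. The numbers are the lowest value of each range in the OpenTelemetry logs data model; the alias table and function name are assumptions about what legacy services might emit.

```python
# Lowest SeverityNumber of each range in the OpenTelemetry logs data model.
SEVERITY_NUMBER = {
    "debug": 5,
    "info": 9,
    "warn": 13,
    "error": 17,
    "critical": 21,  # treated as FATAL
    "fatal": 21,
}

# Hypothetical aliases seen in application logs; adjust to what your services emit.
ALIASES = {"warning": "warn", "err": "error"}


def normalize_severity(raw: str):
    """Return (standard_level, severity_number) or raise ValueError."""
    level = raw.strip().lower()
    level = ALIASES.get(level, level)
    if level not in SEVERITY_NUMBER:
        raise ValueError(f"unrecognized severity: {raw!r}")
    return level, SEVERITY_NUMBER[level]
```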

2.4 Switch to structured logging

Services emitting unstructured logs lose most of their RCA value. Switch to JSON logging before onboarding.

Go — zerolog:

import "github.com/rs/zerolog/log"

// Before
log.Printf("payment failed: %v", err)

// After
log.Error().
    Str("service_id", "payment-api").
    Str("trace_id", span.SpanContext().TraceID().String()).
    Str("provider", "stripe").
    Float64("amount_cents", amount).
    Err(err).
    Msg("payment failed")

Python — structlog:

import structlog
log = structlog.get_logger()

# Before
logging.error(f"payment failed: {err}")

# After
log.error("payment_failed",
    service_id="payment-api",
    trace_id=trace_id,
    provider="stripe",
    amount_cents=amount,
    error=str(err))

Node.js — pino:

const logger = require('pino')()

// Before
console.error(`payment failed: ${err.message}`)

// After
logger.error({
  service_id: 'payment-api',
  trace_id: span.spanContext().traceId,
  provider: 'stripe',
  amount_cents: amount,
  err
}, 'payment failed')

Minimum fields every log line should include:

{
  "timestamp": "2026-05-01T09:12:33Z",
  "severity": "error",
  "service_id": "payment-api",
  "message": "payment charge failed",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}

trace_id is what links a log line to the RCA causal graph. Without it, the log is useful context but can't be automatically correlated to the triggering trace.
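If a Python service cannot adopt structlog yet, the standard library alone can emit the minimum fields. This is a sketch, not a recommended replacement: the class name is hypothetical, and it assumes callers pass trace_id through logging's `extra` argument.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the minimum fields."""

    # Map stdlib level names onto the severity values used in this guide.
    LEVELS = {"WARNING": "warn", "CRITICAL": "critical"}

    def __init__(self, service_id: str):
        super().__init__()
        self.service_id = service_id

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps({
            "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "severity": self.LEVELS.get(record.levelname, record.levelname.lower()),
            "service_id": self.service_id,
            "message": record.getMessage(),
            # trace_id is expected to arrive via logging's `extra` argument.
            "trace_id": getattr(record, "trace_id", None),
        })
```

Attach it with `handler.setFormatter(JsonFormatter("payment-api"))` and log with `logger.error("payment charge failed", extra={"trace_id": trace_id})`.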

2.5 Add trace context propagation

Every HTTP client in every service should forward W3C TraceContext headers. This is what enables RCA to follow a failure from the symptom service back to the root cause.

Go — using otelhttp:

import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"

// Wrap your HTTP client
client := http.Client{
    Transport: otelhttp.NewTransport(http.DefaultTransport),
}

// This now automatically injects traceparent into outbound requests
resp, err := client.Get("http://fraud-service/check")

Python — using opentelemetry-instrumentation-requests:

from opentelemetry.instrumentation.requests import RequestsInstrumentor
RequestsInstrumentor().instrument()

# All requests.get/post calls now propagate trace context automatically
import requests
resp = requests.get("http://fraud-service/check")

Node.js — using opentelemetry-instrumentation-http:

const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http')
// Register in your SDK setup — all http/https calls propagate automatically

Verify propagation is working:

# Make a cross-service request and check the downstream service's logs
# Both should show the same trace_id
kubectl logs deploy/payment-api --tail=5 | jq '.trace_id'
kubectl logs deploy/fraud-service --tail=5 | jq '.trace_id'

2.6 Strip PII from logs before ingestion

Identify log fields that contain personal data and strip or hash them at the collector level — before they reach InfraSage's ClickHouse storage.

Common PII fields to strip:

Field	Action
`user_email`, `email`	Drop or hash
`user_id`, `customer_id`	Hash (preserve for correlation without exposing identity)
`ip_address`, `client_ip`	Drop or truncate to /24 subnet
`card_number`, `pan`	Drop — never log
`auth_token`, `api_key`, `password`	Drop — never log
`ssn`, `dob`, `aadhaar_number`	Drop

The OTEL Collector redaction processor handles this automatically (see Phase 3). InfraSage also provides server-side field exclusions as a second layer, but stripping at the collector is cleaner.
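To make the three actions in the table concrete, here is what drop, hash, and truncate mean when applied to a single log record. This is an illustration in application code, not the collector's implementation; the salt handling, the 16-character digest length, and the function name are assumptions, and the truncation shown is for IPv4 only.

```python
import hashlib
import ipaddress

DROP = {"user_email", "email", "card_number", "pan", "auth_token", "api_key",
        "password", "ssn", "dob", "aadhaar_number"}
HASH = {"user_id", "customer_id"}
TRUNCATE_IP = {"ip_address", "client_ip"}


def scrub(record: dict, salt: str) -> dict:
    """Return a copy of `record` with PII fields dropped, hashed, or truncated."""
    clean = {}
    for key, value in record.items():
        if key in DROP:
            continue
        if key in HASH:
            # Salted SHA-256 keeps values correlatable without exposing them.
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            clean[key] = digest[:16]
        elif key in TRUNCATE_IP:
            # strict=False accepts a host address and returns its /24 network.
            clean[key] = str(ipaddress.ip_network(f"{value}/24", strict=False))
        else:
            clean[key] = value
    return clean
```

The same user_id always hashes to the same value for a given salt, which is what preserves correlation across log lines.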

Phase 3: Set Up the OTEL Collector

The OTEL Collector is the recommended ingestion path. It decouples your services from InfraSage, handles buffering and batching, and lets you transform data without changing service code.

3.1 Deploy the collector

Deploy one collector per cluster as a DaemonSet (one pod per node) or as a Deployment (centralized):

# otel-collector-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.100.0
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
            - containerPort: 8888   # collector metrics
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "pods", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: observability

3.2 Full collector config for InfraSage

This config covers the standard pre-onboarding pipeline: receive from services, apply transforms, scrape Kubernetes, and forward to InfraSage.

# otel-collector-config.yaml
receivers:
  # Receive OTLP from your instrumented services
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Scrape Prometheus endpoints (for services using Prometheus client libraries)
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              target_label: __metrics_path__
              regex: (.+)
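            # Fallback service_id: namespace/pod name. Pod names change on every
            # deploy, so this is only a stopgap for pods that lack an app label.
            - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name]
              action: replace
              target_label: service_id
              separator: /
            # Override with the app label if present
            - source_labels: [__meta_kubernetes_pod_label_app]
              target_label: service_id
              regex: (.+)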

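  # Cluster-level metrics (node conditions, allocatable resources).
  # This receiver is meant to run on a single collector instance; in a
  # DaemonSet every pod would report the same cluster data.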
  k8s_cluster:
    collection_interval: 30s
    node_conditions_to_report: [Ready, MemoryPressure, DiskPressure]
    allocatable_types_to_report: [cpu, memory]

processors:
  # Batch before sending — reduces HTTP overhead significantly
  batch:
    send_batch_size: 5000
    timeout: 10s

  # Add Kubernetes resource attributes to all telemetry
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.pod.name
        - k8s.node.name
        - k8s.container.name
      labels:
        - tag_name: service_id
          key: app
          from: pod
        - tag_name: version
          key: app.kubernetes.io/version
          from: pod

  # Rename metrics to InfraSage standard naming convention
  metricstransform:
    transforms:
      # Common renames — add yours here
      - include: http_request_duration_seconds
        action: update
        new_name: http_request_duration_ms
        operations:
          - action: experimental_scale_value
            experimental_scale: 1000    # convert seconds → ms
      - include: process_resident_memory_bytes
        action: update
        new_name: memory_used_bytes
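      # process_cpu_seconds_total is a cumulative counter of CPU seconds, not a
      # percentage. Renaming it to cpu_usage_percent would mislabel the data;
      # export a real 0-100 gauge from the service or node instead.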

  # Strip PII from log bodies and attributes
  redaction:
    allow_all_keys: true
    blocked_values:
      # Credit card numbers (PCI)
      - "\\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\\b"
      # Email addresses
      - "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b"
    summary: debug
  
  # Drop specific attribute keys that contain PII
  attributes/drop_pii:
    actions:
      - key: user.email
        action: delete
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: hash    # keep for debugging but irreversible

  # Drop debug-level logs before sending to InfraSage
  filter/drop_debug_logs:
    logs:
      exclude:
        match_type: strict
        # strict matching is case-sensitive, so list both spellings
        severity_texts: ["DEBUG", "debug", "TRACE", "trace"]

  # Resource detection — adds cloud provider, region, cluster info
  resourcedetection:
    detectors: [env, k8snode, eks, gcp, aks]
    timeout: 5s

exporters:
  # Forward everything to InfraSage
  otlphttp/infrasage:
    endpoint: http://infrasage-gateway.infrasage.svc.cluster.local:8080
    headers:
      X-API-Key: "${env:INFRASAGE_API_KEY}"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000

  # Keep a local Prometheus endpoint for Grafana (optional — run both in parallel)
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus, k8s_cluster]
      processors: [resourcedetection, k8sattributes, metricstransform, batch]
      exporters: [otlphttp/infrasage, prometheus]
    traces:
      receivers: [otlp]
      processors: [resourcedetection, k8sattributes, batch]
      exporters: [otlphttp/infrasage]
    logs:
      receivers: [otlp]
      processors: [resourcedetection, k8sattributes, redaction, attributes/drop_pii, filter/drop_debug_logs, batch]
      exporters: [otlphttp/infrasage]

3.3 Store the API key as a Secret

kubectl create secret generic otel-collector-secret \
  --from-literal=INFRASAGE_API_KEY=isage_your_key_here \
  -n observability

Reference it in the DaemonSet:

env:
  - name: INFRASAGE_API_KEY
    valueFrom:
      secretKeyRef:
        name: otel-collector-secret
        key: INFRASAGE_API_KEY

3.4 Validate the collector is running

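# Check collector health. This requires the health_check extension to be enabled
# in the collector config (13133 is its default port). port-forward blocks, so
# run it in a separate terminal.
kubectl port-forward -n observability daemonset/otel-collector 13133:13133
curl http://localhost:13133/

# Check the collector's own metrics
kubectl port-forward -n observability daemonset/otel-collector 8888:8888
curl -s http://localhost:8888/metrics | grep otelcol_exporter_sent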

You should see otelcol_exporter_sent_spans, otelcol_exporter_sent_metric_points, and otelcol_exporter_sent_log_records incrementing.

Phase 4: Kubernetes Preparation

4.1 RBAC for InfraSage

InfraSage's AIops Engine needs read access to your cluster to gather RCA evidence (pod state, deployment history, events). Create this before deploying:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: infrasage-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "endpoints", "nodes", "events", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: infrasage-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: infrasage-reader
subjects:
  - kind: ServiceAccount
    name: infrasage-aiops
    namespace: infrasage

For runbook execution (optional), InfraSage also needs write access to specific resources:

# Only apply if you're enabling runbook automation
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: infrasage-operator
rules:
  - apiGroups: ["apps"]
    resources: ["deployments/scale", "deployments"]
    verbs: ["get", "patch", "update"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["delete"]    # for pod restart runbook actions

4.2 Network policy

Ensure the OTEL Collector and your services can reach the InfraSage Ingestion Gateway:

# Test connectivity from a pod before deploying InfraSage
kubectl run network-test --image=curlimages/curl -it --rm -- \
  curl -v http://infrasage-gateway.infrasage.svc.cluster.local:8080/health

4.3 Namespace and resource quotas

InfraSage components need headroom. Verify the infrasage namespace has sufficient quota:

kubectl describe resourcequota -n infrasage

Recommended minimums for the infrasage namespace:

CPU: 4 cores request, 8 cores limit
Memory: 8Gi request, 16Gi limit
PVC storage: 100Gi (for ClickHouse)

Phase 5: Pre-Flight Checks

Run these before deploying InfraSage.

5.1 Verify OTEL data is flowing

Start the collector pointing at a test endpoint (not InfraSage yet) and confirm telemetry is arriving:

# Temporary debug exporter — logs all data to stdout
exporters:
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200

service:
  pipelines:
    metrics:
      exporters: [debug]

Confirm you see your expected service_id values, metric names (in the right format), and no obvious PII in log bodies.

5.2 Cardinality check

High-cardinality metric series will cause ClickHouse storage growth and slow queries. Check your label cardinality before connecting:

# In Prometheus: list the 20 metrics with the most series
curl -s -G 'http://your-prometheus/api/v1/query' \
  --data-urlencode 'query=count by (__name__) ({__name__=~".+"})' | \
  jq '.data.result | sort_by(.value[1] | tonumber) | reverse | .[:20]'

Any metric with >1,000 series warrants review. Common culprits: per-user-ID labels, request IDs, trace IDs embedded as labels. Fix these before ingesting.
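The same review can be scripted against the JSON returned by the query above. The sketch assumes the standard Prometheus instant-query response shape (`data.result[].metric` and `data.result[].value`); the function name and the default limit are illustrative.

```python
def high_cardinality(response: dict, limit: int = 1000) -> list:
    """Return (metric_name, series_count) pairs above `limit`, largest first."""
    counts = []
    for item in response["data"]["result"]:
        name = item["metric"].get("__name__", "<unnamed>")
        # Each value is [timestamp, "count-as-string"].
        series = int(float(item["value"][1]))
        if series > limit:
            counts.append((name, series))
    return sorted(counts, key=lambda pair: pair[1], reverse=True)
```

Load the response with `json.load` from the curl output and pass the resulting dict in.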

5.3 Document your service dependency graph

Write down (or generate from existing traces) the upstream/downstream dependencies for your top 10 services. InfraSage builds this automatically from traces, but having it written down lets you validate RCA results in the first week.

payment-api
  → fraud-service (sync, gRPC)
  → ledger-service (sync, HTTP)
  → notification-service (async, Kafka)

fraud-service
  → ml-scoring-service (sync, gRPC)
  → redis-cache (sync)
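Keeping the written-down graph in a machine-readable form makes it easier to compare against RCA output later. The sketch below encodes the example edges above as a plain dictionary and walks them; the function names are hypothetical and nothing here reflects how InfraSage stores its own graph.

```python
from collections import deque

# Edges copied from the example above: caller -> services it calls.
DEPENDS_ON = {
    "payment-api": ["fraud-service", "ledger-service", "notification-service"],
    "fraud-service": ["ml-scoring-service", "redis-cache"],
}


def downstream(service: str, graph: dict) -> list:
    """Breadth-first list of everything `service` depends on, directly or not."""
    seen, order, queue = {service}, [], deque([service])
    while queue:
        for dep in graph.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


def upstream(service: str, graph: dict) -> list:
    """Callers that could show symptoms if `service` fails."""
    return sorted(s for s in graph if service in downstream(s, graph))
```

If RCA blames redis-cache for a payment-api incident, `upstream("redis-cache", DEPENDS_ON)` confirms that payment-api can actually reach it through fraud-service.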

5.4 Define service tiers

Decide which services are tier-1 (customer-facing, revenue-critical) before you configure anomaly detection sensitivity. Tier-1 services should use tighter thresholds (Z-score 2.5) and never have cooldowns suppressing alerts.

Tier 1 (tightest thresholds, immediate alerts):
  payment-api, api-gateway, auth-service

Tier 2 (standard thresholds):
  fraud-service, ledger-service, user-service

Tier 3 (relaxed thresholds, batch/async):
  notification-service, reporting-service, data-pipeline
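To see what a tighter threshold means in practice, the sketch below applies a plain Z-score test per tier. Only the tier 1 value of 2.5 comes from this guide; the tier 2 and tier 3 thresholds are placeholders, and InfraSage's actual detection logic may differ from this simple calculation.

```python
from statistics import mean, stdev

# 2.5 for tier 1 comes from the text above; tier 2 and 3 values are placeholders.
Z_THRESHOLD = {1: 2.5, 2: 3.0, 3: 4.0}

SERVICE_TIER = {
    "payment-api": 1, "api-gateway": 1, "auth-service": 1,
    "fraud-service": 2, "ledger-service": 2, "user-service": 2,
    "notification-service": 3, "reporting-service": 3, "data-pipeline": 3,
}


def z_score(baseline, value):
    """Distance of `value` from the baseline mean, in sample standard deviations."""
    sigma = stdev(baseline)
    return 0.0 if sigma == 0 else (value - mean(baseline)) / sigma


def is_anomalous(service, baseline, value):
    tier = SERVICE_TIER.get(service, 2)  # unknown services default to tier 2
    return abs(z_score(baseline, value)) >= Z_THRESHOLD[tier]
```

With a latency baseline of 100, 102, 98, 101, 99 ms, a reading of 104.5 ms has a Z-score of about 2.85. That trips the tier 1 threshold but not the placeholder tier 2 one, which is the practical effect of classifying a service as tier 1.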

Readiness Checklist

Complete this before deploying InfraSage.

Data quality

All services have a stable service_id defined and agreed upon
Metric names follow snake_case with unit suffix
No duplicate metric names for the same measurement across services
All services emit structured (JSON) logs
Every log line includes severity, service_id, and trace_id fields
PII fields are identified and will be stripped at the collector

Tracing

OTEL SDK is initialized in every service (or instrumentation library is registered)
W3C traceparent header is forwarded on all outbound HTTP/gRPC calls
Span status is set to error on exceptions/failures (not just logged)
Trace sampling rate is configured (recommend 100% for low-volume, tail-based for high-volume)

OTEL Collector

Collector DaemonSet (or Deployment) is running in the cluster
Collector is receiving from all instrumented services (check otelcol_receiver_accepted_* metrics)
k8sattributes processor is adding k8s.namespace.name, k8s.deployment.name
Metric renames are applied for any non-standard names
PII redaction processor is active on the logs pipeline
Debug logs are filtered before export
batch processor is configured (significantly reduces HTTP overhead)

Kubernetes

infrasage-reader ClusterRole and ClusterRoleBinding exist
Network policy allows collector → InfraSage gateway traffic
infrasage namespace has sufficient resource quota

Org

Service tier classification documented (tier 1/2/3)
Service dependency graph documented for top 10 services
On-call contacts identified per service
Anomaly detection sensitivity targets decided per tier

Next step: Deploy InfraSage →
