
Best Practices

Guidance on naming, cardinality, batching, and organizing InfraSage for teams operating at scale.


Metric Naming

Consistent metric names make dashboards reusable and anomaly detection more accurate (the ML Engine learns across services with the same metric names).

Convention

Use snake_case. Follow the pattern: <subject>_<measurement>_<unit>.

payment_latency_ms ✅
payment-latency-ms ❌ (hyphens)
PaymentLatencyMs ❌ (camelCase)
latency ❌ (too generic — which latency?)
http_request_duration_ms ✅
http_req_dur ❌ (abbreviated — hard to search)
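The format half of this convention can be checked mechanically, for example in a CI lint step. The following is a minimal sketch, not an InfraSage feature: the function name `is_valid_metric_name` and the "at least two underscore-separated parts" rule are assumptions made for illustration. It cannot detect abbreviations such as `http_req_dur`, which still need human review.

```python
import re

# Lowercase snake_case with at least two parts (e.g. subject + unit).
# The two-part minimum is an assumption made for this sketch.
_METRIC_NAME = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)+")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the name is snake_case with at least two parts."""
    return _METRIC_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("payment_latency_ms", "payment-latency-ms",
                      "PaymentLatencyMs", "latency"):
        print(candidate, is_valid_metric_name(candidate))
```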

Units

Always include the unit as a suffix. Never embed units in tags.

Measurement                 Suffix
Milliseconds                _ms
Seconds                     _seconds
Bytes                       _bytes
Percentage (0–100)          _percent
Ratio (0–1)                 _ratio
Count (absolute)            _total
Count (rate, per-second)    _per_second

memory_used_bytes ✅
memory_used_mb ❌ (ambiguous; convert to bytes)
success_rate ❌ (ambiguous; use success_ratio or success_percent)
success_ratio ✅ (0.0–1.0)
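A small helper can enforce the suffix list and do the byte conversion at the point of emission. This is a sketch under assumptions: `has_known_unit_suffix` and `to_bytes` are hypothetical names, and the unit factors are the standard decimal (MB) and binary (MiB) definitions. The two "MB" readings differ by almost 5%, which is why the raw `_bytes` value is the unambiguous choice.

```python
# Suffixes taken from the table above.
KNOWN_SUFFIXES = ("_ms", "_seconds", "_bytes", "_percent",
                  "_ratio", "_total", "_per_second")

# Decimal and binary multiples; callers must say which one they mean.
_BYTE_FACTORS = {
    "B": 1,
    "KB": 1000, "MB": 1000**2, "GB": 1000**3,
    "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3,
}


def has_known_unit_suffix(name: str) -> bool:
    """Return True if the metric name ends in one of the listed suffixes."""
    return name.endswith(KNOWN_SUFFIXES)


def to_bytes(value: float, unit: str) -> int:
    """Convert a size to bytes; raises KeyError for an unknown unit."""
    return int(value * _BYTE_FACTORS[unit])


if __name__ == "__main__":
    print(has_known_unit_suffix("memory_used_mb"))  # not in the suffix list
    print(to_bytes(512, "MiB"), to_bytes(512, "MB"))
```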

Avoid Metric Proliferation

Don't create a separate metric for each HTTP status code:

# Bad: explodes cardinality
http_200_count
http_404_count
http_500_count

# Good: use a tag
http_request_count {status_code: "200"}
http_request_count {status_code: "500"}
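With a single metric and a `status_code` tag, per-status totals become a group-by rather than a list of metric names to maintain. The sketch below illustrates the idea on plain dictionaries; the event shape (`metric`, `tags`, `value`) is an assumption for illustration and not the InfraSage wire format.

```python
from collections import Counter


def count_by_tag(events, metric, tag_key):
    """Sum event values for one metric, grouped by the value of one tag."""
    totals = Counter()
    for event in events:
        if event["metric"] == metric:
            totals[event["tags"].get(tag_key, "unknown")] += event["value"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical events: one metric name, status carried in a tag.
    sample = [
        {"metric": "http_request_count", "tags": {"status_code": "200"}, "value": 1},
        {"metric": "http_request_count", "tags": {"status_code": "500"}, "value": 1},
        {"metric": "http_request_count", "tags": {"status_code": "200"}, "value": 1},
    ]
    print(count_by_tag(sample, "http_request_count", "status_code"))
```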

Tag Cardinality

High-cardinality tags are the most common source of unexpected storage growth and slow queries.

Safe (Low-Cardinality) Tags

Tags with a small, bounded set of values:

region: "eu-west-1", "ap-south-1", "us-east-1"
environment: "prod", "staging", "dev"
version: "v2.3.1", "v2.3.2" (bounded by release cadence)
status_code: "200", "404", "500"
provider: "stripe", "razorpay"

Unsafe (High-Cardinality) Tags

Tags with unbounded or large value sets:

user_id: "u_abc123" ❌ millions of distinct values
request_id: "req_xyz" ❌ unique per request
trace_id: "..." ❌ use traces instead of embedding in metrics
session_id: "..." ❌

Rule of Thumb

A tag is safe if it has fewer than ~1,000 distinct values. If a tag value is unique per event, it belongs in the log body or trace payload, not a metric tag.
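The rule of thumb can be applied to a sample of events before they are shipped. The following is a minimal sketch, assuming each event is a dictionary with a `tags` mapping (a hypothetical shape); `high_cardinality_tags` is an illustrative name, and the default limit of 1,000 simply mirrors the approximate threshold above.

```python
from collections import defaultdict


def high_cardinality_tags(events, limit=1000):
    """Return {tag_key: distinct_count} for tags with `limit` or more values."""
    seen = defaultdict(set)
    for event in events:
        for key, value in event.get("tags", {}).items():
            seen[key].add(value)
    return {key: len(values) for key, values in seen.items()
            if len(values) >= limit}


if __name__ == "__main__":
    regions = ["eu-west-1", "ap-south-1", "us-east-1"]
    sample = [{"tags": {"user_id": f"u_{i}", "region": regions[i % 3]}}
              for i in range(1500)]
    # user_id is flagged; region (3 values) is not.
    print(high_cardinality_tags(sample))
```

Running a check like this against a few minutes of real traffic is usually enough to surface an unbounded tag before it reaches storage.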


Service ID Conventions

service_id is the primary dimension for all telemetry. Use a consistent scheme across your org.

For a simple organization:

payment-service
auth-service
api-gateway
worker-queue

For large orgs with multiple teams:

payments/checkout-api
payments/fraud-engine
identity/auth-service
platform/api-gateway

The /-delimited prefix enables team-scoped queries and RBAC:

-- All payments team anomalies
SELECT * FROM anomalies WHERE service_id LIKE 'payments/%'

Ephemeral vs. Stable Service IDs

Use stable service IDs, not pod names or container IDs. The ML Engine builds baselines per service_id — if IDs change on every deploy, baselines reset.

# Bad: pod name as service_id
payment-service-7f9d4b8c6-xk2rp

# Good: stable logical name
payment-service
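The most reliable fix is to set the service ID explicitly in deployment configuration. Where only the pod name is available, a heuristic can strip the generated suffix. The sketch below assumes pods created by a Kubernetes Deployment, named `<deployment>-<template-hash>-<5 chars>`, and assumes the restricted character set that recent Kubernetes versions use for generated suffixes; `stable_service_id` is a hypothetical helper, and names that do not match are returned unchanged.

```python
import re

# Assumption: generated suffixes draw from Kubernetes' "safe" alphabet
# (no vowels, no 0/1/3), which keeps ordinary words from matching.
_SAFE = "bcdfghjklmnpqrstvwxz2456789"
_POD_SUFFIX = re.compile(rf"-[{_SAFE}]{{6,10}}-[{_SAFE}]{{5}}$")


def stable_service_id(pod_name: str) -> str:
    """Strip a Deployment-style '-<hash>-<suffix>' tail, if present."""
    return _POD_SUFFIX.sub("", pod_name)


if __name__ == "__main__":
    print(stable_service_id("payment-service-7f9d4b8c6-xk2rp"))
    print(stable_service_id("payment-service"))
```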

Batch Sizing

Sending events individually (one HTTP call per event) has high overhead. Use the batch endpoint.

Scenario                                    Batch size        Flush interval
High-throughput service (>1K req/s)         500–1000 events   1–5 seconds
Medium-throughput service                   100–500 events    5–15 seconds
Low-throughput service / background jobs    50–100 events     15–60 seconds
One-shot scripts                            All events        On exit

Don't Over-Buffer

Large buffers increase data loss on crash. For financial services, use smaller batches with more frequent flushes:

# For payment services: small batches, short flush
sender = BufferedSender(client, max_size=100, flush_interval_ms=5000)

Always Flush on Shutdown

# Python
import atexit
atexit.register(sender.flush)

// Go
defer sender.Flush()
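To make the size and interval trade-off concrete, here is a minimal sketch of a buffering sender. It is not the SDK's `BufferedSender`: the class name `SketchBufferedSender` and the `send_batch` callable are assumptions, and the interval is only checked when an event is added (there is no background timer), so an idle process relies on the shutdown flush.

```python
import atexit
import threading
import time


class SketchBufferedSender:
    """Illustrative buffer: flushes on size, on elapsed time, or on demand."""

    def __init__(self, send_batch, max_size=100, flush_interval_ms=5000):
        self._send_batch = send_batch          # callable taking a list of events
        self._max_size = max_size
        self._interval = flush_interval_ms / 1000.0
        self._buffer = []
        self._lock = threading.Lock()
        self._last_flush = time.monotonic()

    def add(self, event):
        with self._lock:
            self._buffer.append(event)
            due = (len(self._buffer) >= self._max_size
                   or time.monotonic() - self._last_flush >= self._interval)
        if due:
            self.flush()

    def flush(self):
        # Swap the buffer under the lock, then send outside it.
        with self._lock:
            batch, self._buffer = self._buffer, []
            self._last_flush = time.monotonic()
        if batch:
            self._send_batch(batch)


if __name__ == "__main__":
    sent = []
    sender = SketchBufferedSender(sent.append, max_size=3)
    atexit.register(sender.flush)  # drain whatever is left at exit
    for i in range(4):
        sender.add({"n": i})
    print(len(sent), "batch sent;", "1 event still buffered until exit")
```

At most `max_size` events are in memory at any time, which is the quantity at risk if the process is killed before the shutdown hook runs.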

Structuring Logs for RCA

RCA uses log lines as evidence. Structured logs with consistent fields give significantly better RCA quality.

Include in Every Log Line

{
"timestamp": "2026-04-30T14:32:11Z",
"service_id": "payment-service",
"severity": "error",
"message": "Payment charge failed",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"error": "upstream timeout after 3000ms",
"provider": "stripe",
"amount_cents": 4999,
"currency": "EUR"
}
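One way to produce lines in this shape from Python is a custom formatter for the standard `logging` module. The sketch below is an illustration, not an InfraSage component: `JsonFormatter` and the `fields` key passed through `extra` are names chosen for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the common fields."""

    def __init__(self, service_id):
        super().__init__()
        self.service_id = service_id

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "service_id": self.service_id,
            "severity": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Per-call structured fields arrive via logging's `extra` argument.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("payment-service"))
    logger = logging.getLogger("payments")
    logger.addHandler(handler)
    logger.error("Payment charge failed",
                 extra={"fields": {"provider": "stripe", "currency": "EUR"}})
```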

Don't Include in Logs

{
"user_email": "[email protected]", ❌ PII — strip before ingestion
"card_number": "4111...", ❌ PCI data — never log
"raw_request_body": "...", ❌ too large, often contains PII
}

Configure field exclusions in InfraSage:

LOG_FIELD_EXCLUSIONS=user_email,card_number,password,token,secret
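Sensitive fields can also be dropped in the application before a log line leaves the process, as a second line of defense. This is a client-side sketch only and does not describe how InfraSage applies `LOG_FIELD_EXCLUSIONS` internally; `parse_exclusions` and `strip_excluded_fields` are hypothetical helpers, and only top-level keys are checked.

```python
def parse_exclusions(raw: str) -> frozenset:
    """Parse a comma-separated field list such as the setting above."""
    return frozenset(name.strip() for name in raw.split(",") if name.strip())


def strip_excluded_fields(entry: dict, excluded) -> dict:
    """Return a copy of the log entry without the excluded top-level keys."""
    return {key: value for key, value in entry.items() if key not in excluded}


if __name__ == "__main__":
    excluded = parse_exclusions("user_email,card_number,password,token,secret")
    entry = {"message": "Payment charge failed", "card_number": "4111..."}
    print(strip_excluded_fields(entry, excluded))
```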

Organizing Tenants

One common layout is one tenant per team: each team owns its API keys, RBAC, and alert configuration independently, and the platform team holds the Super-Admin role.

tenant: payments-team → owns payment-service, fraud-service
tenant: identity-team → owns auth-service, session-service
tenant: platform-team → owns api-gateway, k8s infrastructure

One Tenant Per Environment

Alternatively, isolate by environment:

tenant: prod → all production services
tenant: staging → all staging services

Choose team-based tenants if you want independent billing/limits per team. Choose environment-based tenants if you want a single unified view per environment.

Don't Mix Production and Non-Production in One Tenant

Anomaly baselines are per (tenant, service_id). If staging services share a tenant with prod, noisy staging deployments contaminate prod baselines.


Multi-Service Correlation

To get the most out of RCA, ensure dependencies between services are visible to InfraSage via traces. A trace links parent and child spans across services, enabling the causal graph to follow failure paths.

Minimum trace requirements for good RCA:

  • Every outbound HTTP call creates a child span carrying the service_id of the downstream service
  • Span status is set to error on failure (not just a log line)
  • trace_id is propagated via traceparent header (W3C Trace Context)

See OpenTelemetry Integration for setup.