Best Practices
Guidance on naming, cardinality, batching, and organizing InfraSage for teams operating at scale.
Metric Naming
Consistent metric names make dashboards reusable and anomaly detection more accurate (the ML Engine learns across services with the same metric names).
Convention
Use snake_case. Follow the pattern: <subject>_<measurement>_<unit>.
payment_latency_ms ✅
payment-latency-ms ❌ (hyphens)
PaymentLatencyMs ❌ (camelCase)
latency ❌ (too generic — which latency?)
http_request_duration_ms ✅
http_req_dur ❌ (abbreviated — hard to search)
Units
Always include the unit as a suffix. Never embed units in tags.
| Measurement | Suffix |
|---|---|
| Milliseconds | _ms |
| Seconds | _seconds |
| Bytes | _bytes |
| Percentage (0–100) | _percent |
| Ratio (0–1) | _ratio |
| Count (absolute) | _total |
| Count (rate, per-second) | _per_second |
memory_used_bytes ✅
memory_used_mb ❌ (ambiguous; convert to bytes)
success_rate ❌ (ambiguous; use success_ratio or success_percent)
success_ratio ✅ (0.0–1.0)
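The naming and unit rules above are mechanical enough to check in CI. The following is a minimal sketch, not part of InfraSage: the function name `check_metric_name` is hypothetical, and the suffix list is copied from the table above.

```python
import re

# Unit suffixes copied from the table above.
UNIT_SUFFIXES = ("_ms", "_seconds", "_bytes", "_percent", "_ratio", "_total", "_per_second")

# snake_case: lowercase words of letters/digits joined by single underscores,
# with at least two parts so that a bare "latency" is rejected.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")


def check_metric_name(name: str) -> list[str]:
    """Return a list of convention violations (empty if the name is fine)."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("not snake_case with at least two parts")
    if not name.endswith(UNIT_SUFFIXES):
        problems.append("missing a recognised unit suffix")
    return problems
```

A check like this cannot tell whether a name is descriptive enough (it accepts `x_ms`), so it complements review rather than replacing it.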
Avoid Metric Proliferation
Don't create a separate metric for each HTTP status code:
# Bad: a separate metric name per status code
http_200_count
http_404_count
http_500_count
# Good: use a tag
http_requests_total {status_code: "200"}
http_requests_total {status_code: "500"}
Tag Cardinality
High-cardinality tags are the most common source of unexpected storage growth and slow queries.
Safe (Low-Cardinality) Tags
Tags with a small, bounded set of values:
region: "eu-west-1", "ap-south-1", "us-east-1"
environment: "prod", "staging", "dev"
version: "v2.3.1", "v2.3.2" (bounded by release cadence)
status_code: "200", "404", "500"
provider: "stripe", "razorpay"
Unsafe (High-Cardinality) Tags
Tags with unbounded or large value sets:
user_id: "u_abc123" ❌ millions of distinct values
request_id: "req_xyz" ❌ unique per request
trace_id: "..." ❌ use traces instead of embedding in metrics
session_id: "..." ❌
Rule of Thumb
A tag is safe if it has fewer than ~1,000 distinct values. If a tag value is unique per event, it belongs in the log body or trace payload, not a metric tag.
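One way to apply this rule of thumb before shipping a new tag is to count distinct values per tag key over a sample of events. The sketch below assumes each event is a dict with a `tags` dict; that shape and both function names are illustrative, not an InfraSage API.

```python
from collections import defaultdict

CARDINALITY_LIMIT = 1000  # the ~1,000 rule of thumb from the text


def tag_cardinality(events):
    """Count distinct values per tag key across an iterable of events.

    Each event is assumed to be a dict with a "tags" dict (hypothetical shape).
    """
    seen = defaultdict(set)
    for event in events:
        for key, value in event.get("tags", {}).items():
            seen[key].add(value)
    return {key: len(values) for key, values in seen.items()}


def risky_tags(events, limit=CARDINALITY_LIMIT):
    """Return tag keys whose distinct-value count reaches the limit."""
    return sorted(key for key, count in tag_cardinality(events).items() if count >= limit)
```

A short sample underestimates cardinality for tags that grow over time (such as `version`), so the result is a lower bound.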
Service ID Conventions
service_id is the primary dimension for all telemetry. Use a consistent scheme across your org.
Recommended Patterns
For a simple organization:
payment-service
auth-service
api-gateway
worker-queue
For large orgs with multiple teams:
payments/checkout-api
payments/fraud-engine
identity/auth-service
platform/api-gateway
The /-delimited prefix enables team-scoped queries and RBAC:
-- All payments team anomalies
SELECT * FROM anomalies WHERE service_id LIKE 'payments/%'
Ephemeral vs. Stable Service IDs
Use stable service IDs, not pod names or container IDs. The ML Engine builds baselines per service_id — if IDs change on every deploy, baselines reset.
# Bad: pod name as service_id
payment-service-7f9d4b8c6-xk2rp
# Good: stable logical name
payment-service
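A simple way to keep the ID stable is to set it once in deployment configuration and never derive it from the host. The sketch below assumes a hypothetical `SERVICE_ID` environment variable; it deliberately ignores `HOSTNAME`, which inside a Kubernetes pod is typically the pod name.

```python
import os


def resolve_service_id(env=None) -> str:
    """Read the logical service name from configuration, never from the host."""
    env = os.environ if env is None else env
    # SERVICE_ID is a hypothetical variable name chosen for this example.
    service_id = env.get("SERVICE_ID", "").strip()
    if not service_id:
        raise RuntimeError("SERVICE_ID is not set; refusing to fall back to the hostname")
    return service_id
```

Failing at startup is a design choice here: a missing value is caught on deploy instead of silently creating a new baseline per pod.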
Batch Sizing
Sending events individually (one HTTP call per event) has high overhead. Use the batch endpoint.
Recommended Batch Configuration
| Scenario | Batch size | Flush interval |
|---|---|---|
| High-throughput service (>1K req/s) | 500–1000 events | 1–5 seconds |
| Medium-throughput service | 100–500 events | 5–15 seconds |
| Low-throughput service / background jobs | 50–100 events | 15–60 seconds |
| One-shot scripts | All events | On exit |
Don't Over-Buffer
The larger the buffer, the more events are lost if the process crashes before a flush. For financial services, use smaller batches with more frequent flushes:
# For payment services: small batches, short flush
sender = BufferedSender(client, max_size=100, flush_interval_ms=5000)
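To make the trade-off between `max_size` and `flush_interval_ms` concrete, here is a minimal sketch of how a size- and time-bounded buffer can behave. It is not the SDK's `BufferedSender`: the class name is hypothetical, the client is assumed to expose a `send_batch(events)` method, and the interval is only checked when an event is added (a production sender would normally use a timer and locking).

```python
import time


class SketchBufferedSender:
    """Illustrative size- and time-bounded buffer (not the InfraSage SDK class)."""

    def __init__(self, client, max_size=100, flush_interval_ms=5000, clock=time.monotonic):
        self._client = client  # assumed to expose send_batch(list_of_events)
        self._max_size = max_size
        self._interval_s = flush_interval_ms / 1000.0
        self._clock = clock
        self._buffer = []
        self._last_flush = clock()

    def add(self, event):
        """Buffer one event; flush if the buffer is full or the interval elapsed."""
        self._buffer.append(event)
        too_full = len(self._buffer) >= self._max_size
        too_old = self._clock() - self._last_flush >= self._interval_s
        if too_full or too_old:
            self.flush()

    def flush(self):
        """Send whatever is buffered as a single batch."""
        self._last_flush = self._clock()
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._client.send_batch(batch)
```

At most `max_size` events are ever held in memory, which is the upper bound on what a crash can lose in this sketch.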
Always Flush on Shutdown
# Python
import atexit
atexit.register(sender.flush)
// Go
defer sender.Flush()
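In Python, `atexit` handlers do not run when the process is killed by a signal that Python does not handle, and orchestrators commonly stop containers with SIGTERM. A minimal sketch that covers both paths follows; the helper name `flush_on_exit` is hypothetical, and `sender` is assumed to expose a `flush()` method as in the snippet above.

```python
import atexit
import signal
import sys


def flush_on_exit(sender):
    """Flush on normal interpreter exit and on SIGTERM."""
    atexit.register(sender.flush)

    def _on_sigterm(signum, frame):
        # sys.exit raises SystemExit, which lets atexit handlers run.
        sys.exit(0)

    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, _on_sigterm)
```

Nothing can be flushed on SIGKILL or a hard crash, which is why the buffer size still matters.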
Structuring Logs for RCA
RCA uses log lines as evidence. Structured logs with consistent fields give significantly better RCA quality.
Include in Every Log Line
{
"timestamp": "2026-04-30T14:32:11Z",
"service_id": "payment-service",
"severity": "error",
"message": "Payment charge failed",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"error": "upstream timeout after 3000ms",
"provider": "stripe",
"amount_cents": 4999,
"currency": "EUR"
}
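With Python's standard `logging` module, a custom formatter can produce lines in this shape. This is a sketch under assumptions: the class name `JsonFormatter` and the convention of passing extra context under a `fields` key are choices made for this example, not InfraSage requirements.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields listed above."""

    def __init__(self, service_id):
        super().__init__()
        self._service_id = service_id

    def format(self, record):
        created = datetime.fromtimestamp(record.created, timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "service_id": self._service_id,
            "severity": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Extra context (trace_id, provider, ...) passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)
```

Usage: attach `JsonFormatter("payment-service")` to a handler, then call `logger.error("Payment charge failed", extra={"fields": {"provider": "stripe", "trace_id": trace_id}})`.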
Don't Include in Logs
{
"card_number": "4111...", ❌ PCI data — never log
"raw_request_body": "...", ❌ too large, often contains PII
}
Configure field exclusions in InfraSage:
LOG_FIELD_EXCLUSIONS=user_email,card_number,password,token,secret
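Sensitive fields are safest if they never leave the service at all. As a defence in depth, the same comma-separated list can be applied client-side before a log entry is sent. The helpers below are a hypothetical sketch that only handles top-level fields; how InfraSage itself treats nested fields is not specified here.

```python
def parse_exclusions(value: str) -> set:
    """Split a comma-separated exclusion list into a set of field names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def scrub(entry: dict, excluded: set) -> dict:
    """Return a copy of a log entry without excluded top-level fields."""
    return {key: val for key, val in entry.items() if key not in excluded}
```

For example, `scrub(entry, parse_exclusions("user_email,card_number,password,token,secret"))` drops those five keys and leaves every other field untouched.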
Organizing Tenants
One Tenant Per Team (Recommended)
Each team owns its API keys, RBAC, and alert configuration independently. The platform team holds the Super-Admin role.
tenant: payments-team → owns payment-service, fraud-service
tenant: identity-team → owns auth-service, session-service
tenant: platform-team → owns api-gateway, k8s infrastructure
One Tenant Per Environment
Alternatively, isolate by environment:
tenant: prod → all production services
tenant: staging → all staging services
Choose team-based tenants if you want independent billing/limits per team. Choose environment-based tenants if you want a single unified view per environment.
Don't Mix Production and Non-Production in One Tenant
Anomaly baselines are per (tenant, service_id). If staging services share a tenant with prod, noisy staging deployments contaminate prod baselines.
Multi-Service Correlation
To get the most out of RCA, ensure dependencies between services are visible to InfraSage via traces. A trace links parent and child spans across services, enabling the causal graph to follow failure paths.
Minimum trace requirements for good RCA:
- Every outbound HTTP call creates a child span with service_id of the downstream service
- Span status is set to error on failure (not just a log line)
- trace_id is propagated via traceparent header (W3C Trace Context)
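For reference, a version-00 `traceparent` value has four dash-separated lowercase hex fields: version, a 32-character trace ID, a 16-character parent span ID, and 2-character flags. The sketch below builds and parses that layout by hand to show what is being propagated; the function names are illustrative, and in practice an OpenTelemetry SDK normally handles this.

```python
import re

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), lowercase only.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent value for an outbound request."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled) or None if malformed."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per W3C Trace Context.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)
```

The downstream service reuses the parsed trace ID and records the parsed span ID as its parent, which is what lets the causal graph connect the two services.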
See OpenTelemetry Integration for setup.