Glossary
Definitions for terms used throughout the InfraSage documentation.
Adaptive Threshold
A dynamically computed anomaly boundary that accounts for seasonal and infrastructure-aware patterns. Unlike a static threshold, it relaxes during low-traffic periods (e.g., nights) and tightens during high-traffic periods. Used in Layer 3 of the anomaly detection pipeline.
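The relax/tighten behavior can be illustrated with a minimal sketch. The formula below is an assumption made for illustration only; this glossary does not specify how InfraSage computes the boundary. The names adaptive_threshold, traffic_ratio, and max_relaxation are hypothetical.

```python
def adaptive_threshold(base_threshold, traffic_ratio, max_relaxation=0.5):
    """Scale a base anomaly boundary by current traffic level.

    traffic_ratio is assumed to be current traffic divided by peak traffic,
    in [0, 1]. At peak traffic the boundary equals base_threshold (tightest);
    at zero traffic it is widened by max_relaxation (most relaxed).
    This linear rule is illustrative, not InfraSage's actual formula.
    """
    ratio = min(max(traffic_ratio, 0.0), 1.0)  # clamp to [0, 1]
    return base_threshold * (1.0 + max_relaxation * (1.0 - ratio))
```

With a base of 3.0, this sketch yields 3.0 at peak traffic and 4.5 when traffic is at zero, so quiet periods need a larger deviation before a value is flagged.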
AIOps
AI for IT Operations. A category of platform that uses ML and AI to automate detection, diagnosis, and remediation of infrastructure and application issues. InfraSage is an AIOps platform.
AIops Engine
InfraSage's central processing service. Runs the Watchdog, triggers RCA, manages the vector index, and coordinates runbook execution.
Anomaly
A telemetry signal that deviates significantly from its expected baseline, as determined by Z-score analysis, Isolation Forest scoring, or adaptive threshold comparison. Anomalies are stored in the anomalies table in ClickHouse.
API Key
A static credential used for programmatic access to InfraSage. Keys have a scope (ingestion, readonly, or full) that limits their allowed operations. See API Keys.
Blast Radius
A measure of how many downstream services are likely affected by an anomaly in a given service. Computed from the dependency graph during RCA. A service with a high blast radius affects many dependents.
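One simple way to picture this is to count every service reachable downstream in the dependency graph. The sketch below assumes the graph is a dict mapping each service to its direct dependents; the function name and the unweighted count are illustrative assumptions, and InfraSage's actual measure may weight or filter dependents differently.

```python
from collections import deque

def blast_radius(dependents, service):
    """Count services transitively downstream of `service`.

    `dependents` maps a service name to the services that depend on it
    directly. Breadth-first traversal; the `seen` set keeps cycles from
    looping and prevents double counting.
    """
    seen = {service}
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dep in dependents.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return len(seen) - 1  # exclude the starting service itself

# Hypothetical dependency graph for illustration.
graph = {
    "db": ["orders", "payments"],
    "orders": ["checkout"],
    "payments": ["checkout"],
    "checkout": [],
}
```

In this example graph, an anomaly in db can reach orders, payments, and checkout, while an anomaly in checkout reaches nothing downstream.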
Causal Graph
A directed graph where nodes are services and edges represent dependencies (typically derived from traces). Used during RCA to propagate anomaly signals from a root cause service to downstream dependents.
ClickHouse
The columnar database used by InfraSage for storing all telemetry, anomalies, RCA results, audit logs, and ML model metadata. Chosen for its high-throughput append-write performance and fast analytical queries.
Cooldown
A period after an anomaly is declared during which InfraSage suppresses re-alerting for the same (service_id, metric_name) pair. Prevents alert storms during prolonged incidents. Configurable via WATCHDOG_COOLDOWN_SECONDS.
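The suppression logic can be sketched as a lookup of the last alert time per (service_id, metric_name) pair. The class and method names below are hypothetical, and the sketch takes the cooldown length as a constructor argument rather than reading WATCHDOG_COOLDOWN_SECONDS, since this glossary does not say how that setting is loaded.

```python
import time

class CooldownTracker:
    """Suppress repeat alerts for the same (service_id, metric_name) pair."""

    def __init__(self, cooldown_seconds):
        self.cooldown_seconds = cooldown_seconds
        self._last_alert = {}  # (service_id, metric_name) -> last alert time

    def should_alert(self, service_id, metric_name, now=None):
        """Return True and record the alert unless the pair is cooling down."""
        now = time.monotonic() if now is None else now
        key = (service_id, metric_name)
        last = self._last_alert.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cooldown window
        self._last_alert[key] = now
        return True
```

Because the key is the pair, a cooldown on one metric of a service does not suppress alerts for a different metric of the same service.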
Dead Letter Queue (DLQ)
A Kafka topic that receives events rejected by the Ingestion Gateway due to validation errors or schema mismatches. DLQ statistics are available via the /api/v1/dlq/stats endpoint.
Degradation Trend
A slow, monotonic worsening of a metric that has not yet crossed an anomaly threshold. Detected by the ML Engine's degradation trend analysis. Example: error rate creeping from 0.1% to 0.8% over 4 hours.
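As a rough illustration of the definition, the check below flags a series that never decreases, stays under the anomaly threshold, and has risen by at least a minimum amount. It is a deliberately strict sketch for metrics where higher is worse; the function name and the min_rise parameter are assumptions, and the ML Engine's actual analysis is not described here and likely tolerates noise.

```python
def is_degradation_trend(samples, anomaly_threshold, min_rise):
    """Return True for a monotonic worsening that is still below threshold.

    samples: metric values in time order (higher means worse).
    anomaly_threshold: the level at which the metric would count as an anomaly.
    min_rise: smallest first-to-last increase worth reporting.
    """
    if len(samples) < 2:
        return False
    non_decreasing = all(b >= a for a, b in zip(samples, samples[1:]))
    below_threshold = max(samples) < anomaly_threshold
    rise = samples[-1] - samples[0]
    return non_decreasing and below_threshold and rise >= min_rise
```

For the example above, error-rate samples of 0.1, 0.2, 0.35, 0.5, and 0.8 percent with a hypothetical threshold of 1.0 and a min_rise of 0.5 would be reported as a degradation trend.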
Forecast
A predicted future value range for a metric, computed by the ML Engine using ARIMA or XGBoost models trained on historical data. Available via the /api/v1/ml/forecast endpoint.
HNSW (Hierarchical Navigable Small World)
The vector index algorithm used by InfraSage to find semantically similar historical incidents. During RCA, InfraSage queries the HNSW index to surface past incidents with similar causal signatures.
Incident Memory
The vector-indexed store of resolved RCA reports. When a new anomaly occurs, InfraSage searches incident memory for similar past events to provide context in the RCA explanation.
Ingestion Gateway
The HTTP service that receives telemetry events from instrumented services. Validates, enriches, and writes events to Kafka. Exposes /api/v1/telemetry and /api/v1/telemetry/batch.
Integration Poller
A component that periodically pulls telemetry from external systems (AWS CloudWatch, GCP Monitoring, Azure Monitor) and converts it to InfraSage's internal telemetry format.
Isolation Forest
An unsupervised ML algorithm used in Layer 2 of anomaly detection. Scores multivariate feature vectors (combining multiple metrics per service) by how anomalous they are. Good at detecting correlated multi-metric anomalies that Z-score analysis misses.
JWT (JSON Web Token)
The authentication token used by human users. Contains tenant_id, role, and expiry claims. Short-lived (1 hour in production).
Kafka
The message queue used by InfraSage for internal telemetry streaming. Provides durability (24-hour default retention) and decouples ingestion from processing.
ML Engine
InfraSage's machine learning service. Runs XGBoost-based forecasting, degradation trend analysis, causal discovery, and model drift detection. Separate from the AIops Engine.
OTLP (OpenTelemetry Protocol)
The wire protocol used by OpenTelemetry SDKs and Collectors to export telemetry. InfraSage's Ingestion Gateway accepts OTLP over HTTP (/api/v1/otlp) and gRPC.
RCA (Root Cause Analysis)
The process of identifying the likely origin of an anomaly. InfraSage's RCA pipeline builds a causal graph, gathers evidence (recent deployments, correlated anomalies, log lines), and uses Claude to generate a structured explanation.
Runbook
An automated response to an anomaly. A runbook defines a trigger condition and a sequence of steps (Kubernetes actions, HTTP calls, Slack messages, shell commands). Steps can require human approval before execution.
Scale Profile
A named configuration set (small, medium, large) that determines resource allocations for all InfraSage components. Profiles are designed for specific event volume ranges. See Scale Profiles.
Service ID
A stable string identifier for a logical service (e.g., payment-service). The primary grouping dimension for all telemetry. Should not change between deployments.
SLO (Service Level Objective)
A target reliability goal. InfraSage supports an SLO telemetry type for tracking error budgets alongside infrastructure metrics.
Telemetry Operator
The InfraSage component that reads from Kafka, applies field transformations and enrichment, and writes to ClickHouse. Also handles tenant isolation and schema validation.
Tenant
An isolated namespace within InfraSage. Each tenant has its own API keys, RBAC configuration, telemetry data, and billing limits. Multi-tenant isolation is enforced at the ClickHouse query level.
Vector Index
A searchable store of embedding vectors representing historical RCA reports. Used to find semantically similar past incidents. Backed by HNSW in-memory indexing.
Watchdog
The anomaly detection loop inside the AIops Engine. Runs on a configurable interval (default: 60 seconds) and evaluates the three-layer detection pipeline (Z-score → Isolation Forest → Adaptive Threshold) for each active (service_id, metric_name) pair.
Z-Score
A statistical measure of how many standard deviations a value is from the rolling mean. Used in Layer 1 of anomaly detection. Z = (value - mean) / stddev. Values with |Z| > threshold (default: 3.0) are flagged as anomalies.
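The formula can be sketched directly. This example assumes the rolling window is passed in as a plain list of recent values and uses the population standard deviation; the glossary does not say which variant InfraSage uses, and the function names are illustrative.

```python
from statistics import mean, pstdev

def z_score(value, window):
    """Return (value - mean) / stddev for `value` against a rolling window.

    Returns 0.0 when the window has zero variance, to avoid dividing by zero.
    """
    mu = mean(window)
    sigma = pstdev(window)  # population stddev; an assumption of this sketch
    if sigma == 0:
        return 0.0
    return (value - mu) / sigma

def is_anomalous(value, window, threshold=3.0):
    """Flag the value when |Z| exceeds the threshold (default 3.0)."""
    return abs(z_score(value, window)) > threshold

# Hypothetical recent values for one (service_id, metric_name) pair.
window = [10, 12, 11, 9, 10, 11, 10, 12]
```

Against this window, a new value of 20 lies far more than three standard deviations from the mean and is flagged, while a value of 11 is not.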