Changelog
All notable changes to InfraSage. Follows Semantic Versioning.
v12.x
v12.11 — 2026-04-28
Bug fixes
- Fixed ClickHouse connection pool exhaustion in the vectorizer component that caused login failures after ~50 minutes of uptime. The PrepareBatch call was not returning connections to the pool on error paths.
- Fixed adaptive threshold baseline reset incorrectly triggering on service restart.
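The pool exhaustion fix above comes down to releasing a connection on every exit path, including errors. The following is a minimal sketch of that pattern, not the vectorizer's actual code: the `Pool` class and `prepare_batch` function are hypothetical stand-ins.

```python
import queue
from contextlib import contextmanager


class Pool:
    """Toy fixed-size pool; a stand-in for a real ClickHouse client pool."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    @contextmanager
    def connection(self, timeout=0.1):
        # Raises queue.Empty if no connection is free within the timeout.
        conn = self._free.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Runs on success and on error, so the slot is always returned.
            self._free.put(conn)

    def available(self):
        return self._free.qsize()


def prepare_batch(pool, rows):
    """Hypothetical batch-preparation step that can fail."""
    with pool.connection() as conn:
        if not rows:
            raise ValueError("empty batch")
        return conn, len(rows)


pool = Pool(size=2)
for _ in range(10):
    try:
        prepare_batch(pool, [])  # fails every time
    except ValueError:
        pass
```

Without the `finally` clause, the ten failing calls would drain a two-connection pool after the second attempt.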
Improvements
- RCA causal graph construction is now 40% faster for services with >50 dependencies.
- DLQ stats endpoint now includes per-topic breakdown.
v12.10 — 2026-04-14
New features
- Runbook dry-run mode: Execute any runbook step in simulation mode to validate it without making changes. Set dry_run: true in the runbook definition or pass ?dry_run=true to the execution API.
- Slack approval flows: Runbook steps with approval_required: true now send an interactive Slack message with Approve / Reject buttons. Approvals are logged in the audit trail.
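As a rough illustration of dry-run semantics, the sketch below treats a run as simulated when either the runbook definition or the request asks for it. It assumes a runbook parsed into a dictionary; the `name` field, the `execute` callback, and the result shape are hypothetical and not taken from InfraSage.

```python
def run_step(step, execute, dry_run=False):
    """Run one step, or only report what would run when dry_run is set."""
    if dry_run:
        return {"step": step["name"], "status": "simulated"}
    execute(step)
    return {"step": step["name"], "status": "executed"}


def run_runbook(runbook, execute, dry_run_param=False):
    # Either source enables simulation: the definition's dry_run flag
    # or the ?dry_run=true request parameter (passed in as dry_run_param).
    dry_run = dry_run_param or bool(runbook.get("dry_run", False))
    return [run_step(step, execute, dry_run) for step in runbook["steps"]]
```

Treating the two switches as a logical OR means a request can never turn a definition-level dry run into a real execution.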
Improvements
- Ingestion Gateway now returns a structured error response on batch validation failures, including the index and field name of each rejected event.
- ML Engine degradation trend detection threshold is now configurable via ML_DEGRADATION_SLOPE_THRESHOLD.
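To make the structured validation error concrete, here is a sketch of how a batch validator could report the index and field name of each rejected event. The required fields and the response keys are assumptions for illustration; the Ingestion Gateway's real schema may differ.

```python
REQUIRED_FIELDS = ("service", "timestamp", "type")  # assumed, not from the changelog


def validate_batch(events):
    """Return a summary listing every missing field by event index."""
    errors = []
    for index, event in enumerate(events):
        for field in REQUIRED_FIELDS:
            if field not in event:
                errors.append({"index": index, "field": field, "reason": "missing"})
    rejected = {error["index"] for error in errors}
    return {
        "accepted": len(events) - len(rejected),
        "rejected": len(rejected),
        "errors": errors,
    }
```

Reporting every failure in one response lets a client fix the whole batch in a single pass instead of resubmitting once per error.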
Breaking changes
- Runbook steps[].action renamed to steps[].type for consistency with the telemetry type field. Update existing runbook definitions before upgrading.
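Existing definitions can be updated mechanically. The sketch below assumes a runbook already parsed into a dictionary with a `steps` list; it is an illustrative helper, not a migration tool shipped with InfraSage.

```python
import copy


def migrate_runbook(runbook):
    """Return a copy of the runbook with steps[].action renamed to steps[].type."""
    migrated = copy.deepcopy(runbook)
    for step in migrated.get("steps", []):
        # Leave steps alone if they already use the new key.
        if "action" in step and "type" not in step:
            step["type"] = step.pop("action")
    return migrated
```

The function is idempotent, so running it on an already migrated definition leaves it unchanged.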
v12.9 — 2026-03-31
New features
- Incident memory search: New endpoint GET /api/v1/rca/similar?anomaly_id=<id> returns the 5 most similar historical RCA reports from the vector index.
- Multi-tenant billing API: Super-admins can now query per-tenant event counts for the current billing period via GET /api/v1/billing/usage.
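Conceptually, incident memory search ranks stored report embeddings by similarity to the query and returns the top five. The sketch below shows that ranking with an exact brute-force scan and cosine similarity. Both are assumptions chosen for clarity: the changelog does not state the distance metric, and the production index is approximate rather than exhaustive.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, reports, k=5):
    """Return the ids of the k reports whose embeddings are closest to query."""
    ranked = sorted(
        reports.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [report_id for report_id, _ in ranked[:k]]
```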
Improvements
- Vector index now uses HNSW with ef_construction=200 for better recall on incident similarity search.
- ClickHouse table TTL expressions updated to use toStartOfDay for more predictable retention behavior.
v12.8 — 2026-03-17
New features
- Microsoft Teams integration: Alert notifications and runbook approval flows now support Microsoft Teams via Adaptive Cards. See Microsoft Teams.
- Prometheus remote-write endpoint: Services can now push metrics directly to InfraSage using the Prometheus remote-write protocol at /api/v1/prometheus/remote_write.
Bug fixes
- Fixed a race condition in the Watchdog ring buffer that could cause Z-scores to be computed on stale data during Kafka consumer lag spikes.
- Fixed Isolation Forest feature matrix not updating after a service's metric set changed.
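For context on the Watchdog fix, the sketch below shows a Z-score computed against a fixed-size ring buffer. It is single-threaded and does not reproduce the race condition or its fix; the class name and window handling are illustrative assumptions.

```python
import statistics
from collections import deque


class RollingZScore:
    """Score each new value against a fixed-size window of earlier values."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # oldest value is evicted when full

    def update(self, value):
        """Return the Z-score of value against the window, then store it."""
        z = None  # not enough history yet
        if len(self.window) >= 2:
            mean = statistics.fmean(self.window)
            stdev = statistics.pstdev(self.window)
            z = (value - mean) / stdev if stdev > 0 else 0.0
        self.window.append(value)
        return z
```

Scoring before appending keeps the new value out of its own baseline, which is why stale window contents directly distort the result.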
v12.7 — 2026-03-03
New features
- SLO telemetry type: New slo event type for tracking error budget consumption. Separate retention policy, separate Grafana dashboard panel.
- Log field exclusions: Configure LOG_FIELD_EXCLUSIONS to strip sensitive fields from log payloads before storage.
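The effect of log field exclusions can be sketched as a filter applied before storage. The example assumes LOG_FIELD_EXCLUSIONS holds a comma-separated list of top-level field names; the actual value format and nesting behavior are not specified in this changelog.

```python
import os


def load_exclusions(env=os.environ):
    """Parse the exclusion list, assuming a comma-separated value."""
    raw = env.get("LOG_FIELD_EXCLUSIONS", "")
    return {name.strip() for name in raw.split(",") if name.strip()}


def strip_fields(payload, exclusions):
    """Return a copy of the payload without the excluded top-level fields."""
    return {key: value for key, value in payload.items() if key not in exclusions}
```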
Improvements
- RCA evidence gathering now includes the last 5 deployments (K8s ReplicaSet changes) for each affected service.
- Anomaly detail page in Admin UI now shows the causal graph as an interactive D3 visualization.
v11.x
v11.5 — 2026-02-10
New features
- Jira integration: Automatically create Jira issues when anomalies are declared. Configurable priority mapping, custom fields, and parent epic linking. See Jira Integration.
- API key rotation: New POST /api/v1/keys/:id/rotate endpoint generates a replacement key and atomically retires the old one.
Bug fixes
- Fixed tenant isolation bypass where a malformed tenant_id in a JWT could return anomalies from the default tenant.
- Fixed ML Engine model drift detector triggering false positives on services with fewer than 7 days of training data.
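The tenant isolation bug is an example of failing open: a malformed claim fell back to the default tenant. A fail-closed check might look like the sketch below, which assumes already verified JWT claims in a dictionary. The allowed identifier pattern is a hypothetical choice, not InfraSage's actual rule.

```python
import re

# Hypothetical format for valid tenant identifiers.
TENANT_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def resolve_tenant(claims):
    """Return the tenant_id claim, or raise instead of falling back to a default."""
    tenant_id = claims.get("tenant_id")
    if not isinstance(tenant_id, str) or not TENANT_ID.fullmatch(tenant_id):
        raise PermissionError("invalid tenant_id claim")
    return tenant_id
```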
v11.0 — 2026-01-20
New features
- ML Engine: New service providing XGBoost-based forecasting, ARIMA time-series prediction, degradation trend detection, and causal discovery. See ML Engine.
- Adaptive thresholds: Layer 3 of anomaly detection now adjusts thresholds based on seasonal patterns and infrastructure-aware context.
- Shadow model deployment: ML models can be deployed in shadow mode — predictions are computed but not used for alerting — allowing validation before promotion.
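Shadow mode can be pictured as running two models side by side while only one drives alerting. In this sketch the models are plain callables and the log is a list; all names are illustrative and not part of the ML Engine API.

```python
def evaluate(features, live_model, shadow_model, shadow_log):
    """Score with both models, record the pair, and return only the live result."""
    live = live_model(features)
    shadow = shadow_model(features)
    # The shadow prediction is recorded for later comparison, never acted on.
    shadow_log.append({"features": features, "live": live, "shadow": shadow})
    return live
```

Comparing the logged pairs over time gives the evidence needed to decide whether the shadow model is ready for promotion.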
Breaking changes
- Minimum ClickHouse version bumped to 23.8 LTS.
- AIOPS_WATCHDOG_INTERVAL renamed to WATCHDOG_INTERVAL_SECONDS.
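A pre-upgrade check can flag deployments that still set the old variable name. The helper below is a hypothetical sketch that inspects a mapping of environment variables; it is not part of InfraSage.

```python
RENAMED = {"AIOPS_WATCHDOG_INTERVAL": "WATCHDOG_INTERVAL_SECONDS"}


def stale_settings(env):
    """List old variable names that are set without their replacement."""
    return [
        f"{old} -> {new}"
        for old, new in RENAMED.items()
        if old in env and new not in env
    ]
```

Passing `os.environ` as `env` runs the check against the current process environment.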
v10.x
v10.3 — 2025-12-15
New features
- PagerDuty integration: Bidirectional incident lifecycle sync. See PagerDuty Integration.
- Runbook execution history: All runbook executions (including dry-runs) are now logged with inputs, outputs, and duration.
Improvements
- Ingestion Gateway throughput increased ~30% via batched ClickHouse writes.
- Admin UI redesigned with service health overview on the main dashboard.
v10.0 — 2025-11-01
New features
- Multi-tenancy: Full tenant isolation with per-tenant RBAC, API keys, billing limits, and data separation. See Multi-Tenancy.
- RBAC: Five-tier role hierarchy (Viewer, Operator, Admin, Super-Admin, System). See RBAC.
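A tiered role hierarchy is commonly modeled as an ordered enumeration, where a role satisfies a requirement if it ranks at least as high. The sketch below assumes the five roles are listed in ascending order of privilege, which the changelog implies but does not state outright.

```python
from enum import IntEnum


class Role(IntEnum):
    # Assumed ascending order of privilege.
    VIEWER = 1
    OPERATOR = 2
    ADMIN = 3
    SUPER_ADMIN = 4
    SYSTEM = 5


def is_allowed(role, required):
    """True if role ranks at or above the required role."""
    return role >= required
```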
Breaking changes
- All API endpoints now require a tenant_id context. Single-tenant deployments must create a default tenant.
- ClickHouse schema migration required (automated via Helm post-upgrade hook).
For versions prior to v10.0, see the archived changelog in the repository.