Changelog

All notable changes to InfraSage are documented here. The project follows Semantic Versioning.

v12.x

v12.11 — 2026-04-28

Bug fixes

Fixed ClickHouse connection pool exhaustion in the vectorizer component that caused login failures after ~50 minutes of uptime. The PrepareBatch call was not returning connections to the pool on error paths.
Fixed adaptive threshold baseline reset incorrectly triggering on service restart.
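
The pool-exhaustion fix comes down to releasing a connection on every exit path, including errors. The following is a minimal Python sketch of that pattern, not the vectorizer's actual code: `TinyPool`, `prepare_batch`, and the simulated failure are hypothetical stand-ins.

```python
import queue
from contextlib import contextmanager


class TinyPool:
    """Hypothetical fixed-size pool standing in for a ClickHouse connection pool."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    @contextmanager
    def connection(self):
        conn = self._free.get_nowait()  # raises queue.Empty when the pool is exhausted
        try:
            yield conn
        finally:
            # Runs on success and on error, so the connection always goes back.
            self._free.put(conn)

    def available(self):
        return self._free.qsize()


def prepare_batch(pool, rows):
    """Hypothetical batch step that fails on empty input."""
    with pool.connection() as conn:
        if not rows:
            raise ValueError("empty batch")
        return f"{conn}: {len(rows)} rows"


pool = TinyPool(size=2)
for _ in range(10):
    try:
        prepare_batch(pool, [])  # every call takes the error path
    except ValueError:
        pass
```

Without the `finally` clause, the ten failing calls would drain a pool of two connections after the second call; with it, the pool stays full.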

Improvements

RCA causal graph construction is now 40% faster for services with >50 dependencies.
DLQ stats endpoint now includes per-topic breakdown.

v12.10 — 2026-04-14

New features

Runbook dry-run mode: Execute any runbook step in simulation mode to validate it without making changes. Set dry_run: true in the runbook definition or pass ?dry_run=true to the execution API.
Slack approval flows: Runbook steps with approval_required: true now send an interactive Slack message with Approve / Reject buttons. Approvals are logged in the audit trail.
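
As an illustration of the two dry-run switches described above, the sketch below resolves whether an execution should be simulated. It assumes that `dry_run` sits at the top level of the runbook definition and that either switch alone is enough; the function name `is_dry_run` is hypothetical and not part of the InfraSage API.

```python
from urllib.parse import parse_qs


def is_dry_run(runbook, query_string=""):
    """Return True if the definition or the query string requests a dry run."""
    params = parse_qs(query_string.lstrip("?"))
    # parse_qs returns a list per key; take the first value if present.
    from_query = params.get("dry_run", ["false"])[0].lower() == "true"
    from_definition = bool(runbook.get("dry_run", False))
    return from_definition or from_query
```

For example, `is_dry_run({}, "?dry_run=true")` and `is_dry_run({"dry_run": True})` both evaluate to `True`.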

Improvements

Ingestion Gateway now returns a structured error response on batch validation failures, including the index and field name of each rejected event.
ML Engine degradation trend detection threshold is now configurable via ML_DEGRADATION_SLOPE_THRESHOLD.
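
To make the batch validation change concrete, here is a rough sketch of a validator that reports the index and field name of each rejected event. The required fields and the response keys (`accepted`, `rejected`, `reason`) are assumptions for illustration; the real Ingestion Gateway response schema may differ.

```python
# Hypothetical required fields; the actual event schema is not given here.
REQUIRED_FIELDS = ("service", "type", "timestamp")


def validate_batch(events):
    """Collect one error entry per missing field, tagged with the event index."""
    errors = []
    for index, event in enumerate(events):
        for field in REQUIRED_FIELDS:
            if field not in event:
                errors.append({"index": index, "field": field, "reason": "missing"})
    rejected_indexes = {error["index"] for error in errors}
    return {"accepted": len(events) - len(rejected_indexes), "rejected": errors}
```

A client can use the `index` values to resubmit only the events that failed.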

Breaking changes

Runbook steps[].action renamed to steps[].type for consistency with the telemetry type field. Update existing runbook definitions before upgrading.
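
One way to update existing definitions before upgrading is a small migration helper like the one below. It is a sketch that assumes runbooks are loaded as plain dictionaries with a top-level `steps` list; `migrate_runbook` is a hypothetical name, not a shipped tool.

```python
import copy


def migrate_runbook(runbook):
    """Return a copy of the runbook with steps[].action renamed to steps[].type."""
    migrated = copy.deepcopy(runbook)  # leave the caller's definition untouched
    for step in migrated.get("steps", []):
        # Steps that already carry a type are left alone.
        if "action" in step and "type" not in step:
            step["type"] = step.pop("action")
    return migrated
```

The helper is idempotent, so running it on an already migrated definition changes nothing.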

v12.9 — 2026-03-31

New features

Incident memory search: New endpoint GET /api/v1/rca/similar?anomaly_id=<id> returns the 5 most similar historical RCA reports from the vector index.
Multi-tenant billing API: Super-admins can now query per-tenant event counts for the current billing period via GET /api/v1/billing/usage.
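
A request URL for the incident memory search endpoint can be assembled as sketched below. The base URL and the anomaly ID are placeholders, and authentication is omitted; only the path and the `anomaly_id` parameter come from the entry above.

```python
from urllib.parse import urlencode, urljoin


def similar_rca_url(base_url, anomaly_id):
    """Build the GET URL for the similar-RCA lookup."""
    query = urlencode({"anomaly_id": anomaly_id})  # percent-encodes the ID if needed
    return urljoin(base_url, "/api/v1/rca/similar") + "?" + query


# Placeholder host and ID, for illustration only.
url = similar_rca_url("https://infrasage.example.com", "anom-123")
```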

Improvements

Vector index now uses HNSW with ef_construction=200 for better recall on incident similarity search.
ClickHouse table TTL expressions updated to use toStartOfDay for more predictable retention behavior.

v12.8 — 2026-03-17

New features

Microsoft Teams integration: Alert notifications and runbook approval flows now support Microsoft Teams via Adaptive Cards. See Microsoft Teams.
Prometheus remote-write endpoint: Services can now push metrics directly to InfraSage using the Prometheus remote-write protocol at /api/v1/prometheus/remote_write.

Bug fixes

Fixed a race condition in the Watchdog ring buffer that could cause Z-scores to be computed on stale data during Kafka consumer lag spikes.
Fixed Isolation Forest feature matrix not updating after a service's metric set changed.

v12.7 — 2026-03-03

New features

SLO telemetry type: New slo event type for tracking error budget consumption. It has its own retention policy and its own Grafana dashboard panel.
Log field exclusions: Configure LOG_FIELD_EXCLUSIONS to strip sensitive fields from log payloads before storage.
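
The effect of log field exclusions can be pictured with the sketch below. It assumes that `LOG_FIELD_EXCLUSIONS` holds a comma-separated list of top-level field names, which this changelog does not confirm; the helper names are hypothetical.

```python
import os


def excluded_fields(env=os.environ):
    """Parse LOG_FIELD_EXCLUSIONS (assumed comma-separated) into a set of names."""
    raw = env.get("LOG_FIELD_EXCLUSIONS", "")
    return {name.strip() for name in raw.split(",") if name.strip()}


def strip_fields(payload, excluded):
    """Return a copy of the log payload without the excluded top-level fields."""
    return {key: value for key, value in payload.items() if key not in excluded}
```

With `LOG_FIELD_EXCLUSIONS="password, ssn"`, a payload containing `msg`, `password`, and `ssn` would be stored with only `msg`.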

Improvements

RCA evidence gathering now includes the last 5 deployments (K8s ReplicaSet changes) for each affected service.
Anomaly detail page in Admin UI now shows the causal graph as an interactive D3 visualization.

v11.x

v11.5 — 2026-02-10

New features

Jira integration: Automatically create Jira issues when anomalies are declared. Configurable priority mapping, custom fields, and parent epic linking. See Jira Integration.
API key rotation: New POST /api/v1/keys/:id/rotate endpoint generates a replacement key and atomically retires the old one.

Bug fixes

Fixed tenant isolation bypass where a malformed tenant_id in a JWT could return anomalies from the default tenant.
Fixed ML Engine model drift detector triggering false positives on services with fewer than 7 days of training data.

v11.0 — 2026-01-20

New features

ML Engine: New service providing XGBoost-based forecasting, ARIMA time-series prediction, degradation trend detection, and causal discovery. See ML Engine.
Adaptive thresholds: Layer 3 of anomaly detection now adjusts thresholds based on seasonal patterns and infrastructure-aware context.
Shadow model deployment: ML models can be deployed in shadow mode — predictions are computed but not used for alerting — allowing validation before promotion.

Breaking changes

Minimum ClickHouse version bumped to 23.8 LTS.
AIOPS_WATCHDOG_INTERVAL renamed to WATCHDOG_INTERVAL_SECONDS.
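
A pre-upgrade check for the renamed variable might look like the sketch below. It only inspects an environment mapping and reports names that still need renaming; `find_stale_variables` is a hypothetical helper, not part of InfraSage.

```python
import os

# Old name -> new name, taken from the breaking change listed above.
RENAMED = {"AIOPS_WATCHDOG_INTERVAL": "WATCHDOG_INTERVAL_SECONDS"}


def find_stale_variables(env=os.environ):
    """List messages for old variable names that are set without their replacement."""
    return [
        f"{old} is set; rename it to {new}"
        for old, new in RENAMED.items()
        if old in env and new not in env
    ]
```

An empty list means the environment no longer relies on the old name.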

v10.x

v10.3 — 2025-12-15

New features

PagerDuty integration: Bidirectional incident lifecycle sync. See PagerDuty Integration.
Runbook execution history: All runbook executions (including dry-runs) are now logged with inputs, outputs, and duration.

Improvements

Ingestion Gateway throughput increased ~30% via batched ClickHouse writes.
Admin UI redesigned with service health overview on the main dashboard.

v10.0 — 2025-11-01

New features

Multi-tenancy: Full tenant isolation with per-tenant RBAC, API keys, billing limits, and data separation. See Multi-Tenancy.
RBAC: Five-tier role hierarchy (Viewer, Operator, Admin, Super-Admin, System). See RBAC.

Breaking changes

All API endpoints now require a tenant_id context. Single-tenant deployments must create a default tenant.
ClickHouse schema migration required (automated via Helm post-upgrade hook).

For versions prior to v10.0, see the archived changelog in the repository.
