Changelog

All notable changes to InfraSage are listed here. Releases follow Semantic Versioning.


v12.x

v12.11 — 2026-04-28

Bug fixes

  • Fixed ClickHouse connection pool exhaustion in the vectorizer component that caused login failures after ~50 minutes of uptime. The PrepareBatch call was not returning connections to the pool on error paths.
  • Fixed adaptive threshold baseline reset incorrectly triggering on service restart.

Improvements

  • RCA causal graph construction is now 40% faster for services with >50 dependencies.
  • DLQ stats endpoint now includes per-topic breakdown.

v12.10 — 2026-04-14

New features

  • Runbook dry-run mode: Execute any runbook step in simulation mode to validate it without making changes. Set dry_run: true in the runbook definition or pass ?dry_run=true to the execution API.
  • Slack approval flows: Runbook steps with approval_required: true now send an interactive Slack message with Approve / Reject buttons. Approvals are logged in the audit trail.
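
As a minimal sketch of the query-parameter form of dry-run mode, the helper below adds `dry_run=true` to an execution URL. The changelog does not give the execution API path, so the URL in the usage note is a placeholder, and `with_dry_run` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_dry_run(url: str) -> str:
    """Return `url` with dry_run=true set, replacing any existing dry_run value."""
    parts = urlsplit(url)
    # Keep every existing query parameter except a previous dry_run setting.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "dry_run"
    ]
    query.append(("dry_run", "true"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `with_dry_run("https://infrasage.example.com/placeholder/execute?step=2")` yields the same URL with `&dry_run=true` appended. Setting `dry_run: true` in the runbook definition is the alternative when the flag should travel with the runbook itself.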

Improvements

  • Ingestion Gateway now returns a structured error response on batch validation failures, including the index and field name of each rejected event.
  • ML Engine degradation trend detection threshold is now configurable via ML_DEGRADATION_SLOPE_THRESHOLD.

Breaking changes

  • Runbook steps[].action renamed to steps[].type for consistency with the telemetry type field. Update existing runbook definitions before upgrading.
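
One way to update existing definitions before upgrading is a small migration pass over each runbook. The sketch below assumes a runbook has already been loaded into a Python dict with a `steps` list of dicts. The loading format and the function name `migrate_runbook` are assumptions and not part of InfraSage.

```python
from copy import deepcopy


def migrate_runbook(runbook: dict) -> dict:
    """Return a copy of `runbook` with steps[].action renamed to steps[].type."""
    migrated = deepcopy(runbook)  # leave the caller's definition untouched
    for step in migrated.get("steps", []):
        if "action" not in step:
            continue  # already migrated, or the step never had an action
        if "type" in step:
            # Ambiguous definition: refuse to guess which value should win.
            raise ValueError("step has both 'action' and 'type'; resolve manually")
        step["type"] = step.pop("action")
    return migrated
```

The function is idempotent, so running it over definitions that were already migrated leaves them unchanged.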

v12.9 — 2026-03-31

New features

  • Incident memory search: New endpoint GET /api/v1/rca/similar?anomaly_id=<id> returns the 5 most similar historical RCA reports from the vector index.
  • Multi-tenant billing API: Super-admins can now query per-tenant event counts for the current billing period via GET /api/v1/billing/usage.
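
As a rough illustration of calling the incident memory search endpoint, the sketch below builds the request URL from a base URL and an anomaly ID. The host name is a placeholder, `similar_rca_url` is a hypothetical helper, and authentication and the response schema are not covered by this changelog.

```python
from urllib.parse import urlencode, urljoin


def similar_rca_url(base_url: str, anomaly_id: str) -> str:
    """Build the URL for GET /api/v1/rca/similar for the given anomaly ID."""
    # urlencode escapes any characters in the ID that are unsafe in a query string.
    query = urlencode({"anomaly_id": anomaly_id})
    return urljoin(base_url, "/api/v1/rca/similar") + "?" + query
```

For example, `similar_rca_url("https://infrasage.example.com", "abc123")` produces `https://infrasage.example.com/api/v1/rca/similar?anomaly_id=abc123`.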

Improvements

  • Vector index now uses HNSW with ef_construction=200 for better recall on incident similarity search.
  • ClickHouse table TTL expressions updated to use toStartOfDay for more predictable retention behavior.

v12.8 — 2026-03-17

New features

  • Microsoft Teams integration: Alert notifications and runbook approval flows now support Microsoft Teams via Adaptive Cards. See Microsoft Teams.
  • Prometheus remote-write endpoint: Services can now push metrics directly to InfraSage using the Prometheus remote-write protocol at /api/v1/prometheus/remote_write.

Bug fixes

  • Fixed a race condition in the Watchdog ring buffer that could cause Z-scores to be computed on stale data during Kafka consumer lag spikes.
  • Fixed Isolation Forest feature matrix not updating after a service's metric set changed.

v12.7 — 2026-03-03

New features

  • SLO telemetry type: New slo event type for tracking error budget consumption, with its own retention policy and its own Grafana dashboard panel.
  • Log field exclusions: Configure LOG_FIELD_EXCLUSIONS to strip sensitive fields from log payloads before storage.
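
To illustrate the idea behind log field exclusions, the sketch below drops named fields from a log payload before it would be stored. The changelog does not specify the format of LOG_FIELD_EXCLUSIONS, so a comma-separated list of top-level field names is assumed here, and both function names are hypothetical.

```python
def parse_exclusions(raw: str) -> set[str]:
    """Parse an assumed comma-separated exclusion list, ignoring blank entries."""
    return {name.strip() for name in raw.split(",") if name.strip()}


def strip_fields(payload: dict, exclusions: set[str]) -> dict:
    """Return a copy of `payload` without the excluded top-level fields."""
    return {key: value for key, value in payload.items() if key not in exclusions}
```

With `LOG_FIELD_EXCLUSIONS="password, token"`, a payload such as `{"msg": "login", "password": "x"}` would be reduced to `{"msg": "login"}` under these assumptions. Nested fields are not handled in this sketch.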

Improvements

  • RCA evidence gathering now includes the last 5 deployments (K8s ReplicaSet changes) for each affected service.
  • Anomaly detail page in Admin UI now shows the causal graph as an interactive D3 visualization.

v11.x

v11.5 — 2026-02-10

New features

  • Jira integration: Automatically create Jira issues when anomalies are declared. Configurable priority mapping, custom fields, and parent epic linking. See Jira Integration.
  • API key rotation: New POST /api/v1/keys/:id/rotate endpoint generates a replacement key and atomically retires the old one.

Bug fixes

  • Fixed tenant isolation bypass where a malformed tenant_id in a JWT could return anomalies from the default tenant.
  • Fixed ML Engine model drift detector triggering false positives on services with fewer than 7 days of training data.

v11.0 — 2026-01-20

New features

  • ML Engine: New service providing XGBoost-based forecasting, ARIMA time-series prediction, degradation trend detection, and causal discovery. See ML Engine.
  • Adaptive thresholds: Layer 3 of anomaly detection now adjusts thresholds based on seasonal patterns and infrastructure-aware context.
  • Shadow model deployment: ML models can be deployed in shadow mode — predictions are computed but not used for alerting — allowing validation before promotion.

Breaking changes

  • Minimum ClickHouse version bumped to 23.8 LTS.
  • AIOPS_WATCHDOG_INTERVAL renamed to WATCHDOG_INTERVAL_SECONDS.
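
The environment variable rename can be applied mechanically to an existing configuration. The sketch below works on a dict of environment settings and assumes the value itself carries over unchanged. The changelog does not say whether the unit of the old variable differed, so that should be checked before reuse. `migrate_env` is a hypothetical helper name.

```python
OLD_NAME = "AIOPS_WATCHDOG_INTERVAL"
NEW_NAME = "WATCHDOG_INTERVAL_SECONDS"


def migrate_env(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of `env` with the old watchdog variable renamed."""
    migrated = dict(env)
    if OLD_NAME in migrated:
        value = migrated.pop(OLD_NAME)
        # If the new name is already set, keep it and discard the old value.
        migrated.setdefault(NEW_NAME, value)
    return migrated
```

For example, `migrate_env({"AIOPS_WATCHDOG_INTERVAL": "30"})` returns `{"WATCHDOG_INTERVAL_SECONDS": "30"}`.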

v10.x

v10.3 — 2025-12-15

New features

  • PagerDuty integration: Bidirectional incident lifecycle sync. See PagerDuty Integration.
  • Runbook execution history: All runbook executions (including dry-runs) are now logged with inputs, outputs, and duration.

Improvements

  • Ingestion Gateway throughput increased ~30% via batched ClickHouse writes.
  • Admin UI redesigned with service health overview on the main dashboard.

v10.0 — 2025-11-01

New features

  • Multi-tenancy: Full tenant isolation with per-tenant RBAC, API keys, billing limits, and data separation. See Multi-Tenancy.
  • RBAC: Five-tier role hierarchy (Viewer, Operator, Admin, Super-Admin, System). See RBAC.

Breaking changes

  • All API endpoints now require a tenant_id context. Single-tenant deployments must create a default tenant.
  • ClickHouse schema migration required (automated via Helm post-upgrade hook).

For versions prior to v10.0, see the archived changelog in the repository.