Vectorizer

The Vectorizer is the embedding pipeline that converts raw telemetry signals into fixed-dimension numerical vectors, one per service per minute. These vectors power InfraSage's ML anomaly detection, causal analysis, and the pillar embedding system.

How It Works

ClickHouse telemetry_signals + aggregated_metrics
         │  (GROUP BY service_id, 1-minute window)
         ▼
  18-dimensional embedding vector
  ┌──────────────────────────────────────────────┐
  │  Dims 0–1   : metric aggregates (avg, max)   │
  │  Dims 2–5   : signal ratios (error, warn,    │
  │               log, trace)                    │
  │  Dims 6–7   : topology (inbound, outbound)   │
  │  Dims 8–12  : time features (sin/cos + flag) │
  │  Dims 13–17 : named metrics (p99, http err,  │
  │               queue depth, db health, fatal) │
  └──────────────────────────────────────────────┘
         │
         ▼
  Euclidean distance vs. previous window embedding
         │
         ▼
  Weirdness score [0, ∞) per service
  (stored in infrasage_anomaly_scores)
         │
         ▼  (async)
  Pillar embedding workers
  (one goroutine per active pillar)

The pipeline runs every 60 seconds (configurable via VECTORIZER_INTERVAL_SECONDS). Scores are stored in ClickHouse and picked up by the Watchdog on the next poll cycle.
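The 1-minute grouping can be pictured as flooring each signal timestamp to the start of its window. The sketch below is a minimal illustration in Python, not InfraSage's implementation: `window_start` and `group_by_window` are hypothetical names, and the 60-second default simply mirrors VECTORIZER_INTERVAL_SECONDS.

```python
from collections import defaultdict
from datetime import datetime, timezone

def window_start(ts: datetime, interval_seconds: int = 60) -> datetime:
    """Floor a timezone-aware timestamp to the start of its window (UTC)."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % interval_seconds, tz=timezone.utc)

def group_by_window(signals, interval_seconds: int = 60):
    """Count signals per (service_id, window start).

    signals: iterable of (service_id, timestamp) pairs (hypothetical shape).
    """
    counts = defaultdict(int)
    for service_id, ts in signals:
        counts[(service_id, window_start(ts, interval_seconds))] += 1
    return dict(counts)
```

Each key in the result corresponds to one embedding vector: one service, one window.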

Embedding Dimensions

Default 18-dimensional layout

Dim index	Name	Source	Scale
0	`avg_value`	Average value across all aggregated metrics for the window	linear, ceiling 1000
1	`max_value`	Peak metric value for the window	linear, ceiling 5000
2	`error_rate`	`error_count / total events`, where total events = log + trace + metric + event counts	linear, ceiling 1
3	`warn_rate`	`warn_count / total events`	linear, ceiling 1
4	`log_ratio`	`log_count / total events`	linear, ceiling 1
5	`trace_ratio`	`trace_count / total events`	linear, ceiling 1
6	`inbound_edges`	Inbound service-topology edges	log-scale, ceiling 10M
7	`outbound_edges`	Outbound service-topology edges	log-scale, ceiling 10M
8	`time_sin`	`sin(2π × second_of_day / 86400)` — hour-of-day	cyclic
9	`time_cos`	`cos(2π × second_of_day / 86400)`	cyclic
10	`day_sin`	`sin(2π × weekday / 7)` — day-of-week	cyclic
11	`day_cos`	`cos(2π × weekday / 7)`	cyclic
12	`is_weekend`	`1` if Saturday/Sunday, `0` otherwise	binary
13	`p99_latency`	Average p99 latency (matched via dim-alias list)	linear, ceiling 5000 ms
14	`http_error_rate`	Average HTTP error rate (matched via dim-alias list)	linear, ceiling 100
15	`queue_depth_peak`	Peak queue depth (matched via dim-alias list)	linear, ceiling 10 000
16	`db_health`	Average DB health score (matched via dim-alias list)	linear, ceiling 10
17	`fatal_rate`	`fatal_count / log_count`	linear, ceiling 1
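The table gives each dimension's formula and ceiling but not the exact normalisation. The sketch below shows one plausible reading. It assumes that "linear, ceiling X" means clamp-then-divide, that "log-scale" means `log1p(value) / log1p(ceiling)`, and that `weekday` is numbered Monday=0 through Sunday=6. All three are assumptions, and the function names are hypothetical.

```python
import math
from datetime import datetime

def linear_scale(value: float, ceiling: float) -> float:
    """Assumed linear scaling: clamp to [0, ceiling], then divide by ceiling."""
    if ceiling <= 0:
        # This sketch does not model non-positive ceilings.
        raise ValueError("ceiling must be positive")
    return min(max(value, 0.0), ceiling) / ceiling

def log_scale(value: float, ceiling: float = 10_000_000) -> float:
    """Assumed log scaling: log1p(value) / log1p(ceiling), clamped to [0, 1]."""
    clamped = min(max(value, 0.0), ceiling)
    return math.log1p(clamped) / math.log1p(ceiling)

def time_features(ts: datetime) -> list[float]:
    """Dims 8-12: time_sin, time_cos, day_sin, day_cos, is_weekend."""
    second_of_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    weekday = ts.weekday()  # assumption: Monday=0 ... Sunday=6
    day_angle = 2 * math.pi * second_of_day / 86400
    week_angle = 2 * math.pi * weekday / 7
    return [
        math.sin(day_angle), math.cos(day_angle),
        math.sin(week_angle), math.cos(week_angle),
        1.0 if weekday >= 5 else 0.0,
    ]
```

Encoding time as sine/cosine pairs keeps 23:59 and 00:00 close together in the vector space, which a raw hour number would not.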

Three additional infrastructure dimensions are appended when INFRASTRUCTURE_DIMENSIONS_ENABLED=true:

Dim index	Name	Description	Scale
18	`disk_utilization`	Disk utilization percentage	linear, ceiling 100
19	`network_error_rate`	Network error rate percentage	linear, ceiling 100
20	`connection_pool_utilization`	DB connection pool utilization %	linear, ceiling 100

Operator-defined custom slots are appended after the base dims (starting at dim 21 by default, up to MAX_CUSTOM_EMBEDDING_SLOTS slots — 8 by default):

Dim index	Name	Source
21+N	`custom_slot_N`	Operator-defined metric (configurable label, metrics, ceiling)

Dims 8–12 (the cyclic time features plus the is_weekend flag) count toward the dimensions total and appear in the dimension_labels list returned by the stats API.

Per-Service Embedding Configuration

By default every service uses the global embedding configuration. You can override individual services with a custom set of metrics via the UI (Vectorizer page) or API.

Modes

Mode	Behaviour
`auto`	InfraSage auto-selects the highest-signal metrics for this service from the lookback window (AUTO_EMBEDDING_SELECT_LOOKBACK_DAYS, 7 days by default)
`manual`	You specify exactly which metrics map to which embedding slots
`disabled`	Service is excluded from the embedding pipeline

Auto-select

Auto-select runs a metric scoring algorithm that ranks candidate metrics by:

Coefficient of Variation (CV) — metrics with more natural variance carry more information
Coverage — how consistently the metric is reported across windows
Informativeness score — combined weighted rank

# Preview which metrics auto-select would choose for a service
GET /api/v1/embedding/config/payment-service/auto-preview

Response includes the ranked candidate list and which metrics would be selected with their suggested ceilings.
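To make the ranking criteria concrete, here is a minimal sketch of scoring candidates by CV and coverage. The equal weights, the cap on CV, and the use of a weighted sum to combine the two are illustrative assumptions; this page does not specify how InfraSage combines them, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV = standard deviation / |mean|; 0.0 when there is no data or the mean is zero."""
    if not values:
        return 0.0
    mu = mean(values)
    return pstdev(values) / abs(mu) if mu != 0 else 0.0

def coverage(values, total_windows):
    """Fraction of windows in which the metric was reported."""
    return len(values) / total_windows if total_windows else 0.0

def rank_candidates(series_by_metric, total_windows, cv_weight=0.5, coverage_weight=0.5):
    """Return metric names sorted by a weighted CV/coverage score, best first.

    The weights and the CV cap at 1.0 are assumptions made for this sketch.
    """
    scored = []
    for name, values in series_by_metric.items():
        cv = min(coefficient_of_variation(values), 1.0)
        score = cv_weight * cv + coverage_weight * coverage(values, total_windows)
        scored.append((score, name))
    # Sort by descending score, then by name for a stable order on ties.
    return [name for score, name in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Under this scoring, a metric that varies and is reported in every window outranks both a constant metric and a variable but rarely reported one.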

Manual configuration

curl -X PUT https://api.infrasage.dev/api/v1/embedding/config/payment-service \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "manual",
    "slots": [
      { "label": "checkout_errors",  "metrics": ["checkout_error_rate"],   "ceiling": 1 },
      { "label": "payment_latency",  "metrics": ["payment_p99_ms"],        "ceiling": 3000 },
      { "label": "cart_volume",      "metrics": ["cart_events_total"],      "ceiling": 0 }
    ]
  }'
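The same request can be assembled from a script. The sketch below builds the PUT request with Python's standard library without sending it; the endpoint, headers, and payload shape are taken from the curl example above, while `build_manual_config_request` and the token value are hypothetical.

```python
import json
import urllib.request

def build_manual_config_request(base_url, service_id, token, slots):
    """Build (but do not send) the PUT request for a manual embedding config."""
    body = json.dumps({"mode": "manual", "slots": slots}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/embedding/config/{service_id}",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

request = build_manual_config_request(
    "https://api.infrasage.dev", "payment-service", "example-token",
    [{"label": "payment_latency", "metrics": ["payment_p99_ms"], "ceiling": 3000}],
)
# To send it: urllib.request.urlopen(request)
```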

Global Custom Slots

Dims 21 onward are global custom slots shared across all services that do not have a per-service override. By default InfraSage reserves 8 slots (dims 21–28), configurable via MAX_CUSTOM_EMBEDDING_SLOTS. Operators can assign any metric to these slots via the API or the Vectorizer UI page.

# List current global slot assignments
GET /api/v1/embedding/global-slots

# Assign a metric to slot 0 (dim 21)
PUT /api/v1/embedding/global-slots
{
  "slots": [
    { "index": 0, "label": "redis_latency_ms", "metrics": ["redis_cmd_duration_p99"], "ceiling": 500 }
  ]
}
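The slot-to-dimension mapping is simple arithmetic: slot N lands at dim 21 + N. A client might validate slot assignments before sending them, as in this sketch. The helper names are hypothetical, and rejecting duplicate indexes is an assumption about sensible client behaviour rather than documented API behaviour.

```python
FIRST_CUSTOM_DIM = 21  # from this page: slot 0 maps to dim 21

def custom_slot_dim(slot_index: int, max_slots: int = 8) -> int:
    """Map a global custom slot index to its embedding dimension."""
    if not 0 <= slot_index < max_slots:
        raise ValueError(f"slot index {slot_index} outside 0..{max_slots - 1}")
    return FIRST_CUSTOM_DIM + slot_index

def validate_slots(slots, max_slots: int = 8):
    """Return {dim: label}, rejecting out-of-range or duplicate slot indexes."""
    dims = {}
    for slot in slots:
        dim = custom_slot_dim(slot["index"], max_slots)
        if dim in dims:
            raise ValueError(f"slot index {slot['index']} assigned twice")
        dims[dim] = slot["label"]
    return dims
```

With the default of 8 slots, valid indexes are 0 through 7, covering dims 21 through 28.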

Weirdness Scoring

The per-minute weirdness score for each service is the Euclidean distance between its current embedding and its embedding from the immediately preceding window:

score = sqrt( Σ (current_dim_i − previous_dim_i)² )

A service whose behaviour changes sharply from one minute to the next will have a high score. A service that is in a steady state will score near 0. Scores are stored in infrasage_anomaly_scores and consumed by the Watchdog.
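The formula translates directly into code. This is a minimal sketch of the distance calculation only; `weirdness_score` is a hypothetical name, and how the engine handles a missing previous window is not covered here.

```python
import math

def weirdness_score(current, previous):
    """Euclidean distance between two embeddings of equal length."""
    if len(current) != len(previous):
        raise ValueError("embeddings must have the same number of dimensions")
    return math.sqrt(sum((c - p) ** 2 for c, p in zip(current, previous)))
```

Identical consecutive embeddings give a score of exactly 0, and the score grows with the size of the change across all dimensions.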

The score is unbounded (not clamped to [0, 1]). The Watchdog applies a configurable Z-score threshold on top of each service's rolling score history to decide whether to raise an alert.
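A Z-score threshold of this kind could look like the sketch below. The threshold of 3.0, the use of the population standard deviation, and the treatment of short or flat histories are all assumptions for illustration; the Watchdog's actual window length and defaults are not stated on this page.

```python
from statistics import mean, pstdev

def z_score(score, history):
    """How many standard deviations `score` sits above the history mean."""
    if len(history) < 2:
        return 0.0  # assumption: too little history to judge
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0  # assumption: a perfectly flat history never alerts
    return (score - mean(history)) / sigma

def should_alert(score, history, threshold=3.0):
    """True when the score exceeds the (assumed) Z-score threshold."""
    return z_score(score, history) > threshold
```

Because the threshold is relative to each service's own history, a naturally noisy service needs a larger absolute score to trigger an alert than a quiet one.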

HNSW Index Parameters

InfraSage's ClickHouse infrasage_embeddings table can be configured with an ANN (Approximate Nearest Neighbor) index using the HNSW algorithm. This index is used for vector similarity search — for example, finding past incidents with embeddings similar to the current anomaly (used in RCA historical matching).

Parameter	Env var	Default	Description
`hnsw_m`	`VECTOR_HNSW_M`	`16`	Max connections per node — higher = better recall, more memory
`ef_construction`	`VECTOR_HNSW_EF_CONSTRUCTION`	`200`	Build-time beam width — higher = better quality, slower build
`search_top_k`	`VECTOR_SEARCH_TOP_K`	`3`	Top-K neighbours to retrieve at query time

These parameters do not affect the per-minute weirdness score calculation (which uses direct Euclidean distance). They affect ANN index quality for similarity search queries.
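For intuition, an ANN index approximates the result of an exact nearest-neighbour scan while avoiding a comparison against every stored vector. The sketch below is that exact brute-force baseline, not HNSW itself. It assumes Euclidean distance as the similarity metric, which this page does not confirm for the index, and `top_k_nearest` is a hypothetical name.

```python
import heapq
import math

def top_k_nearest(query, stored, k=3):
    """Exact k-nearest-neighbour search by Euclidean distance.

    stored: iterable of (id, vector) pairs.
    Returns [(distance, id), ...] with the nearest first.
    """
    distances = ((math.dist(query, vec), item_id) for item_id, vec in stored)
    return heapq.nsmallest(k, distances)
```

The default `k=3` mirrors VECTOR_SEARCH_TOP_K. Raising `hnsw_m` or `ef_construction` moves the index's answers closer to what this exhaustive scan would return, at the cost of memory and build time.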

Auto-Select Configuration

# Enable / disable global auto-select (default: true)
AUTO_EMBEDDING_SELECT_ENABLED=true

# How many days of history to score candidates against (default: 7)
AUTO_EMBEDDING_SELECT_LOOKBACK_DAYS=7

# Max distinct custom metric names in one processWindow SQL query (default: 64)
EMBEDDING_UNION_METRIC_CAP=64

# Reserved custom embedding slot count, default 8 (slots start at dim 21)
MAX_CUSTOM_EMBEDDING_SLOTS=8

# Enable 3 additional infrastructure dims 18-20 (default: false)
INFRASTRUCTURE_DIMENSIONS_ENABLED=false
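A tool that needs to mirror these settings could read them with the documented defaults, as in this sketch. The variable names and defaults come from the list above; the `AutoSelectConfig` class, the `load_config` helper, and the accepted boolean spellings are assumptions.

```python
import os
from dataclasses import dataclass

def _env_bool(env, name, default):
    """Parse a boolean env var; accepted spellings are an assumption."""
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() in ("1", "true", "yes")

@dataclass(frozen=True)
class AutoSelectConfig:
    enabled: bool
    lookback_days: int
    union_metric_cap: int
    max_custom_slots: int
    infrastructure_dims: bool

def load_config(env=os.environ) -> AutoSelectConfig:
    """Read the settings from a mapping, falling back to the documented defaults."""
    return AutoSelectConfig(
        enabled=_env_bool(env, "AUTO_EMBEDDING_SELECT_ENABLED", True),
        lookback_days=int(env.get("AUTO_EMBEDDING_SELECT_LOOKBACK_DAYS", "7")),
        union_metric_cap=int(env.get("EMBEDDING_UNION_METRIC_CAP", "64")),
        max_custom_slots=int(env.get("MAX_CUSTOM_EMBEDDING_SLOTS", "8")),
        infrastructure_dims=_env_bool(env, "INFRASTRUCTURE_DIMENSIONS_ENABLED", False),
    )
```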

Stats API

GET /api/v1/vectorizer/stats

{
  "dimensions": 18,
  "algorithm": "HNSW",
  "hnsw_m": 16,
  "ef_construction": 200,
  "search_top_k": 3,
  "interval_seconds": 60,
  "dimension_labels": [
    "avg_value", "max_value",
    "error_rate", "warn_rate",
    "log_ratio", "trace_ratio",
    "inbound_edges", "outbound_edges",
    "time_sin", "time_cos",
    "day_sin", "day_cos",
    "is_weekend",
    "p99_latency", "http_error_rate",
    "queue_depth_peak", "db_health",
    "fatal_rate"
  ],
  "auto_select_enabled": true,
  "auto_select_lookback_days": 7,
  "embedding_union_metric_cap": 64,
  "per_service_config_endpoint": "/api/v1/embedding/config/:service_id",
  "per_service_config_list_endpoint": "/api/v1/embedding/configs",
  "total_embeddings": 12353,
  "services_covered": 11,
  "last_window": "2026-05-13T10:23:00Z",
  "per_service_config_count": 3
}
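A monitoring script might sanity-check the stats payload after fetching it. The sketch below works on an already-parsed response. The field names come from the example above; `check_stats` is a hypothetical helper, and the expectation that the label count always equals `dimensions` is inferred from the example rather than guaranteed.

```python
def check_stats(stats: dict) -> list[str]:
    """Return a list of human-readable problems found in a stats payload."""
    problems = []
    labels = stats.get("dimension_labels", [])
    if len(labels) != stats.get("dimensions"):
        problems.append(
            f"dimensions={stats.get('dimensions')} but {len(labels)} labels returned"
        )
    if len(set(labels)) != len(labels):
        problems.append("duplicate dimension labels")
    if stats.get("services_covered", 0) == 0:
        problems.append("no services covered")
    return problems
```

An empty list means the payload passed every check in this sketch.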

Dim Aliases

The Vectorizer ships with human-readable aliases for every built-in dimension. These are used in log output, Prometheus metric labels, and the UI.

GET /api/v1/embedding/dim-aliases

Viewing Results

Vectorizer UI Page

Navigate to Vectorizer in the sidebar (operator role required).

Stats panel: current dimensions, HNSW parameters, embedding coverage
Per-service config table: mode, slot assignments, last updated
Config modal: edit or switch mode for any service; preview auto-select candidates
Global custom slots: assign custom metrics to dims 21–28

Via Grafana

The ml-quality.json dashboard (included in the dashboards/ directory) visualises:

Embeddings ingested per minute
Services covered over time
Weirdness score distribution

Troubleshooting

services_covered is 0 or lower than expected

Check that services are actively sending telemetry. The vectorizer only processes windows where at least one telemetry signal exists for the service.
Verify the vectorizer tick is running: look for `processing embedding window` in the aiops-engine logs.

Scores are always near 0 for a new service

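This is expected during the warm-up period (roughly the first 30 minutes of data). A new service needs enough baseline vectors before distance-based scoring is meaningful; note that the weirdness score uses Euclidean distance between consecutive windows, as described under Weirdness Scoring, not cosine distance. The Watchdog marks new services as warming_up and suppresses alerts during this period.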

A service has a permanently high weirdness score

The baseline may have been built from anomalous data. Reset the baseline by deleting that service's embeddings from ClickHouse:

ALTER TABLE infrasage_service_embeddings DELETE WHERE service_id = 'my-service';

The index will rebuild automatically on the next vectorizer cycle.
