Vectorizer
The Vectorizer is the embedding pipeline that converts raw telemetry signals into fixed-dimension numerical vectors, one per service per minute. These vectors power InfraSage's ML anomaly detection, causal analysis, and the pillar embedding system.
How It Works
ClickHouse telemetry_signals + aggregated_metrics
│ (GROUP BY service_id, 1-minute window)
▼
18-dimensional embedding vector
┌──────────────────────────────────────────────┐
│ Dims 0–1 : metric aggregates (avg, max) │
│ Dims 2–5 : signal ratios (error, warn, │
│ log, trace) │
│ Dims 6–7 : topology (inbound, outbound) │
│ Dims 8–12 : time features (sin/cos + flag) │
│ Dims 13–17 : named metrics (p99, http err, │
│ queue depth, db health, fatal) │
└──────────────────────────────────────────────┘
│
▼
Euclidean distance vs. previous window embedding
│
▼
Weirdness score [0, ∞) per service
(stored in infrasage_anomaly_scores)
│
▼ (async)
Pillar embedding workers
(one goroutine per active pillar)
The pipeline runs every 60 seconds (configurable via VECTORIZER_INTERVAL_SECONDS). Scores are stored in ClickHouse and picked up by the Watchdog on the next poll cycle.
Embedding Dimensions
Default 18-dimensional layout
| Dim index | Name | Source | Scale |
|---|---|---|---|
| 0 | avg_value | Average value across all aggregated metrics for the window | linear, ceiling 1000 |
| 1 | max_value | Peak metric value for the window | linear, ceiling 5000 |
| 2 | error_rate | error_count / total events (log + trace + metric + event) | linear, ceiling 1 |
| 3 | warn_rate | warn_count / total events | linear, ceiling 1 |
| 4 | log_ratio | log_count / total events | linear, ceiling 1 |
| 5 | trace_ratio | trace_count / total events | linear, ceiling 1 |
| 6 | inbound_edges | Inbound service-topology edges | log-scale, ceiling 10M |
| 7 | outbound_edges | Outbound service-topology edges | log-scale, ceiling 10M |
| 8 | time_sin | sin(2π × second_of_day / 86400) — hour-of-day | cyclic |
| 9 | time_cos | cos(2π × second_of_day / 86400) | cyclic |
| 10 | day_sin | sin(2π × weekday / 7) — day-of-week | cyclic |
| 11 | day_cos | cos(2π × weekday / 7) | cyclic |
| 12 | is_weekend | 1 if Saturday/Sunday, 0 otherwise | binary |
| 13 | p99_latency | Average p99 latency (matched via dim-alias list) | linear, ceiling 5000 ms |
| 14 | http_error_rate | Average HTTP error rate (matched via dim-alias list) | linear, ceiling 100 |
| 15 | queue_depth_peak | Peak queue depth (matched via dim-alias list) | linear, ceiling 10 000 |
| 16 | db_health | Average DB health score (matched via dim-alias list) | linear, ceiling 10 |
| 17 | fatal_rate | fatal_count / log_count | linear, ceiling 1 |
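The scale column can be read as a per-dimension normalisation step. The sketch below is illustrative only: the clamping of linear values to [0, 1], the log1p-based log scaling, and the Monday=0 weekday numbering are assumptions, and the helper names are hypothetical. Only the sin/cos formulas and the ceilings come from the table.

```python
import math

def linear(value, ceiling):
    """Scale value by its ceiling and clamp to [0, 1] (assumed normalisation)."""
    if ceiling <= 0:
        return 0.0
    return min(max(value / ceiling, 0.0), 1.0)

def log_scale(value, ceiling=10_000_000):
    """Compress large edge counts with log1p, normalised by the ceiling (assumed)."""
    return min(math.log1p(max(value, 0.0)) / math.log1p(ceiling), 1.0)

def time_features(second_of_day, weekday):
    """Dims 8-12: cyclic time-of-day, cyclic day-of-week, and the weekend flag.

    weekday is assumed to run 0..6 with Monday=0, so 5 and 6 are the weekend.
    """
    day_angle = 2 * math.pi * second_of_day / 86400
    week_angle = 2 * math.pi * weekday / 7
    is_weekend = 1.0 if weekday in (5, 6) else 0.0
    return [
        math.sin(day_angle), math.cos(day_angle),
        math.sin(week_angle), math.cos(week_angle),
        is_weekend,
    ]
```

Encoding time as a sin/cos pair keeps 23:59 and 00:00 close together in the vector space, which a raw hour number would not.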
Three additional infrastructure dimensions are appended when INFRASTRUCTURE_DIMENSIONS_ENABLED=true:
| Dim index | Name | Description | Scale |
|---|---|---|---|
| 18 | disk_utilization | Disk utilization percentage | linear, ceiling 100 |
| 19 | network_error_rate | Network error rate percentage | linear, ceiling 100 |
| 20 | connection_pool_utilization | DB connection pool utilization % | linear, ceiling 100 |
Operator-defined custom slots are appended after the base dims (starting at dim 21 by default, up to MAX_CUSTOM_EMBEDDING_SLOTS slots — 8 by default):
| Dim index | Name | Source |
|---|---|---|
| 21+N | custom_slot_N | Operator-defined metric (configurable label, metrics, ceiling) |
Dims 8–12 (cyclic time features) are included in the dimensions count and in the dimension_labels list returned by the stats API.
Per-Service Embedding Configuration
By default every service uses the global embedding configuration. You can override individual services with a custom set of metrics via the UI (Vectorizer page) or API.
Modes
| Mode | Behaviour |
|---|---|
| auto | InfraSage auto-selects the highest-signal metrics for this service from the last N days |
| manual | You specify exactly which metrics map to which embedding slots |
| disabled | Service is excluded from the embedding pipeline |
Auto-select
Auto-select runs a metric scoring algorithm that ranks candidate metrics by:
- Coefficient of Variation (CV) — metrics with more natural variance carry more information
- Coverage — how consistently the metric is reported across windows
- Informativeness score — combined weighted rank
# Preview which metrics auto-select would choose for a service
GET /api/v1/embedding/config/payment-service/auto-preview
Response includes the ranked candidate list and which metrics would be selected with their suggested ceilings.
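The ranking criteria in this section can be sketched as follows. This is not InfraSage's actual scoring code: the equal weights, the cap on CV at 1.0, and combining the two terms as a weighted sum are all assumptions, and every function name is hypothetical.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation relative to the mean; 0 for a zero-mean series."""
    m = mean(values)
    return pstdev(values) / abs(m) if m != 0 else 0.0

def coverage(values, total_windows):
    """Fraction of windows in which the metric was reported."""
    return len(values) / total_windows if total_windows else 0.0

def informativeness(values, total_windows, cv_weight=0.5, coverage_weight=0.5):
    """Weighted combination of CV (capped at 1.0) and coverage (assumed weights)."""
    cv = min(coefficient_of_variation(values), 1.0)
    return cv_weight * cv + coverage_weight * coverage(values, total_windows)

def rank_candidates(series_by_metric, total_windows):
    """Return metric names ordered from most to least informative."""
    scored = {
        name: informativeness(values, total_windows)
        for name, values in series_by_metric.items()
        if values  # skip metrics with no samples in the lookback period
    }
    return sorted(scored, key=scored.get, reverse=True)
```

Under these assumptions, a metric that never changes scores only on coverage, so it ranks below a metric with the same coverage and real variance.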
Manual configuration
curl -X PUT https://api.infrasage.dev/api/v1/embedding/config/payment-service \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"mode": "manual",
"slots": [
{ "label": "checkout_errors", "metrics": ["checkout_error_rate"], "ceiling": 1 },
{ "label": "payment_latency", "metrics": ["payment_p99_ms"], "ceiling": 3000 },
{ "label": "cart_volume", "metrics": ["cart_events_total"], "ceiling": 0 }
]
}'
Global Custom Slots
Dims 21 onward are global custom slots shared across all services that do not have a per-service override. By default InfraSage reserves 8 slots (dims 21–28), configurable via MAX_CUSTOM_EMBEDDING_SLOTS. Operators can assign any metric to these slots via the API or the Vectorizer UI page.
# List current global slot assignments
GET /api/v1/embedding/global-slots
# Assign a metric to slot 0 (dim 21)
PUT /api/v1/embedding/global-slots
{
"slots": [
{ "index": 0, "label": "redis_latency_ms", "metrics": ["redis_cmd_duration_p99"], "ceiling": 500 }
]
}
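The mapping between a global slot index and its embedding dimension follows from the numbers in this section (slot 0 is dim 21, and 8 slots cover dims 21-28). A minimal sketch, with hypothetical helper names:

```python
BASE_CUSTOM_DIM = 21  # first custom slot dimension, as stated in this section

def slot_to_dim(slot_index, max_slots=8):
    """Map a global slot index to its embedding dimension index."""
    if not 0 <= slot_index < max_slots:
        raise ValueError(f"slot index {slot_index} is outside 0..{max_slots - 1}")
    return BASE_CUSTOM_DIM + slot_index

def custom_dim_range(max_slots=8):
    """Return the inclusive (first, last) dimension indices for custom slots."""
    return BASE_CUSTOM_DIM, BASE_CUSTOM_DIM + max_slots - 1
```

Here max_slots stands in for MAX_CUSTOM_EMBEDDING_SLOTS; whether the server rejects out-of-range indices the same way is an assumption.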
Weirdness Scoring
The per-minute weirdness score for each service is the Euclidean distance between its current embedding and its embedding from the immediately preceding window:
score = sqrt( Σ (current_dim_i − previous_dim_i)² )
A service whose behaviour changes sharply from one minute to the next will have a high score. A service that is in a steady state will score near 0. Scores are stored in infrasage_anomaly_scores and consumed by the Watchdog.
The score is unbounded (not clamped to [0, 1]). The Watchdog applies a configurable Z-score threshold on top of each service's rolling score history to decide whether to raise an alert.
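As a rough illustration of the two steps, the sketch below computes the Euclidean distance from the formula in this section and then applies a Z-score check against a score history. The threshold of 3.0, the minimum history length, and the handling of a zero-variance history are assumptions, not documented Watchdog behaviour.

```python
import math
from statistics import mean, pstdev

def weirdness_score(current, previous):
    """Euclidean distance between the current and previous window embeddings."""
    if len(current) != len(previous):
        raise ValueError("embeddings must have the same number of dimensions")
    return math.sqrt(sum((c - p) ** 2 for c, p in zip(current, previous)))

def is_anomalous(score, history, z_threshold=3.0):
    """Flag a score whose Z-score against the rolling history exceeds the threshold."""
    if len(history) < 2:
        return False  # not enough history to estimate a spread
    sigma = pstdev(history)
    if sigma == 0:
        return False  # no variance to scale by; treated as not anomalous in this sketch
    return (score - mean(history)) / sigma > z_threshold
```

Because the raw score is unbounded, comparing it with each service's own history lets a naturally noisy service and a very stable one share the same threshold setting.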
HNSW Index Parameters
InfraSage's ClickHouse infrasage_embeddings table can be configured with an ANN (Approximate Nearest Neighbor) index using the HNSW algorithm. This index is used for vector similarity search — for example, finding past incidents with embeddings similar to the current anomaly (used in RCA historical matching).
| Parameter | Env var | Default | Description |
|---|---|---|---|
| hnsw_m | VECTOR_HNSW_M | 16 | Max connections per node — higher = better recall, more memory |
| ef_construction | VECTOR_HNSW_EF_CONSTRUCTION | 200 | Build-time beam width — higher = better quality, slower build |
| search_top_k | VECTOR_SEARCH_TOP_K | 3 | Top-K neighbours to retrieve at query time |
These parameters do not affect the per-minute weirdness score calculation (which uses direct Euclidean distance). They affect ANN index quality for similarity search queries.
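To make search_top_k concrete, the sketch below performs an exact brute-force nearest-neighbour search, which is the result an HNSW index approximates at lower cost. It is not how InfraSage queries ClickHouse; the use of Euclidean distance for similarity search and the function name are assumptions.

```python
import heapq
import math

def top_k_neighbours(query, stored, k=3):
    """Return the keys of the k stored vectors closest to the query vector.

    stored maps an identifier (for example an incident ID) to its embedding.
    """
    distances = ((math.dist(query, vector), key) for key, vector in stored.items())
    return [key for _, key in heapq.nsmallest(k, distances)]
```

An exact scan costs one distance computation per stored vector, which is why an approximate index becomes attractive as the embeddings table grows.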
Auto-Select Configuration
# Enable / disable global auto-select (default: true)
AUTO_EMBEDDING_SELECT_ENABLED=true
# How many days of history to score candidates against (default: 7)
AUTO_EMBEDDING_SELECT_LOOKBACK_DAYS=7
# Max distinct custom metric names in one processWindow SQL query (default: 64)
EMBEDDING_UNION_METRIC_CAP=64
# Reserved custom embedding slot count, default 8 (slots start at dim 21)
MAX_CUSTOM_EMBEDDING_SLOTS=8
# Enable 3 additional infrastructure dims 18-20 (default: false)
INFRASTRUCTURE_DIMENSIONS_ENABLED=false
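For tooling that needs to mirror these settings, a loader along the following lines applies the documented defaults when a variable is unset. The class and function names are hypothetical, and the accepted boolean spellings ("1", "true", "yes") are an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class VectorizerConfig:
    interval_seconds: int = 60
    auto_select_enabled: bool = True
    lookback_days: int = 7
    union_metric_cap: int = 64
    max_custom_slots: int = 8
    infrastructure_dims_enabled: bool = False

def _as_bool(raw):
    """Interpret common truthy spellings; anything else is False (assumed)."""
    return str(raw).strip().lower() in {"1", "true", "yes"}

def load_config(env: Mapping[str, str] = os.environ) -> VectorizerConfig:
    """Build a config from environment variables, falling back to defaults."""
    d = VectorizerConfig()
    return VectorizerConfig(
        interval_seconds=int(env.get("VECTORIZER_INTERVAL_SECONDS", d.interval_seconds)),
        auto_select_enabled=_as_bool(env.get("AUTO_EMBEDDING_SELECT_ENABLED", d.auto_select_enabled)),
        lookback_days=int(env.get("AUTO_EMBEDDING_SELECT_LOOKBACK_DAYS", d.lookback_days)),
        union_metric_cap=int(env.get("EMBEDDING_UNION_METRIC_CAP", d.union_metric_cap)),
        max_custom_slots=int(env.get("MAX_CUSTOM_EMBEDDING_SLOTS", d.max_custom_slots)),
        infrastructure_dims_enabled=_as_bool(
            env.get("INFRASTRUCTURE_DIMENSIONS_ENABLED", d.infrastructure_dims_enabled)
        ),
    )
```

Passing a plain dictionary instead of os.environ makes the loader easy to exercise in tests.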
Stats API
GET /api/v1/vectorizer/stats
{
"dimensions": 18,
"algorithm": "HNSW",
"hnsw_m": 16,
"ef_construction": 200,
"search_top_k": 3,
"interval_seconds": 60,
"dimension_labels": [
"avg_value", "max_value",
"error_rate", "warn_rate",
"log_ratio", "trace_ratio",
"inbound_edges", "outbound_edges",
"time_sin", "time_cos",
"day_sin", "day_cos",
"is_weekend",
"p99_latency", "http_error_rate",
"queue_depth_peak", "db_health",
"fatal_rate"
],
"auto_select_enabled": true,
"auto_select_lookback_days": 7,
"embedding_union_metric_cap": 64,
"per_service_config_endpoint": "/api/v1/embedding/config/:service_id",
"per_service_config_list_endpoint": "/api/v1/embedding/configs",
"total_embeddings": 12353,
"services_covered": 11,
"last_window": "2026-05-13T10:23:00Z",
"per_service_config_count": 3
}
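A client can sanity-check this response before relying on it. The sketch below assumes that dimensions always equals the length of dimension_labels, as it does in the sample response; the function name and the derived embeddings_per_service figure are illustrative only.

```python
import json

def summarise_stats(payload):
    """Parse a stats response body and verify its dimension metadata is consistent."""
    stats = json.loads(payload)
    labels = stats["dimension_labels"]
    if stats["dimensions"] != len(labels):
        raise ValueError(
            f"dimensions={stats['dimensions']} but {len(labels)} labels were returned"
        )
    services = stats["services_covered"]
    return {
        "dimensions": stats["dimensions"],
        "labels": labels,
        # Average embeddings stored per covered service; 0.0 when none are covered.
        "embeddings_per_service": stats["total_embeddings"] / services if services else 0.0,
    }
```

A mismatch between the count and the labels would suggest the client and server disagree about which optional dimensions are enabled.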
Dim Aliases
The Vectorizer ships with human-readable aliases for every built-in dimension. These are used in log output, Prometheus metric labels, and the UI.
GET /api/v1/embedding/dim-aliases
Viewing Results
Vectorizer UI Page
Navigate to Vectorizer in the sidebar (operator role required).
- Stats panel — current dimensions, HNSW parameters, embedding coverage
- Per-service config table — mode, slot assignments, last updated
- Config modal — edit or switch mode for any service; preview auto-select candidates
- Global custom slots — assign custom metrics to dims 21–28
Via Grafana
The ml-quality.json dashboard (included in the dashboards/ directory) visualises:
- Embeddings ingested per minute
- Services covered over time
- Weirdness score distribution
Troubleshooting
services_covered is 0 or lower than expected
- Check that services are actively sending telemetry. The vectorizer only processes windows where at least one telemetry signal exists for the service.
- Verify the vectorizer tick is running: look for "processing embedding window" in the aiops-engine logs.
Scores are always near 0 for a new service
This is expected during the warm-up period (first ~30 minutes of data). The score compares each embedding with the previous window's embedding using Euclidean distance, and the Watchdog needs a rolling history of scores before its Z-score threshold is meaningful. The Watchdog marks new services as warming_up and suppresses alerts during this period.
A service has a permanently high weirdness score
The baseline may have been built from anomalous data. Reset the baseline by deleting that service's embeddings from ClickHouse:
ALTER TABLE infrasage_service_embeddings DELETE WHERE service_id = 'my-service';
The index will rebuild automatically on the next vectorizer cycle.