Vectorizer

The Vectorizer is the embedding pipeline that converts raw telemetry signals into fixed-dimension numerical vectors, one per service per minute. These vectors power InfraSage's ML anomaly detection, causal analysis, and the pillar embedding system.


How It Works

ClickHouse telemetry_signals + aggregated_metrics
        │  (GROUP BY service_id, 1-minute window)
        ▼
18-dimensional embedding vector
┌──────────────────────────────────────────────┐
│ Dims 0–1   : metric aggregates (avg, max)    │
│ Dims 2–5   : signal ratios (error, warn,     │
│              log, trace)                     │
│ Dims 6–7   : topology (inbound, outbound)    │
│ Dims 8–12  : time features (sin/cos + flag)  │
│ Dims 13–17 : named metrics (p99, http err,   │
│              queue depth, db health, fatal)  │
└──────────────────────────────────────────────┘
        │
        ▼
Euclidean distance vs. previous window embedding
        │
        ▼
Weirdness score [0, ∞) per service
(stored in infrasage_anomaly_scores)
        │
        ▼ (async)
Pillar embedding workers
(one goroutine per active pillar)

The pipeline runs every 60 seconds (configurable via VECTORIZER_INTERVAL_SECONDS). Scores are stored in ClickHouse and picked up by the Watchdog on the next poll cycle.


Embedding Dimensions

Default 18-dimensional layout

| Dim index | Name | Source | Scale |
|-----------|------|--------|-------|
| 0 | avg_value | Average value across all aggregated metrics for the window | linear, ceiling 1000 |
| 1 | max_value | Peak metric value for the window | linear, ceiling 5000 |
| 2 | error_rate | error_count / (log + trace + metric + event) total | linear, ceiling 1 |
| 3 | warn_rate | warn_count / total events | linear, ceiling 1 |
| 4 | log_ratio | log_count / total events | linear, ceiling 1 |
| 5 | trace_ratio | trace_count / total events | linear, ceiling 1 |
| 6 | inbound_edges | Inbound service-topology edges | log-scale, ceiling 10M |
| 7 | outbound_edges | Outbound topology edges | log-scale, ceiling 10M |
| 8 | time_sin | sin(2π × second_of_day / 86400) (hour-of-day) | cyclic |
| 9 | time_cos | cos(2π × second_of_day / 86400) | cyclic |
| 10 | day_sin | sin(2π × weekday / 7) (day-of-week) | cyclic |
| 11 | day_cos | cos(2π × weekday / 7) | cyclic |
| 12 | is_weekend | 1 if Saturday/Sunday, 0 otherwise | binary |
| 13 | p99_latency | Average p99 latency (matched via dim-alias list) | linear, ceiling 5000 ms |
| 14 | http_error_rate | Average HTTP error rate (matched via dim-alias list) | linear, ceiling 100 |
| 15 | queue_depth_peak | Peak queue depth (matched via dim-alias list) | linear, ceiling 10,000 |
| 16 | db_health | Average DB health score (matched via dim-alias list) | linear, ceiling 10 |
| 17 | fatal_rate | fatal_count / log_count | linear, ceiling 1 |
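
The Scale column can be read as a normalisation of each raw value into [0, 1]. The sketch below is one plausible reading, not the confirmed implementation: it assumes linear dims are divided by their ceiling and clamped, and that log-scale dims use log1p of the value over log1p of the ceiling. The function names scale_linear and scale_log are hypothetical.

```python
import math

def scale_linear(value: float, ceiling: float) -> float:
    """Assumed linear scaling: divide by the ceiling and clamp to [0, 1]."""
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    return min(max(value / ceiling, 0.0), 1.0)

def scale_log(value: float, ceiling: float) -> float:
    """Assumed log scaling: log1p(value) / log1p(ceiling), clamped to [0, 1]."""
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    clamped = min(max(value, 0.0), ceiling)
    return math.log1p(clamped) / math.log1p(ceiling)

# p99_latency of 1250 ms against the documented 5000 ms ceiling
print(scale_linear(1250, 5000))   # 0.25
# Values above the ceiling saturate at the top of the range
print(scale_linear(9000, 5000))   # 1.0
# 12 inbound edges against the documented 10M ceiling
print(round(scale_log(12, 10_000_000), 3))
```

Under this reading, log scaling keeps small edge counts distinguishable even though the ceiling is very large, which is a common reason to choose it for topology counts.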

Three additional infrastructure dimensions are appended when INFRASTRUCTURE_DIMENSIONS_ENABLED=true:

| Dim index | Name | Description | Scale |
|-----------|------|-------------|-------|
| 18 | disk_utilization | Disk utilization percentage | linear, ceiling 100 |
| 19 | network_error_rate | Network error rate percentage | linear, ceiling 100 |
| 20 | connection_pool_utilization | DB connection pool utilization % | linear, ceiling 100 |

Operator-defined custom slots are appended after the base dims (starting at dim 21 by default, up to MAX_CUSTOM_EMBEDDING_SLOTS slots — 8 by default):

| Dim index | Name | Source |
|-----------|------|--------|
| 21+N | custom_slot_N | Operator-defined metric (configurable label, metrics, ceiling) |

Dims 8–12 (the four cyclic time features plus the is_weekend flag) are included in the dimensions count and in the dimension_labels list returned by the stats API.
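
As an illustration of dims 8–12, the following sketch derives the five time features from a timestamp using the formulas in the table. Two details are assumptions: the timestamp is interpreted in UTC, and weekday is numbered Monday=0 through Sunday=6 (Python's convention); the source does not state either. The function name time_features is hypothetical.

```python
import math
from datetime import datetime, timezone

def time_features(ts: datetime) -> list:
    """Return [time_sin, time_cos, day_sin, day_cos, is_weekend] for a timestamp."""
    second_of_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    day_angle = 2 * math.pi * second_of_day / 86400
    # Assumption: weekday numbering is Monday=0 .. Sunday=6.
    weekday = ts.weekday()
    week_angle = 2 * math.pi * weekday / 7
    is_weekend = 1.0 if weekday >= 5 else 0.0
    return [math.sin(day_angle), math.cos(day_angle),
            math.sin(week_angle), math.cos(week_angle), is_weekend]

# 06:00 UTC on Saturday 2026-05-16: a quarter of the way around the daily circle,
# so time_sin is 1 and time_cos is (numerically) 0.
features = time_features(datetime(2026, 5, 16, 6, 0, tzinfo=timezone.utc))
print([round(f, 3) for f in features])
```

Encoding each cycle as a sin/cos pair means 23:59 and 00:00 land next to each other in the embedding, which a single linear "hour" value would not achieve.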


Per-Service Embedding Configuration

By default every service uses the global embedding configuration. You can override individual services with a custom set of metrics via the UI (Vectorizer page) or API.

Modes

| Mode | Behaviour |
|------|-----------|
| auto | InfraSage auto-selects the highest-signal metrics for this service from the last N days |
| manual | You specify exactly which metrics map to which embedding slots |
| disabled | Service is excluded from the embedding pipeline |

Auto-select

Auto-select runs a metric scoring algorithm that ranks candidate metrics by:

  • Coefficient of Variation (CV): metrics with more natural variance carry more information
  • Coverage: how consistently the metric is reported across windows
  • Informativeness score: combined weighted rank

# Preview which metrics auto-select would choose for a service
GET /api/v1/embedding/config/payment-service/auto-preview

The response includes the ranked candidate list and the metrics that would be selected, together with their suggested ceilings.
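
To make the ranking criteria concrete, here is a minimal sketch of CV and coverage for a metric observed over a series of windows, where None marks a window with no report. The combined score is only a stand-in: the source describes a "combined weighted rank" without giving the formula, so the weighted sum, the equal weights, and the cap on CV are all assumptions, as are the sample metric names.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence

def coefficient_of_variation(values: Sequence[float]) -> float:
    """Population standard deviation divided by the absolute mean (0 if undefined)."""
    if not values:
        return 0.0
    mu = mean(values)
    return pstdev(values) / abs(mu) if mu != 0 else 0.0

def coverage(values_by_window: Sequence[Optional[float]]) -> float:
    """Fraction of windows in which the metric was reported at all."""
    if not values_by_window:
        return 0.0
    return sum(v is not None for v in values_by_window) / len(values_by_window)

def informativeness(values_by_window, cv_weight=0.5, coverage_weight=0.5) -> float:
    """Hypothetical combined score: weighted sum of capped CV and coverage."""
    reported = [v for v in values_by_window if v is not None]
    cv = min(coefficient_of_variation(reported), 1.0)  # cap is an assumption
    return cv_weight * cv + coverage_weight * coverage(values_by_window)

candidates = {
    "payment_p99_ms": [120, 180, 95, 400, 150, 130],  # varies, always reported
    "build_info": [1, 1, 1, 1, 1, 1],                 # constant: zero variance
    "rare_gauge": [None, None, 7, None, None, None],  # reported in 1 of 6 windows
}
ranked = sorted(candidates, key=lambda m: informativeness(candidates[m]), reverse=True)
print(ranked)  # ['payment_p99_ms', 'build_info', 'rare_gauge']
```

The constant metric ranks below the varying one despite full coverage, and the sparsely reported metric ranks last, which matches the intent of the two criteria.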

Manual configuration

curl -X PUT https://api.infrasage.dev/api/v1/embedding/config/payment-service \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "manual",
    "slots": [
      { "label": "checkout_errors", "metrics": ["checkout_error_rate"], "ceiling": 1 },
      { "label": "payment_latency", "metrics": ["payment_p99_ms"], "ceiling": 3000 },
      { "label": "cart_volume", "metrics": ["cart_events_total"], "ceiling": 0 }
    ]
  }'

Global Custom Slots

Dims 21 onward are global custom slots shared across all services that do not have a per-service override. By default InfraSage reserves 8 slots (dims 21–28), configurable via MAX_CUSTOM_EMBEDDING_SLOTS. Operators can assign any metric to these slots via the API or the Vectorizer UI page.

# List current global slot assignments
GET /api/v1/embedding/global-slots

# Assign a metric to slot 0 (dim 21)
PUT /api/v1/embedding/global-slots
{
  "slots": [
    { "index": 0, "label": "redis_latency_ms", "metrics": ["redis_cmd_duration_p99"], "ceiling": 500 }
  ]
}
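
The relationship between a slot index in this request and the embedding dimension it occupies can be sketched as follows. The helper slot_to_dim is hypothetical; it assumes the documented default layout in which custom slots start at dim 21, with max_slots mirroring MAX_CUSTOM_EMBEDDING_SLOTS.

```python
FIRST_CUSTOM_DIM = 21  # documented default start of the custom slots

def slot_to_dim(slot_index: int, max_slots: int = 8) -> int:
    """Map a global slot index to its embedding dim (default layout assumed)."""
    if not 0 <= slot_index < max_slots:
        raise ValueError(f"slot index {slot_index} outside 0..{max_slots - 1}")
    return FIRST_CUSTOM_DIM + slot_index

print(slot_to_dim(0))  # 21
print(slot_to_dim(7))  # 28
```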

Weirdness Scoring

The per-minute weirdness score for each service is the Euclidean distance between its current embedding and its embedding from the immediately preceding window:

score = sqrt( Σ (current_dim_i − previous_dim_i)² )

A service whose behaviour changes sharply from one minute to the next will have a high score. A service that is in a steady state will score near 0. Scores are stored in infrasage_anomaly_scores and consumed by the Watchdog.
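
The formula translates directly into code. The sketch below uses short four-dimensional vectors with made-up values purely for readability; real embeddings have 18 or more dimensions. The function name weirdness_score is hypothetical.

```python
import math

def weirdness_score(current: list, previous: list) -> float:
    """Euclidean distance between the current and previous window embeddings."""
    if len(current) != len(previous):
        raise ValueError("embeddings must have the same number of dimensions")
    return math.sqrt(sum((c - p) ** 2 for c, p in zip(current, previous)))

previous = [0.10, 0.20, 0.01, 0.00]
steady = [0.10, 0.21, 0.01, 0.00]  # one dim moved slightly
spike = [0.40, 0.60, 0.01, 0.00]   # two dims jumped

print(round(weirdness_score(steady, previous), 3))  # 0.01
print(round(weirdness_score(spike, previous), 3))   # 0.5
```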

The score is unbounded (not clamped to [0, 1]). The Watchdog applies a configurable Z-score threshold on top of each service's rolling score history to decide whether to raise an alert.
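
A Z-score check over a rolling history might look like the sketch below. It is an illustration of the idea rather than the Watchdog's actual logic: the threshold of 3.0, the one-sided comparison (only unusually high scores count), and the handling of a perfectly flat history are all assumptions.

```python
from statistics import mean, pstdev

def is_anomalous(history: list, latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest score if it lies more than z_threshold std devs above the mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return latest > mu  # assumption: with a flat history, any increase stands out
    return (latest - mu) / sigma > z_threshold

history = [0.02, 0.03, 0.02, 0.04, 0.03, 0.02, 0.03]
print(is_anomalous(history, 0.03))  # False
print(is_anomalous(history, 0.50))  # True
```

Because the threshold is relative to each service's own history, a naturally noisy service and a very quiet one can share the same setting without the noisy one alerting constantly.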


HNSW Index Parameters

InfraSage's ClickHouse infrasage_embeddings table can be configured with an ANN (Approximate Nearest Neighbor) index using the HNSW algorithm. This index is used for vector similarity search — for example, finding past incidents with embeddings similar to the current anomaly (used in RCA historical matching).

| Parameter | Env var | Default | Description |
|-----------|---------|---------|-------------|
| hnsw_m | VECTOR_HNSW_M | 16 | Max connections per node; higher = better recall, more memory |
| ef_construction | VECTOR_HNSW_EF_CONSTRUCTION | 200 | Build-time beam width; higher = better quality, slower build |
| search_top_k | VECTOR_SEARCH_TOP_K | 3 | Top-K neighbours to retrieve at query time |

These parameters do not affect the per-minute weirdness score calculation (which uses direct Euclidean distance). They affect ANN index quality for similarity search queries.
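
For intuition about what the index computes, the sketch below performs the exact top-K search that an HNSW index approximates. It is a brute-force scan and does not use ClickHouse or HNSW at all. The distance metric used by the real index is not stated here, so Euclidean distance is an assumption, and the incident keys and vectors are made up.

```python
import heapq
import math

def top_k_neighbours(query, stored, k=3):
    """Exact k-nearest-neighbour search by Euclidean distance over all stored vectors."""
    scored = ((math.dist(query, vec), key) for key, vec in stored.items())
    return heapq.nsmallest(k, scored)

stored = {
    "incident-a": [0.9, 0.1, 0.0],
    "incident-b": [0.1, 0.8, 0.1],
    "incident-c": [0.85, 0.15, 0.05],
    "incident-d": [0.0, 0.0, 1.0],
}
for distance, key in top_k_neighbours([0.88, 0.12, 0.02], stored, k=2):
    print(key, round(distance, 3))
```

An ANN index trades a small chance of missing one of these exact neighbours for much faster queries on large tables; hnsw_m and ef_construction control how small that chance is.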


Auto-Select Configuration

# Enable / disable global auto-select (default: true)
AUTO_EMBEDDING_SELECT_ENABLED=true

# How many days of history to score candidates against (default: 7)
AUTO_EMBEDDING_SELECT_LOOKBACK_DAYS=7

# Max distinct custom metric names in one processWindow SQL query (default: 64)
EMBEDDING_UNION_METRIC_CAP=64

# Reserved custom embedding slot count, default 8 (slots start at dim 21)
MAX_CUSTOM_EMBEDDING_SLOTS=8

# Enable 3 additional infrastructure dims 18-20 (default: false)
INFRASTRUCTURE_DIMENSIONS_ENABLED=false
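
For tooling that needs to mirror these settings, a loader with the documented defaults could be sketched as below. The class and helper names are hypothetical, and accepting "1" and "yes" as true values is an assumption; the source only shows true and false.

```python
import os
from dataclasses import dataclass

def _env_bool(name: str, default: bool) -> bool:
    raw = os.environ.get(name)
    # Assumption: "1", "true", "yes" (any case) count as true.
    return default if raw is None else raw.strip().lower() in ("1", "true", "yes")

def _env_int(name: str, default: int) -> int:
    raw = os.environ.get(name)
    return default if raw is None else int(raw)

@dataclass(frozen=True)
class AutoSelectConfig:
    auto_select_enabled: bool
    lookback_days: int
    union_metric_cap: int
    max_custom_slots: int
    infrastructure_dims_enabled: bool

def load_config() -> AutoSelectConfig:
    """Read the documented variables, falling back to the documented defaults."""
    return AutoSelectConfig(
        auto_select_enabled=_env_bool("AUTO_EMBEDDING_SELECT_ENABLED", True),
        lookback_days=_env_int("AUTO_EMBEDDING_SELECT_LOOKBACK_DAYS", 7),
        union_metric_cap=_env_int("EMBEDDING_UNION_METRIC_CAP", 64),
        max_custom_slots=_env_int("MAX_CUSTOM_EMBEDDING_SLOTS", 8),
        infrastructure_dims_enabled=_env_bool("INFRASTRUCTURE_DIMENSIONS_ENABLED", False),
    )

print(load_config())
```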

Stats API

GET /api/v1/vectorizer/stats
{
  "dimensions": 18,
  "algorithm": "HNSW",
  "hnsw_m": 16,
  "ef_construction": 200,
  "search_top_k": 3,
  "interval_seconds": 60,
  "dimension_labels": [
    "avg_value", "max_value",
    "error_rate", "warn_rate",
    "log_ratio", "trace_ratio",
    "inbound_edges", "outbound_edges",
    "time_sin", "time_cos",
    "day_sin", "day_cos",
    "is_weekend",
    "p99_latency", "http_error_rate",
    "queue_depth_peak", "db_health",
    "fatal_rate"
  ],
  "auto_select_enabled": true,
  "auto_select_lookback_days": 7,
  "embedding_union_metric_cap": 64,
  "per_service_config_endpoint": "/api/v1/embedding/config/:service_id",
  "per_service_config_list_endpoint": "/api/v1/embedding/configs",
  "total_embeddings": 12353,
  "services_covered": 11,
  "last_window": "2026-05-13T10:23:00Z",
  "per_service_config_count": 3
}
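
A monitoring script could sanity-check a stats response along these lines. This is a sketch under the assumption that dimensions always equals the length of dimension_labels, as it does in the sample response; the function name and the trimmed payload are illustrative.

```python
import json

def check_stats(payload: str) -> list:
    """Return human-readable problems found in a stats response body."""
    stats = json.loads(payload)
    problems = []
    if stats["dimensions"] != len(stats["dimension_labels"]):
        problems.append("dimensions does not match the number of dimension_labels")
    if stats["services_covered"] == 0:
        problems.append("no services covered")
    return problems

# Trimmed payload with only the fields the check reads
sample = json.dumps({
    "dimensions": 3,
    "dimension_labels": ["avg_value", "max_value", "error_rate"],
    "services_covered": 11,
})
print(check_stats(sample))  # []
```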

Dim Aliases

The Vectorizer ships with human-readable aliases for every built-in dimension. These are used in log output, Prometheus metric labels, and the UI.

GET /api/v1/embedding/dim-aliases

Viewing Results

Vectorizer UI Page

Navigate to Vectorizer in the sidebar (operator role required).

  • Stats panel: current dimensions, HNSW parameters, embedding coverage
  • Per-service config table: mode, slot assignments, last updated
  • Config modal: edit or switch mode for any service; preview auto-select candidates
  • Global custom slots: assign custom metrics to dims 21–28

Via Grafana

The ml-quality.json dashboard (included in the dashboards/ directory) visualises:

  • Embeddings ingested per minute
  • Services covered over time
  • Weirdness score distribution

Troubleshooting

services_covered is 0 or lower than expected

  • Check that services are actively sending telemetry. The vectorizer only processes windows where at least one telemetry signal exists for the service.
  • Verify the vectorizer tick is running: look for the message "processing embedding window" in the aiops-engine logs.

Scores are always near 0 for a new service

This is expected during the warm-up period (roughly the first 30 minutes of data). A new service needs enough baseline vectors before the Euclidean distance between consecutive windows, and the Watchdog's Z-score over the rolling score history, becomes meaningful. The Watchdog marks new services as warming_up and suppresses alerts during this period.

A service has a permanently high weirdness score

The baseline may have been built from anomalous data. Reset the baseline by deleting that service's embeddings from ClickHouse:

ALTER TABLE infrasage_service_embeddings DELETE WHERE service_id = 'my-service';

The index will rebuild automatically on the next vectorizer cycle.