Pillar Embeddings
Pillar Embeddings generalise InfraSage's ML anomaly detection beyond the individual service boundary. Instead of asking "Is this service behaving unusually?", pillar embeddings ask "Is this tenant, product, or any other business dimension you define behaving unusually — in aggregate?"
This gives you a second layer of visibility that service-level anomalies can miss: a single service may look fine in isolation, but 50 services belonging to one tenant may collectively show an elevated error rate and rising latency that become visible only when you aggregate them.
Core Idea
Every minute, for each active pillar:
Services with a mapping in the service-map
│
▼ GROUP BY <key_expr>
One embedding vector per key value
(e.g. one vector per tenant_id)
│
▼ cosine distance vs. baseline
Weirdness score [0, 1] per key
│
▼
Stored in ClickHouse → queryable via API + UI
Each embedding vector has N behavioural dimensions plus 5 time features: four cyclic encodings (time_sin, time_cos, day_sin, day_cos) and an is_weekend flag. The time features allow the model to distinguish "elevated volume on Monday morning" (normal) from "elevated volume at 3 AM Saturday" (anomalous).
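As an illustration of how such time features can be derived, here is a minimal Python sketch. The feature names come from the list above; the exact encoding (a 24-hour period for time_sin/time_cos, a 7-day period for day_sin/day_cos, Saturday and Sunday as the weekend) is an assumption, and time_features is a hypothetical helper name, not part of InfraSage.

```python
import math
from datetime import datetime, timezone

def time_features(ts: datetime) -> list:
    """Return [time_sin, time_cos, day_sin, day_cos, is_weekend] for a timestamp."""
    # Fraction of the day elapsed, in [0, 1). Assumed period: 24 hours.
    day_frac = (ts.hour * 60 + ts.minute) / (24 * 60)
    # Fraction of the week by whole days, Monday = 0. Assumed period: 7 days.
    week_frac = ts.weekday() / 7
    return [
        math.sin(2 * math.pi * day_frac),
        math.cos(2 * math.pi * day_frac),
        math.sin(2 * math.pi * week_frac),
        math.cos(2 * math.pi * week_frac),
        1.0 if ts.weekday() >= 5 else 0.0,  # Saturday = 5, Sunday = 6
    ]

monday_9am = time_features(datetime(2026, 5, 11, 9, 0, tzinfo=timezone.utc))
saturday_3am = time_features(datetime(2026, 5, 16, 3, 0, tzinfo=timezone.utc))
```

Encoding the hour as a sine/cosine pair keeps 23:59 and 00:00 close together in the vector, which a raw hour number would not.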
Built-in Pillars
Two pillars are included out-of-the-box and cannot be deleted (only toggled on/off).
Tenant pillar
Groups all services belonging to the same tenant_id and computes an 11-dimensional embedding:
| Dim | Name | What it measures | Scale |
|---|---|---|---|
| 0 | error_rate | Aggregate errors / total events | linear, ceiling 1 |
| 1 | volume | Total telemetry event count | log-scale |
| 2 | p99_latency | Average p99 latency across services | linear, ceiling 5000 ms |
| 3 | fatal_count | Fatal/critical log events | log-scale |
| 4 | warn_rate | Warnings / total log events | linear, ceiling 1 |
| 5 | unique_services | Distinct active services | log-scale |
| 6–10 | time features | Hour-of-day + day-of-week cyclic encoding | — |
Key expression: m.tenant_id
Product pillar
Groups services by product_id, replacing unique_services with queue_depth (useful for async / message-driven products):
| Dim | Name | What it measures | Scale |
|---|---|---|---|
| 0 | error_rate | Aggregate error rate | linear, ceiling 1 |
| 1 | volume | Total event volume | log-scale |
| 2 | p99_latency | p99 latency across product services | linear, ceiling 5000 ms |
| 3 | fatal_count | Fatal events | log-scale |
| 4 | warn_rate | Warn rate | linear, ceiling 1 |
| 5 | queue_depth | Peak queue depth | linear, ceiling 10 000 |
| 6–10 | time features | Cyclic time encoding | — |
Key expression: m.product_id
Custom Pillars
You can define any additional grouping dimension — region, availability zone, team, environment, cost centre, etc.
1. Create the ClickHouse tables
Custom pillars require their own embedding and anomaly tables. Run this in ClickHouse before creating the pillar:
-- Replace "region" with your pillar name
CREATE TABLE IF NOT EXISTS infrasage_region_embeddings (
key_value String,
window_timestamp DateTime,
embedding Array(Float32)
) ENGINE = MergeTree()
ORDER BY (key_value, window_timestamp)
TTL window_timestamp + INTERVAL 30 DAY;
CREATE TABLE IF NOT EXISTS infrasage_region_anomaly_scores (
key_value String,
window_timestamp DateTime,
weirdness_score Float64
) ENGINE = MergeTree()
ORDER BY (key_value, window_timestamp)
TTL window_timestamp + INTERVAL 90 DAY;
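If you create several custom pillars, the two statements above can be rendered from a pillar name. The following Python sketch is one way to do that; pillar_ddl is a hypothetical helper, the infrasage_<name>_embeddings and infrasage_<name>_anomaly_scores naming follows the "region" example, and the name check reuses the table-name pattern listed under Troubleshooting.

```python
import re

# Table-name pattern quoted in the Troubleshooting section.
TABLE_NAME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9_]{0,63}")

def pillar_ddl(pillar: str) -> str:
    """Render the embedding and anomaly-score DDL for one custom pillar."""
    vector_table = f"infrasage_{pillar}_embeddings"
    anomaly_table = f"infrasage_{pillar}_anomaly_scores"
    for table in (vector_table, anomaly_table):
        if not TABLE_NAME_RE.fullmatch(table):
            raise ValueError(f"invalid table name: {table!r}")
    return f"""CREATE TABLE IF NOT EXISTS {vector_table} (
    key_value String,
    window_timestamp DateTime,
    embedding Array(Float32)
) ENGINE = MergeTree()
ORDER BY (key_value, window_timestamp)
TTL window_timestamp + INTERVAL 30 DAY;

CREATE TABLE IF NOT EXISTS {anomaly_table} (
    key_value String,
    window_timestamp DateTime,
    weirdness_score Float64
) ENGINE = MergeTree()
ORDER BY (key_value, window_timestamp)
TTL window_timestamp + INTERVAL 90 DAY;
"""

print(pillar_ddl("region"))
```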
2. Register the pillar via API
curl -X PUT https://api.infrasage.dev/api/v1/pillars/region/config \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"key_expr": "mapElement(m.custom_keys, '\''region'\'')",
"vector_table": "infrasage_region_embeddings",
"anomaly_table": "infrasage_region_anomaly_scores",
"enabled": true,
"dims": [
{
"name": "error_rate",
"sql_expr": "toFloat64(sum(ts.error_count)) / toFloat64(greatest(sum(ts.log_total + ts.trace_total + ts.metric_total + ts.event_total), 1))",
"ceiling": 1
},
{
"name": "volume",
"sql_expr": "toFloat64(sum(ts.log_total + ts.trace_total + ts.metric_total + ts.event_total))",
"ceiling": 0
},
{
"name": "p99_latency",
"sql_expr": "toFloat64(avgMergeIf(am.avg_value, am.name IN ('\''p99_latency_ms'\'', '\''latency_p99'\'')))",
"ceiling": 5000
},
{
"name": "fatal_count",
"sql_expr": "toFloat64(sum(ts.fatal_count))",
"ceiling": 0
}
]
}'
Or use the Pillar Embeddings UI page → Manage Pillars tab → New custom pillar.
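The '\'' sequences in the curl command exist only to escape single quotes inside a single-quoted shell string. Building the same body in a script avoids that. The Python sketch below mirrors the fields and endpoint of the curl example; the helper names are hypothetical, and the request object is only constructed here, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.infrasage.dev/api/v1"
TOTAL = "sum(ts.log_total + ts.trace_total + ts.metric_total + ts.event_total)"

def build_region_pillar() -> dict:
    """Same payload as the curl example, without shell quoting."""
    return {
        "key_expr": "mapElement(m.custom_keys, 'region')",
        "vector_table": "infrasage_region_embeddings",
        "anomaly_table": "infrasage_region_anomaly_scores",
        "enabled": True,
        "dims": [
            {"name": "error_rate", "ceiling": 1,
             "sql_expr": f"toFloat64(sum(ts.error_count)) / toFloat64(greatest({TOTAL}, 1))"},
            {"name": "volume", "ceiling": 0,
             "sql_expr": f"toFloat64({TOTAL})"},
            {"name": "p99_latency", "ceiling": 5000,
             "sql_expr": "toFloat64(avgMergeIf(am.avg_value, "
                         "am.name IN ('p99_latency_ms', 'latency_p99')))"},
            {"name": "fatal_count", "ceiling": 0,
             "sql_expr": "toFloat64(sum(ts.fatal_count))"},
        ],
    }

def put_pillar_request(name: str, config: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) the PUT request for a pillar config."""
    return urllib.request.Request(
        f"{API_BASE}/pillars/{name}/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="PUT",
    )

request = put_pillar_request("region", build_region_pillar(), token="example-token")
# To send it: urllib.request.urlopen(request)
```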
3. Add the custom key to the service map
curl -X PUT https://api.infrasage.dev/api/v1/pillars/service-map \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"entries": [
{
"service_id": "payment-service",
"tenant_id": "acme-corp",
"product_id": "checkout",
"custom_keys": { "region": "ap-south-1" }
},
{
"service_id": "auth-service",
"tenant_id": "acme-corp",
"product_id": "platform",
"custom_keys": { "region": "us-east-1" }
}
]
}'
Key Expression Reference
The key_expr is a ClickHouse SQL expression evaluated in the context of the pillar query. Three table aliases are in scope:
| Alias | Table | Common columns |
|---|---|---|
| ts | infrasage_telemetry_signals | service_id, error_count, log_total, trace_total, metric_total, event_total, fatal_count, warn_count |
| m | infrasage_service_pillar_map | tenant_id, product_id, custom_keys Map(String,String) |
| am | infrasage_aggregated_metrics | name, avg_value, max_value, min_value |
Common key expressions:
-- Built-ins:
m.tenant_id
m.product_id
-- Custom keys from the service map:
mapElement(m.custom_keys, 'region')
mapElement(m.custom_keys, 'availability_zone')
mapElement(m.custom_keys, 'team')
mapElement(m.custom_keys, 'env')
:::caution SQL Safety
Key expressions and dimension SQL are validated server-side. DDL/DML keywords (DROP, DELETE, INSERT, UPDATE, ALTER, CREATE, TRUNCATE) and comment sequences (--, /*) are rejected. Only admin-role users can create or update custom pillars.
:::
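To catch obvious rejections before calling the API, you can run a local pre-check against the same lists. This Python sketch is an approximation only: the keyword and comment lists come from the caution above, but the matching rule (case-insensitive, whole word) is an assumption, and the server-side validator remains authoritative. find_violations is a hypothetical helper name.

```python
import re

DISALLOWED_KEYWORDS = ("DROP", "DELETE", "INSERT", "UPDATE", "ALTER", "CREATE", "TRUNCATE")
COMMENT_SEQUENCES = ("--", "/*")

def find_violations(expr: str) -> list:
    """Return the disallowed keywords and comment sequences found in expr."""
    # Assumed rule: keywords match case-insensitively as whole words.
    found = [kw for kw in DISALLOWED_KEYWORDS
             if re.search(rf"\b{kw}\b", expr, re.IGNORECASE)]
    # Comment sequences are matched as plain substrings.
    found += [seq for seq in COMMENT_SEQUENCES if seq in expr]
    return found

print(find_violations("mapElement(m.custom_keys, 'region')"))
print(find_violations("m.tenant_id; drop table x -- oops"))
```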
Dimension SQL Reference
Ceiling values
| Ceiling | Normalization | Use for |
|---|---|---|
| 0 | Log-scale: log(1 + value) / log(1 + 1e6) | Unbounded counters such as volume, fatal_count |
| > 0 | Linear: min(value, ceiling) / ceiling | Rates (ceiling 1), latency ms (5000), queue depth (10 000) |
Common SQL expressions
-- Error rate (0–1)
toFloat64(sum(ts.error_count)) / toFloat64(greatest(sum(ts.log_total + ts.trace_total + ts.metric_total + ts.event_total), 1))
-- Total volume (log-scale, ceiling 0)
toFloat64(sum(ts.log_total + ts.trace_total + ts.metric_total + ts.event_total))
-- Aggregate p99 latency (ceiling 5000)
toFloat64(avgMergeIf(am.avg_value, am.name IN ('p99_latency_ms', 'latency_p99', 'p99_latency', 'http_request_duration_p99')))
-- Fatal event count (log-scale, ceiling 0)
toFloat64(sum(ts.fatal_count))
-- Warn rate (0–1)
toFloat64(sum(ts.warn_count)) / toFloat64(greatest(sum(ts.log_total), 1))
-- Unique active services (log-scale, ceiling 0)
toFloat64(uniq(ts.service_id))
-- Peak queue depth for this group (ceiling 10000)
toFloat64(avgMergeIf(am.max_value, am.name IN ('queue_depth', 'queue_size')))
-- Custom metric average
toFloat64(avgMergeIf(am.avg_value, am.name = 'my_custom_metric'))
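Each expression above yields a raw value that is then normalized according to its ceiling, using the two formulas in the Ceiling values table. The Python sketch below applies those formulas directly. The function name is hypothetical, and because this page does not say whether log-scale results above 1e6 are clamped, the sketch does not clamp them.

```python
import math

LOG_SCALE_REFERENCE = 1e6  # constant from the log-scale formula

def normalize(value: float, ceiling: float) -> float:
    """Normalize a raw dimension value using its configured ceiling."""
    if ceiling == 0:
        # Log-scale: log(1 + value) / log(1 + 1e6)
        return math.log(1 + value) / math.log(1 + LOG_SCALE_REFERENCE)
    # Linear: min(value, ceiling) / ceiling
    return min(value, ceiling) / ceiling

print(normalize(0.05, 1))      # error rate, linear with ceiling 1
print(normalize(7500, 5000))   # latency above the ceiling saturates
print(normalize(999, 0))       # event count on the log scale
```

Note how the log scale compresses counters: 999 events lands near the middle of the range, while 1 000 000 events reaches the top.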
Weirdness Score
Each pillar key (e.g. tenant acme-corp) receives a weirdness score between 0 and 1 every minute:
| Score | Interpretation |
|---|---|
| 0.0–0.4 | Normal — within expected behavioural range |
| 0.4–0.7 | Elevated — worth investigating |
| 0.7–1.0 | Anomalous — likely an incident affecting this tenant/product |
The score is the cosine distance between the current embedding and a rolling baseline computed from the previous 30-day window using the HNSW index.
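The following Python sketch shows the shape of that calculation and the score bands from the table. It makes several assumptions that this page does not confirm: the baseline is taken as the plain mean of historical vectors (the engine uses an HNSW index, whose exact role is not described here), the distance is clamped to [0, 1], and boundary values 0.4 and 0.7 are assigned to the higher band.

```python
import math

def cosine_distance(a: list, b: list) -> float:
    """1 - cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)

def weirdness(current: list, history: list) -> float:
    """Distance between the current embedding and the mean of past embeddings."""
    dims = len(current)
    baseline = [sum(v[i] for v in history) / len(history) for i in range(dims)]
    # Clamp to [0, 1]; cosine distance can reach 2 for opposing vectors.
    return min(max(cosine_distance(current, baseline), 0.0), 1.0)

def band(score: float) -> str:
    """Map a score to the interpretation bands in the table above."""
    if score < 0.4:
        return "normal"
    if score < 0.7:
        return "elevated"
    return "anomalous"

history = [[0.02, 0.50, 0.10], [0.03, 0.52, 0.11]]
score = weirdness([0.60, 0.50, 0.90], history)
print(round(score, 3), band(score))
```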
API Reference
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/pillars | List all registered pillars |
| GET | /api/v1/pillars/{name}/config | Get pillar config and dims |
| PUT | /api/v1/pillars/{name}/config | Create or update a pillar (built-ins: enabled toggle only) |
| DELETE | /api/v1/pillars/{name} | Soft-delete a custom pillar |
| GET | /api/v1/pillars/{name}/anomalies | Query weirdness scores |
| GET | /api/v1/pillars/service-map | List service→pillar-key mappings |
| PUT | /api/v1/pillars/service-map | Bulk upsert service mappings |
Query anomaly scores
# Last 24 hours for the tenant pillar, filtered to one tenant
GET /api/v1/pillars/tenant/anomalies?hours=24&key=acme-corp&limit=100
Response (illustrative shape; a request filtered with key=acme-corp would contain only acme-corp rows):
{
"pillar": "tenant",
"anomalies": [
{ "key": "acme-corp", "window": "2026-05-13T10:00:00Z", "score": 0.82 },
{ "key": "beta-inc", "window": "2026-05-13T10:00:00Z", "score": 0.23 }
],
"count": 2,
"lookback_hours": 24
}
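A client can build the query string and pick out anomalous keys from a response of this shape. The Python sketch below uses the base URL from the curl examples and the 0.7 threshold from the Weirdness Score table; the helper names are hypothetical, and the sample is parsed locally rather than fetched.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://api.infrasage.dev/api/v1"

def anomalies_url(pillar: str, hours: int = 24, key: str = None, limit: int = 100) -> str:
    """Build the anomalies query URL; the key filter is optional."""
    params = {"hours": hours, "limit": limit}
    if key:
        params["key"] = key
    return f"{API_BASE}/pillars/{pillar}/anomalies?{urlencode(params)}"

def anomalous_keys(response: dict, threshold: float = 0.7) -> list:
    """Return the sorted, de-duplicated keys whose score meets the threshold."""
    return sorted({a["key"] for a in response["anomalies"] if a["score"] >= threshold})

sample = json.loads("""
{
  "pillar": "tenant",
  "anomalies": [
    { "key": "acme-corp", "window": "2026-05-13T10:00:00Z", "score": 0.82 },
    { "key": "beta-inc", "window": "2026-05-13T10:00:00Z", "score": 0.23 }
  ],
  "count": 2,
  "lookback_hours": 24
}
""")

print(anomalies_url("tenant", key="acme-corp"))
print(anomalous_keys(sample))
```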
UI — Pillar Embeddings Page
Navigate to Pillar Embeddings in the sidebar (operator role required).
Anomaly Scores tab
- Select any enabled pillar using the pill buttons
- Filter by look-back window (1 h to 7 d) and optional key filter
- A dim strip shows all embedding dimensions including the 5 automatic time features
- Score bars color-code anomalies: green (low) → yellow (medium) → red (high)
Service Map tab
- View all current service→tenant/product/custom_key mappings
- Bulk import via CSV: one line per service in the format service_id, tenant_id, product_id
- Changes take effect on the next vectorizer tick (≤ 60 s)
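For automation outside the UI, the same three-column CSV can be converted into the body accepted by PUT /api/v1/pillars/service-map. This is a minimal Python sketch, assuming no header row and no custom keys (the CSV format above has no column for them); csv_to_payload is a hypothetical helper name.

```python
import csv
import io

def csv_to_payload(text: str) -> dict:
    """Convert 'service_id, tenant_id, product_id' lines into a service-map body."""
    entries = []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        fields = [field.strip() for field in row]
        if len(fields) != 3 or not all(fields):
            raise ValueError(f"line {line_no}: expected 3 non-empty columns, got {row!r}")
        service_id, tenant_id, product_id = fields
        entries.append({
            "service_id": service_id,
            "tenant_id": tenant_id,
            "product_id": product_id,
        })
    return {"entries": entries}

payload = csv_to_payload(
    "payment-service, acme-corp, checkout\n"
    "auth-service, acme-corp, platform\n"
)
print(payload)
```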
Manage Pillars tab
- Toggle built-in pillars on or off
- Create custom pillars with an inline form
- Delete custom pillars (built-ins are protected)
Service Map
Services must be explicitly mapped before they contribute to pillar embeddings. A service with no mapping is silently skipped by the pillar query.
See the Service Pillar Map guide for bulk import and automation strategies.
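Because unmapped services are skipped without any error, it can help to compare the services you expect against the current map. The Python sketch below assumes that GET /api/v1/pillars/service-map returns entries shaped like the PUT body (not confirmed by this page) and that the list of active service IDs comes from your own inventory; unmapped_services is a hypothetical helper name.

```python
def unmapped_services(active_service_ids: list, map_entries: list) -> list:
    """Return active services that have no entry in the service map."""
    mapped = {entry["service_id"] for entry in map_entries}
    return sorted(set(active_service_ids) - mapped)

# Entries in the assumed response shape, mirroring the PUT body.
map_entries = [
    {"service_id": "payment-service", "tenant_id": "acme-corp", "product_id": "checkout"},
    {"service_id": "auth-service", "tenant_id": "acme-corp", "product_id": "platform"},
]
active = ["payment-service", "auth-service", "search-service"]

print(unmapped_services(active, map_entries))
```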
Performance Notes
- Pillar window processing runs in parallel goroutines alongside the service-level vectorizer, adding minimal latency to the pipeline
- Each pillar query hits ClickHouse with an INNER JOIN on infrasage_service_pillar_map, so keep the service map lean and indexed
- For tenants with thousands of services, ClickHouse's columnar aggregation handles the GROUP BY efficiently; no additional tuning is required under 100 k events/min per window
- Embeddings and scores are written with async_insert = 1 to batch ClickHouse writes
Troubleshooting
No anomaly scores appear for a pillar
- Check that the pillar is enabled: GET /api/v1/pillars/{name}/config → "enabled": true
- Check that services are mapped: GET /api/v1/pillars/service-map (at least one entry must exist)
- Verify the ClickHouse tables exist with the correct schema
- Check aiops-engine logs for errors containing the pillar name:
kubectl logs -n infrasage -l app.kubernetes.io/component=aiops-engine | grep -i pillar
Custom pillar returns 400 on save
- key_expr or a dim sql_expr contains a disallowed keyword (DROP, DELETE, etc.) or comment sequence
- vector_table / anomaly_table does not match ^[a-zA-Z][a-zA-Z0-9_]{0,63}$
- The tables do not yet exist in ClickHouse; create them first (see step 1 above)