Pillar Embeddings

Pillar Embeddings generalise InfraSage's ML anomaly detection beyond the individual service boundary. Instead of asking "Is this service behaving unusually?", pillar embeddings ask "Is this tenant, product, or any other business dimension you define behaving unusually — in aggregate?"

This gives you a second layer of visibility that service-level anomalies can miss: a single service may look fine in isolation, but 50 services belonging to one tenant may collectively show an elevated error rate and rising latency that only becomes visible when you aggregate them.


Core Idea

Every minute, for each active pillar:

Services with a mapping in the service-map
        │
        ▼  GROUP BY <key_expr>
One embedding vector per key value
(e.g. one vector per tenant_id)
        │
        ▼  cosine distance vs. baseline
Weirdness score [0, 1] per key
        │
        ▼
Stored in ClickHouse → queryable via API + UI

Each embedding vector has N behavioural dimensions plus 5 automatic time features: four cyclic encodings (time_sin, time_cos, day_sin, day_cos) and an is_weekend flag. The time features allow the model to distinguish "elevated volume on Monday morning" (normal) from "elevated volume at 3 AM Saturday" (anomalous).
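To make the time features concrete, here is a minimal sketch of how five such values could be derived from a window timestamp. The exact encoding InfraSage uses is not documented here, so the periods (24 hours and 7 days), the Monday-is-zero convention, and the function name time_features are all assumptions for illustration.

```python
import math
from datetime import datetime, timezone


def time_features(ts: datetime) -> list[float]:
    """Return [time_sin, time_cos, day_sin, day_cos, is_weekend] for a timestamp."""
    # Fraction of the day elapsed, in [0, 1) (assumed 24-hour period).
    day_frac = (ts.hour * 60 + ts.minute) / (24 * 60)
    # Fraction of the week elapsed; Monday is 0 (assumed 7-day period).
    week_frac = ts.weekday() / 7
    return [
        math.sin(2 * math.pi * day_frac),
        math.cos(2 * math.pi * day_frac),
        math.sin(2 * math.pi * week_frac),
        math.cos(2 * math.pi * week_frac),
        1.0 if ts.weekday() >= 5 else 0.0,  # Saturday or Sunday
    ]


monday_morning = time_features(datetime(2026, 5, 11, 9, 0, tzinfo=timezone.utc))
saturday_night = time_features(datetime(2026, 5, 16, 3, 0, tzinfo=timezone.utc))
print(monday_morning)
print(saturday_night)
```

Encoding hour and weekday as sine/cosine pairs keeps 23:59 close to 00:00 in vector space, which a raw hour number would not.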


Built-in Pillars

Two pillars are included out-of-the-box and cannot be deleted (only toggled on/off).

Tenant pillar

Groups all services belonging to the same tenant_id and computes an 11-dimensional embedding:

| Dim | Name | What it measures | Scale |
|---|---|---|---|
| 0 | error_rate | Aggregate errors / total events | linear, ceiling 1 |
| 1 | volume | Total telemetry event count | log-scale |
| 2 | p99_latency | Average p99 latency across services | linear, ceiling 5000 ms |
| 3 | fatal_count | Fatal/critical log events | log-scale |
| 4 | warn_rate | Warnings / total log events | linear, ceiling 1 |
| 5 | unique_services | Distinct active services | log-scale |
| 6–10 | time features | Hour-of-day + day-of-week cyclic encoding | |

Key expression: m.tenant_id

Product pillar

Groups services by product_id, replacing unique_services with queue_depth (useful for async / message-driven products):

| Dim | Name | What it measures | Scale |
|---|---|---|---|
| 0 | error_rate | Aggregate error rate | linear, ceiling 1 |
| 1 | volume | Total event volume | log-scale |
| 2 | p99_latency | p99 latency across product services | linear, ceiling 5000 ms |
| 3 | fatal_count | Fatal events | log-scale |
| 4 | warn_rate | Warn rate | linear, ceiling 1 |
| 5 | queue_depth | Peak queue depth | linear, ceiling 10 000 |
| 6–10 | time features | Cyclic time encoding | |

Key expression: m.product_id


Custom Pillars

You can define any additional grouping dimension — region, availability zone, team, environment, cost centre, etc.

1. Create the ClickHouse tables

Custom pillars require their own embedding and anomaly tables. Run this in ClickHouse before creating the pillar:

-- Replace "region" with your pillar name
CREATE TABLE IF NOT EXISTS infrasage_region_embeddings (
    key_value        String,
    window_timestamp DateTime,
    embedding        Array(Float32)
) ENGINE = MergeTree()
ORDER BY (key_value, window_timestamp)
TTL window_timestamp + INTERVAL 30 DAY;

CREATE TABLE IF NOT EXISTS infrasage_region_anomaly_scores (
    key_value        String,
    window_timestamp DateTime,
    weirdness_score  Float64
) ENGINE = MergeTree()
ORDER BY (key_value, window_timestamp)
TTL window_timestamp + INTERVAL 90 DAY;
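If you create several custom pillars, you may prefer to generate this DDL rather than edit it by hand. The sketch below renders the two statements for a given pillar name and checks the resulting table names against the pattern listed under Troubleshooting. The helper names pillar_tables and pillar_ddl are hypothetical, and the infrasage_<name>_embeddings / infrasage_<name>_anomaly_scores naming is inferred from the region example rather than a documented requirement.

```python
import re

# Table-name pattern quoted in the Troubleshooting section of this page.
TABLE_NAME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9_]{0,63}$")


def pillar_tables(name: str) -> tuple[str, str]:
    """Return (vector_table, anomaly_table) for a pillar name, or raise ValueError."""
    tables = (f"infrasage_{name}_embeddings", f"infrasage_{name}_anomaly_scores")
    for table in tables:
        if not TABLE_NAME_RE.fullmatch(table):
            raise ValueError(f"invalid table name: {table!r}")
    return tables


def pillar_ddl(name: str) -> str:
    """Render the two CREATE TABLE statements shown above for one pillar."""
    vector_table, anomaly_table = pillar_tables(name)
    return (
        f"CREATE TABLE IF NOT EXISTS {vector_table} (\n"
        "    key_value String,\n"
        "    window_timestamp DateTime,\n"
        "    embedding Array(Float32)\n"
        ") ENGINE = MergeTree()\n"
        "ORDER BY (key_value, window_timestamp)\n"
        "TTL window_timestamp + INTERVAL 30 DAY;\n\n"
        f"CREATE TABLE IF NOT EXISTS {anomaly_table} (\n"
        "    key_value String,\n"
        "    window_timestamp DateTime,\n"
        "    weirdness_score Float64\n"
        ") ENGINE = MergeTree()\n"
        "ORDER BY (key_value, window_timestamp)\n"
        "TTL window_timestamp + INTERVAL 90 DAY;\n"
    )


print(pillar_ddl("team"))
```

Validating the name before interpolating it keeps characters such as hyphens or quotes out of the generated SQL.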

2. Register the pillar via API

curl -X PUT https://api.infrasage.dev/api/v1/pillars/region/config \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "key_expr": "mapElement(m.custom_keys, '\''region'\'')",
    "vector_table": "infrasage_region_embeddings",
    "anomaly_table": "infrasage_region_anomaly_scores",
    "enabled": true,
    "dims": [
      {
        "name": "error_rate",
        "sql_expr": "toFloat64(sum(ts.error_count)) / toFloat64(greatest(sum(ts.log_total + ts.trace_total + ts.metric_total + ts.event_total), 1))",
        "ceiling": 1
      },
      {
        "name": "volume",
        "sql_expr": "toFloat64(sum(ts.log_total + ts.trace_total + ts.metric_total + ts.event_total))",
        "ceiling": 0
      },
      {
        "name": "p99_latency",
        "sql_expr": "toFloat64(avgMergeIf(am.avg_value, am.name IN ('\''p99_latency_ms'\'', '\''latency_p99'\'')))",
        "ceiling": 5000
      },
      {
        "name": "fatal_count",
        "sql_expr": "toFloat64(sum(ts.fatal_count))",
        "ceiling": 0
      }
    ]
  }'

Or use the Pillar Embeddings UI page → Manage Pillars tab → New custom pillar.
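The '\'' sequences in the curl command are shell quoting, not part of the JSON. If you script pillar registration, building the body with a JSON library avoids that escaping entirely. The following is a minimal sketch using only the Python standard library; the function name pillar_request is hypothetical, the config is shortened to two dims for brevity, and the request is constructed but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.infrasage.dev"


def pillar_request(name: str, token: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) the PUT request that registers a pillar."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/pillars/{name}/config",
        data=json.dumps(config).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


region_config = {
    # Single quotes can be written directly; json.dumps handles the escaping.
    "key_expr": "mapElement(m.custom_keys, 'region')",
    "vector_table": "infrasage_region_embeddings",
    "anomaly_table": "infrasage_region_anomaly_scores",
    "enabled": True,
    "dims": [
        {"name": "volume",
         "sql_expr": "toFloat64(sum(ts.log_total + ts.trace_total + ts.metric_total + ts.event_total))",
         "ceiling": 0},
        {"name": "fatal_count",
         "sql_expr": "toFloat64(sum(ts.fatal_count))",
         "ceiling": 0},
    ],
}

req = pillar_request("region", "example-token", region_config)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```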

3. Add the custom key to the service map

curl -X PUT https://api.infrasage.dev/api/v1/pillars/service-map \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "entries": [
      {
        "service_id": "payment-service",
        "tenant_id": "acme-corp",
        "product_id": "checkout",
        "custom_keys": { "region": "ap-south-1" }
      },
      {
        "service_id": "auth-service",
        "tenant_id": "acme-corp",
        "product_id": "platform",
        "custom_keys": { "region": "us-east-1" }
      }
    ]
  }'

Key Expression Reference

The key_expr is a ClickHouse SQL expression evaluated in the context of the pillar query. Three table aliases are in scope:

| Alias | Table | Common columns |
|---|---|---|
| ts | infrasage_telemetry_signals | service_id, error_count, log_total, trace_total, metric_total, event_total, fatal_count, warn_count |
| m | infrasage_service_pillar_map | tenant_id, product_id, custom_keys Map(String,String) |
| am | infrasage_aggregated_metrics | name, avg_value, max_value, min_value |

Common key expressions:

-- Built-ins:
m.tenant_id
m.product_id

-- Custom keys from the service map:
mapElement(m.custom_keys, 'region')
mapElement(m.custom_keys, 'availability_zone')
mapElement(m.custom_keys, 'team')
mapElement(m.custom_keys, 'env')

:::caution SQL Safety
Key expressions and dimension SQL are validated server-side. DDL/DML keywords (DROP, DELETE, INSERT, UPDATE, ALTER, CREATE, TRUNCATE) and comment sequences (--, /*) are rejected. Only admin-role users can create or update custom pillars.
:::


Dimension SQL Reference

Ceiling values

| Ceiling | Normalization | Use for |
|---|---|---|
| 0 | Log-scale: log(1 + value) / log(1 + 1e6) | Unbounded counters: volume, fatal_count |
| > 0 | Linear: min(value, ceiling) / ceiling | Rates (ceiling 1), latency ms (5000), queue depth (10 000) |
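The two normalisation rules can be sketched as a single function. This is an illustration of the formulas in the table, not InfraSage's implementation; the function name normalise is hypothetical, and raw values are assumed to be non-negative.

```python
import math

LOG_SCALE_REFERENCE = 1e6  # the 1e6 constant from the log-scale formula


def normalise(value: float, ceiling: float) -> float:
    """Map a raw dimension value to a normalised embedding component."""
    if ceiling > 0:
        # Linear: clip at the ceiling, then scale into [0, 1].
        return min(value, ceiling) / ceiling
    # Log-scale (ceiling 0): compresses unbounded counters.
    return math.log(1 + value) / math.log(1 + LOG_SCALE_REFERENCE)


print(normalise(2500, 5000))   # p99 latency of 2500 ms, ceiling 5000
print(normalise(9000, 5000))   # latency above the ceiling is clipped
print(normalise(1200, 0))      # volume of 1200 events, log-scale
```

Note that, as written in the table, the log-scale formula has no upper clip, so a counter above 1e6 would produce a component slightly above 1.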

Common SQL expressions

-- Error rate (0–1)
toFloat64(sum(ts.error_count)) / toFloat64(greatest(sum(ts.log_total + ts.trace_total + ts.metric_total + ts.event_total), 1))

-- Total volume (log-scale, ceiling 0)
toFloat64(sum(ts.log_total + ts.trace_total + ts.metric_total + ts.event_total))

-- Aggregate p99 latency (ceiling 5000)
toFloat64(avgMergeIf(am.avg_value, am.name IN ('p99_latency_ms', 'latency_p99', 'p99_latency', 'http_request_duration_p99')))

-- Fatal event count (log-scale, ceiling 0)
toFloat64(sum(ts.fatal_count))

-- Warn rate (0–1)
toFloat64(sum(ts.warn_count)) / toFloat64(greatest(sum(ts.log_total), 1))

-- Unique active services (log-scale, ceiling 0)
toFloat64(uniq(ts.service_id))

-- Peak queue depth for this group (ceiling 10000)
toFloat64(avgMergeIf(am.max_value, am.name IN ('queue_depth', 'queue_size')))

-- Custom metric average
toFloat64(avgMergeIf(am.avg_value, am.name = 'my_custom_metric'))

Weirdness Score

Each pillar key (e.g. tenant acme-corp) receives a weirdness score between 0 and 1 every minute:

| Score | Interpretation |
|---|---|
| 0.0–0.4 | Normal: within expected behavioural range |
| 0.4–0.7 | Elevated: worth investigating |
| 0.7–1.0 | Anomalous: likely an incident affecting this tenant/product |

The score is the cosine distance between the current embedding and a rolling baseline computed from the previous 30-day window using the HNSW index.
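As a rough illustration of the scoring step, the sketch below computes a cosine distance against a baseline vector. It makes two simplifying assumptions that are not taken from InfraSage: the baseline is the element-wise mean of historical embeddings (the HNSW-based lookup is not reproduced), and the result is clamped to [0, 1]. The function names are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return 1 - dot / (norm_a * norm_b)


def weirdness(current: list[float], history: list[list[float]]) -> float:
    """Score the current embedding against a mean-of-history baseline."""
    # Stand-in baseline: element-wise mean of the historical embeddings.
    baseline = [sum(column) / len(history) for column in zip(*history)]
    # Clamp to the documented [0, 1] range (assumption).
    return min(max(cosine_distance(current, baseline), 0.0), 1.0)


history = [[0.02, 0.55, 0.10], [0.03, 0.57, 0.12], [0.02, 0.54, 0.11]]
print(weirdness([0.02, 0.56, 0.11], history))  # close to the baseline
print(weirdness([0.90, 0.20, 0.95], history))  # very different shape
```

Because cosine distance compares direction rather than magnitude, it reacts to a change in the mix of signals (for example, error rate rising relative to volume) rather than to uniform growth across all dimensions.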


API Reference

| Method | Path | Description |
|---|---|---|
| GET | /api/v1/pillars | List all registered pillars |
| GET | /api/v1/pillars/{name}/config | Get pillar config and dims |
| PUT | /api/v1/pillars/{name}/config | Create or update a pillar (built-ins: enabled toggle only) |
| DELETE | /api/v1/pillars/{name} | Soft-delete a custom pillar |
| GET | /api/v1/pillars/{name}/anomalies | Query weirdness scores |
| GET | /api/v1/pillars/service-map | List service→pillar-key mappings |
| PUT | /api/v1/pillars/service-map | Bulk upsert service mappings |

Query anomaly scores

# Last 24 hours for the tenant pillar, filtered to one tenant
GET /api/v1/pillars/tenant/anomalies?hours=24&key=acme-corp&limit=100

Response:

{
  "pillar": "tenant",
  "anomalies": [
    { "key": "acme-corp", "window": "2026-05-13T10:00:00Z", "score": 0.82 }
  ],
  "count": 1,
  "lookback_hours": 24
}

UI — Pillar Embeddings Page

Navigate to Pillar Embeddings in the sidebar (operator role required).

Anomaly Scores tab

  • Select any enabled pillar using the pill buttons
  • Filter by look-back window (1 h to 7 d) and optional key filter
  • A dim strip shows all embedding dimensions including the 5 automatic time features
  • Score bars colour-code anomalies: green (low) → yellow (medium) → red (high)

Service Map tab

  • View all current service→tenant/product/custom_key mappings
  • Bulk import via CSV: one line per service in the format service_id, tenant_id, product_id
  • Changes take effect on the next vectorizer tick (≤ 60 s)
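If you would rather push the same CSV through the API than the UI, the rows can be converted into a service-map payload. The sketch below assumes the three-column format described above with no header row; the function name csv_to_service_map is hypothetical, and custom_keys are left out because the CSV format has no column for them.

```python
import csv
import io
import json


def csv_to_service_map(text: str) -> dict:
    """Convert 'service_id, tenant_id, product_id' lines into a service-map payload."""
    entries = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(row)}")
        service_id, tenant_id, product_id = (field.strip() for field in row)
        entries.append({
            "service_id": service_id,
            "tenant_id": tenant_id,
            "product_id": product_id,
        })
    return {"entries": entries}


csv_text = """payment-service, acme-corp, checkout
auth-service, acme-corp, platform
"""
payload = csv_to_service_map(csv_text)
print(json.dumps(payload, indent=2))
```

The resulting dictionary has the same shape as the body of the PUT /api/v1/pillars/service-map example, minus custom_keys.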

Manage Pillars tab

  • Toggle built-in pillars on or off
  • Create custom pillars with an inline form
  • Delete custom pillars (built-ins are protected)

Service Map

Services must be explicitly mapped before they contribute to pillar embeddings. A service with no mapping is silently skipped by the pillar query.

See the Service Pillar Map guide for bulk import and automation strategies.


Performance Notes

  • Pillar window processing runs in parallel goroutines alongside the service-level vectorizer, adding minimal latency to the pipeline
  • Each pillar query hits ClickHouse with an INNER JOIN on infrasage_service_pillar_map — keep the service map lean and indexed
  • For tenants with thousands of services, ClickHouse's columnar aggregation handles the GROUP BY efficiently; no additional tuning is required under 100 k events/min per window
  • Embeddings and scores are written with async_insert = 1 to batch ClickHouse writes

Troubleshooting

No anomaly scores appear for a pillar

  1. Check that the pillar is enabled: GET /api/v1/pillars/{name}/config should return "enabled": true
  2. Check that services are mapped: GET /api/v1/pillars/service-map — at least one entry must exist
  3. Verify the ClickHouse tables exist with the correct schema
  4. Check aiops-engine logs for errors containing the pillar name:
    kubectl logs -n infrasage -l app.kubernetes.io/component=aiops-engine | grep -i pillar

Custom pillar returns 400 on save

  • key_expr or a dim sql_expr contains a disallowed keyword (DROP, DELETE, etc.) or comment sequence
  • vector_table / anomaly_table does not match ^[a-zA-Z][a-zA-Z0-9_]{0,63}$
  • The tables do not yet exist in ClickHouse — create them first (see step 1 above)
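The first two causes can be caught locally before calling the API. The following pre-flight check is a sketch based only on the rules listed on this page; the server's actual validation may differ (for instance, matching keywords as whole words, case-insensitively, is an assumption), and the function names are hypothetical. It cannot detect the third cause, missing tables.

```python
import re

TABLE_NAME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9_]{0,63}$")
FORBIDDEN_KEYWORDS = ("DROP", "DELETE", "INSERT", "UPDATE", "ALTER", "CREATE", "TRUNCATE")
# Assumption: keywords are matched as whole words, ignoring case.
FORBIDDEN_RE = re.compile(r"\b(" + "|".join(FORBIDDEN_KEYWORDS) + r")\b", re.IGNORECASE)
COMMENT_SEQUENCES = ("--", "/*")


def check_sql(label: str, expr: str) -> list[str]:
    """Return a list of problems found in one SQL expression."""
    problems = []
    match = FORBIDDEN_RE.search(expr)
    if match:
        problems.append(f"{label}: disallowed keyword {match.group(1).upper()}")
    for seq in COMMENT_SEQUENCES:
        if seq in expr:
            problems.append(f"{label}: comment sequence {seq!r}")
    return problems


def check_pillar_config(config: dict) -> list[str]:
    """Check a pillar config dict against the documented 400 causes."""
    problems = check_sql("key_expr", config.get("key_expr", ""))
    for dim in config.get("dims", []):
        problems += check_sql(f"dim {dim.get('name')}", dim.get("sql_expr", ""))
    for field in ("vector_table", "anomaly_table"):
        if not TABLE_NAME_RE.fullmatch(config.get(field, "")):
            problems.append(f"{field}: does not match the table-name pattern")
    return problems


bad_config = {
    "key_expr": "mapElement(m.custom_keys, 'region') -- by region",
    "vector_table": "1region_embeddings",
    "anomaly_table": "infrasage_region_anomaly_scores",
    "dims": [{"name": "fatal_count", "sql_expr": "toFloat64(sum(ts.fatal_count))"}],
}
for problem in check_pillar_config(bad_config):
    print(problem)
```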