Log Anomaly Detection
InfraSage analyses raw log streams to surface three classes of log-based anomaly independently of any metric or trace signal:
| Type | What it catches | Lead time |
|---|---|---|
| Novel template | A new ERROR/FATAL log pattern with no 7-day history | At occurrence |
| Burst pattern | An existing template whose rate spikes above its 24-hour baseline | At occurrence |
| Semantic anomaly | Critical keywords or invalid state transitions in any log line | At occurrence |
All three fire independently of the metric Z-score watchdog, meaning a service can be anomaly-free on metrics but flagged on logs — or vice versa.
How It Works
Raw log events (ingestion gateway)
│
▼
Per-service rate limiter
(drops surplus writes to raw_firehose; template clustering unaffected)
│
▼
Drain template extraction
(log_body → template_id + pattern with <*> wildcards)
│
├────────────────────────────────┐
▼                                ▼
infrasage_log_templates          infrasage_raw_firehose
(aggregated counts, 30-day TTL)  (raw lines, 1-day TTL)
│
Every 60 s (watchdog cycle):
│
├── Novel template detection (ERROR/FATAL with no 7-day history)
├── Burst pattern detection (count > N× 24-hour average)
├── Heuristic semantic scan (critical keywords + state transitions)
│
▼
Claude semantic enrichment (async, 3 s timeout)
→ classify severity, summarise, score similarity vs recent templates
→ suppress if "benign" or similarity ≥ threshold
│
▼
Alert + evidence (patterns, sample lines, extracted fields, correlated services)
Template Extraction — Drain Algorithm
Log lines are clustered into templates using the Drain algorithm (IEEE ICWS 2017). Variable tokens (IDs, timestamps, values) are replaced with <*> wildcards, producing stable template signatures:
"User 42 logged in from 10.0.0.1" → "User <*> logged in from <*>"
"DB query took 234ms for user 99" → "DB query took <*> for user <*>"
Drain parameters: similarity threshold 0.5, max 10,000 clusters, depth 4. Each template is identified by a SHA-256 hash of its normalised token sequence.
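The masking and hashing steps can be illustrated with a much-simplified sketch. This is not the Drain algorithm itself: there is no fixed-depth parse tree and no similarity threshold. It assumes that any token containing a digit is variable and that the normalised token sequence is the space-joined token list. The names `mask_tokens` and `template_id` are hypothetical.

```python
import hashlib

WILDCARD = "<*>"


def mask_tokens(line: str) -> list[str]:
    """Replace every whitespace-separated token that contains a digit with <*>."""
    return [WILDCARD if any(c.isdigit() for c in tok) else tok for tok in line.split()]


def template_id(tokens: list[str]) -> str:
    """SHA-256 hex digest of the space-joined token sequence (assumed normalisation)."""
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


tokens = mask_tokens("User 42 logged in from 10.0.0.1")
print(" ".join(tokens))          # User <*> logged in from <*>
print(template_id(tokens)[:12])  # short prefix of the template ID
```

Because the hash is taken over the masked tokens, two lines that differ only in their variable parts map to the same template ID.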
Novel Template Detection
A template is novel when it:
- Appeared in the last 5 minutes
- Carries severity error, err, fatal, or critical
- Has no record in the 7-day baseline window
-- The detection query (simplified)
SELECT template_id, template_pattern, severity, sum(count) AS total_count
FROM infrasage_log_templates
WHERE service_id = ?
  AND window_start >= now() - INTERVAL 5 MINUTE
  AND severity IN ('error', 'err', 'fatal', 'critical')
  -- exclude templates already seen in the 7-day baseline window
  AND template_id NOT IN (
    SELECT template_id FROM infrasage_log_templates
    WHERE service_id = ?
      AND window_start >= now() - INTERVAL 7 DAY
      AND window_start < now() - INTERVAL 5 MINUTE
  )
GROUP BY template_id, template_pattern, severity
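The same rule can be expressed over in-memory rows, which may be easier to reason about than the SQL. The sketch below is illustrative only; the `TemplateWindow` record and `novel_templates` function are hypothetical and simply mirror the three conditions listed above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

ALERT_SEVERITIES = {"error", "err", "fatal", "critical"}


@dataclass(frozen=True)
class TemplateWindow:
    template_id: str
    severity: str
    window_start: datetime
    count: int


def novel_templates(rows, now, recent=timedelta(minutes=5), baseline=timedelta(days=7)):
    """Return IDs of error-level templates seen recently but absent from the baseline."""
    recent_cutoff = now - recent
    baseline_cutoff = now - baseline
    # Every template observed in the baseline window, regardless of severity.
    seen_before = {
        r.template_id for r in rows
        if baseline_cutoff <= r.window_start < recent_cutoff
    }
    return sorted({
        r.template_id for r in rows
        if r.window_start >= recent_cutoff
        and r.severity.lower() in ALERT_SEVERITIES
        and r.template_id not in seen_before
    })
```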
Grace Period
New services have no baseline history. To avoid a flood of false-positive alerts during initial deployment, novel-template detection is suppressed until the service has at least LOG_GRACE_PERIOD_MINUTES minutes of template history.
LOG_GRACE_PERIOD_MINUTES=60 # default
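A minimal sketch of the grace-period check, assuming the engine can look up the timestamp of the earliest template recorded for a service. The function name and the `first_seen` parameter are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def in_grace_period(first_seen: Optional[datetime], now: datetime,
                    grace_minutes: int = 60) -> bool:
    """True while the service has less than grace_minutes of template history."""
    if first_seen is None:  # no templates recorded yet
        return True
    return now - first_seen < timedelta(minutes=grace_minutes)
```

Novel-template alerts would be skipped while this returns True; burst and heuristic detection are not described as being affected.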
Burst Pattern Detection
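A template is bursting when its count in the last 5 minutes exceeds LOG_BURST_ZSCORE_THRESHOLD × its average rate over the preceding 24 hours. Despite the variable's name, the threshold is a plain multiplier of the baseline average, not a Z-score. This catches scenarios such as a known connection-refused template that jumps from 2/hour to 2,000/hour.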
LOG_BURST_ZSCORE_THRESHOLD=10.0 # default: 10× baseline average
Lowering this to 5.0 makes burst detection more sensitive; raising to 20.0 filters only extreme spikes.
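The comparison can be sketched as follows. This assumes the 24-hour baseline is normalised to the same 5-minute window length as the recent count, and that a template with no baseline is left to novel-template detection; both are assumptions, and the function names are hypothetical.

```python
WINDOWS_PER_DAY = 24 * 60 // 5  # 288 five-minute windows in 24 hours


def burst_ratio(recent_count: int, total_24h: int) -> float:
    """Ratio of the last 5-minute count to the average 5-minute count over 24 hours."""
    if total_24h <= 0:
        return 0.0  # assumption: no baseline means "not a burst"
    baseline_per_window = total_24h / WINDOWS_PER_DAY
    return recent_count / baseline_per_window


def is_bursting(recent_count: int, total_24h: int, threshold: float = 10.0) -> bool:
    return burst_ratio(recent_count, total_24h) > threshold


# 2/hour baseline is 48 events per day; 2,000/hour is roughly 167 events per 5 minutes.
print(is_bursting(167, 48))  # True
```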
Heuristic Semantic Detection
The existing semantic package required applications to emit event_category="semantic" explicitly — making it inactive for most services. Heuristic semantic detection removes this requirement.
Every watchdog cycle, InfraSage queries infrasage_raw_firehose for log lines matching:
Critical keywords (case-insensitive substring match):
DEGRADED, UNHEALTHY, CIRCUIT_OPEN, CIRCUIT_BREAKER_OPEN,
TIMEOUT_BUDGET_EXCEEDED, LATENCY_SLO_BREACH, OOM_KILLED,
CONNECTION_POOL_EXHAUSTED, QUOTA_EXCEEDED, FAILOVER_TRIGGERED
State transition pattern (ClickHouse match() regex):
(?i)(state|status|mode)\s+(changed?|transitioned?|moved?)\s+(?:from\s+\S+\s+)?to\s+\S+
When either fires, the anomaly is appended to the service's NovelLogPatterns and included in the alert evidence. No application code changes are required.
LOG_SEMANTIC_AUTO_DETECT=true # default; set false to disable
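To try the keyword list and regex against sample lines locally, a sketch along these lines may help. It uses Python's `re` module as a stand-in for ClickHouse `match()` (which uses RE2), so edge-case behaviour may differ; the `semantic_hits` function and its return format are hypothetical.

```python
import re

CRITICAL_KEYWORDS = (
    "DEGRADED", "UNHEALTHY", "CIRCUIT_OPEN", "CIRCUIT_BREAKER_OPEN",
    "TIMEOUT_BUDGET_EXCEEDED", "LATENCY_SLO_BREACH", "OOM_KILLED",
    "CONNECTION_POOL_EXHAUSTED", "QUOTA_EXCEEDED", "FAILOVER_TRIGGERED",
)

STATE_TRANSITION = re.compile(
    r"(?i)(state|status|mode)\s+(changed?|transitioned?|moved?)\s+(?:from\s+\S+\s+)?to\s+\S+"
)


def semantic_hits(line: str) -> list[str]:
    """Return the heuristic rules that a single log line triggers."""
    upper = line.upper()  # case-insensitive substring match
    hits = [f"keyword:{kw}" for kw in CRITICAL_KEYWORDS if kw in upper]
    if STATE_TRANSITION.search(line):
        hits.append("state_transition")
    return hits


print(semantic_hits("breaker state changed from CLOSED to OPEN"))  # ['state_transition']
```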
Claude Semantic Enrichment
When a novel template is detected, InfraSage calls Claude Haiku (asynchronously, 3 s timeout) to:
- Classify severity: benign, warning, or critical
- Summarise the pattern in one sentence for the alert evidence
- Score similarity 0–10 against the 10 most recent templates for this service
If the similarity score meets or exceeds LOG_SEMANTIC_SUPPRESS_THRESHOLD, the alert is suppressed as a semantic duplicate of a pattern the team has already seen. If Claude classifies the pattern as benign, the alert is suppressed entirely.
Detection never blocks on enrichment — if the Claude call times out or fails, the alert fires with structural evidence only.
LOG_SEMANTIC_ENRICH=true # default; set false to disable
LOG_SEMANTIC_SUPPRESS_THRESHOLD=7 # 0–10; ≥7 → suppress as duplicate
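The suppression rules and the fail-open behaviour can be sketched as below. The actual Claude call is not shown; `call` stands for any coroutine that returns an enrichment result, and the `Enrichment` type, `enrich_with_timeout`, and `should_alert` are hypothetical names used only for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class Enrichment:
    severity_class: str    # "benign", "warning", or "critical"
    summary: str
    similarity_score: int  # 0-10


async def enrich_with_timeout(call: Callable[[], Awaitable[Enrichment]],
                              timeout_s: float = 3.0) -> Optional[Enrichment]:
    """Run the enrichment call; return None on timeout or any failure."""
    try:
        return await asyncio.wait_for(call(), timeout=timeout_s)
    except Exception:  # includes asyncio.TimeoutError
        return None


def should_alert(enrichment: Optional[Enrichment], suppress_threshold: int = 7) -> bool:
    if enrichment is None:
        return True   # enrichment unavailable: fire with structural evidence only
    if enrichment.severity_class == "benign":
        return False  # suppressed entirely
    return enrichment.similarity_score < suppress_threshold  # >= threshold is a duplicate
```

Treating a failed or slow enrichment as "no information" keeps detection independent of the external API.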
Enrichment results are persisted to infrasage_log_template_semantics (30-day TTL):
SELECT service_id, template_id, severity_class, summary, similarity_score
FROM infrasage_log_template_semantics
WHERE service_id = 'payment-api'
ORDER BY created_at DESC
LIMIT 10;
:::tip Cost
Claude Haiku is billed at ~$0.25/M input tokens. At 5 novel templates/minute across the fleet, daily cost is under $0.50. Enrichment fires only on novel templates, not on every log line.
:::
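To adapt the estimate to a different fleet, the arithmetic is straightforward. The sketch below covers input tokens only, and the figure of 250 input tokens per call is an assumption chosen for illustration, not a measured value.

```python
def daily_input_cost_usd(templates_per_minute: float, input_tokens_per_call: int,
                         usd_per_million_tokens: float = 0.25) -> float:
    """Estimated daily input-token cost of enrichment (output tokens not included)."""
    calls_per_day = templates_per_minute * 60 * 24
    return calls_per_day * input_tokens_per_call / 1_000_000 * usd_per_million_tokens


# 5 templates/minute is 7,200 calls/day; at an assumed 250 input tokens per call
# that is 1.8M tokens/day.
print(daily_input_cost_usd(5, 250))
```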
Log Drill-Down in Alerts
Previously, alert evidence for a novel or burst pattern contained only the template string. Now, up to 3 raw log lines matching the pattern are fetched from infrasage_raw_firehose and included directly in the alert evidence.
Example alert evidence:
Novel log pattern: a1b2c3 [fatal]: <*> failed to connect to <*>:5432 after <*> retries
Sample log 1: payment-worker failed to connect to db-primary:5432 after 5 retries
Sample log 2: payment-worker failed to connect to db-primary:5432 after 5 retries
Sample log 3: payment-api failed to connect to db-replica:5432 after 5 retries
Extracted fields: http_status=500, request_id=req-abc123, user_id=user-789
Simultaneous log burst in: auth-service, order-processor
Automatic Field Extraction
InfraSage extracts the following structured fields from raw log lines using pre-compiled regexes:
| Field | Pattern matched |
|---|---|
| request_id | request_id=, requestId:, etc. |
| user_id | user_id=, userId:, etc. |
| tenant_id | tenant_id=, tenantId:, etc. |
| error_code | error_code=, errorCode:, etc. |
| http_status | Any 3-digit HTTP status code |
| trace_id | trace_id=, traceId:, 16–36 hex chars |
| span_id | span_id=, spanId:, 8–36 hex chars |
| duration_ms | duration=, latency=, elapsed= with a numeric value |
Extracted fields appear in the alert evidence under "Extracted fields:" and help identify the blast radius (which users, tenants, or requests were affected).
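A rough sketch of this kind of extraction is shown below. The exact regexes InfraSage uses are not given here, so every pattern in `FIELD_PATTERNS` is an assumption based on the key spellings in the table, and `extract_fields` is a hypothetical name.

```python
import re

# Assumed patterns; only the key spellings come from the table above.
_SEP = r"\s*[=:]\s*"
FIELD_PATTERNS = {
    "request_id": re.compile(r"request_?id" + _SEP + r"([\w-]+)", re.I),
    "user_id": re.compile(r"user_?id" + _SEP + r"([\w-]+)", re.I),
    "tenant_id": re.compile(r"tenant_?id" + _SEP + r"([\w-]+)", re.I),
    "error_code": re.compile(r"error_?code" + _SEP + r"([\w-]+)", re.I),
    "http_status": re.compile(r"\b([1-5]\d{2})\b"),  # any standalone 100-599 number
    "trace_id": re.compile(r"trace_?id" + _SEP + r"([0-9a-f]{16,36})", re.I),
    "span_id": re.compile(r"span_?id" + _SEP + r"([0-9a-f]{8,36})", re.I),
    "duration_ms": re.compile(r"(?:duration|latency|elapsed)" + _SEP + r"(\d+(?:\.\d+)?)", re.I),
}


def extract_fields(line: str) -> dict[str, str]:
    """Return the first match of each known field found in a raw log line."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = pattern.search(line)
        if m:
            fields[name] = m.group(1)
    return fields


print(extract_fields("POST /pay failed status 500 request_id=req-abc123 userId: user-789"))
```

Compiling the patterns once at import time mirrors the "pre-compiled regexes" described above. Note that a bare 3-digit pattern for http_status will also match unrelated numbers in that range.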
Cross-Service Log Correlation
When a novel or burst pattern fires for service A, InfraSage queries infrasage_log_templates for other services that had a simultaneous log burst (>5× their 24-hour average) in the same ±10 minute window. Correlated services are listed in the alert evidence as "Simultaneous log burst in: …".
This surfaces cascade scenarios — for example, a database connection pool exhaustion in db-proxy causing simultaneous burst patterns in payment-api, order-service, and auth-service.
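The correlation rule can be sketched over precomputed burst observations. The `ServiceBurst` record and `correlated_services` function are hypothetical; the real lookup is a query against infrasage_log_templates.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ServiceBurst:
    service_id: str
    at: datetime   # time of the service's burst window
    ratio: float   # recent rate divided by the 24-hour average rate


def correlated_services(trigger_service, trigger_time, bursts,
                        min_ratio=5.0, window=timedelta(minutes=10)):
    """Other services bursting above min_ratio within +/- window of the trigger."""
    return sorted({
        b.service_id for b in bursts
        if b.service_id != trigger_service
        and b.ratio > min_ratio
        and abs(b.at - trigger_time) <= window
    })
```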
Log Rate Limiting
A chatty service logging at 100k events/s can exhaust the ClickHouse raw_firehose disk before the 1-day TTL removes old rows. InfraSage applies a per-service token-bucket rate limiter at the ingestion gateway:
LOG_MAX_RAW_PER_SECOND=1000 # per-service cap on raw_firehose writes
When the limit is exceeded, the write is silently dropped (the log_writes_dropped_total Prometheus counter increments). Template clustering is unaffected — it runs on the full incoming stream before rate limiting is applied.
Set to 0 to disable rate limiting entirely.
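For readers unfamiliar with token buckets, a minimal sketch follows. It assumes a burst capacity equal to one second of tokens and uses a plain counter in place of the Prometheus metric; the gateway's actual implementation is not shown here, and the class is hypothetical.

```python
import time
from typing import Optional


class TokenBucket:
    """Per-service bucket: refills at `rate` tokens/s and holds at most `rate` tokens."""

    def __init__(self, rate: float, now: Optional[float] = None):
        self.rate = rate
        self.tokens = rate  # start full (assumed capacity: one second of writes)
        self.last = time.monotonic() if now is None else now
        self.dropped = 0    # stand-in for log_writes_dropped_total

    def allow(self, now: Optional[float] = None) -> bool:
        if self.rate <= 0:  # 0 disables rate limiting
            return True
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        self.dropped += 1
        return False
```

One bucket would be kept per service_id, and a raw write is persisted only when `allow()` returns True.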
Template Lifecycle
Left unpruned, infrasage_log_templates would grow indefinitely. InfraSage runs a daily pruning job at 02:00 UTC that:
- Deletes templates older than 30 days (window_start < now() - 30 days)
- Deletes orphan templates for services that have produced no telemetry in the past 7 days
This prevents stale templates from decommissioned services from polluting burst baselines and Drain similarity calculations.
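The two pruning rules reduce to a simple predicate per template row. The sketch below is illustrative; `should_prune` and its `service_last_seen` parameter are hypothetical, and the real job runs as deletes against ClickHouse.

```python
from datetime import datetime, timedelta


def should_prune(window_start: datetime, service_last_seen: datetime, now: datetime,
                 max_age: timedelta = timedelta(days=30),
                 orphan_after: timedelta = timedelta(days=7)) -> bool:
    """True if the row is older than max_age or its service has gone silent."""
    too_old = window_start < now - max_age
    orphaned = service_last_seen < now - orphan_after
    return too_old or orphaned
```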
Alert Evidence Schema
Every log-based anomaly produces the following evidence fields in the infrasage_alerts record:
Novel log pattern: <templateID> [<severity>]: <pattern>
Burst log pattern: <templateID> [<severity>]: <pattern> (<N>x baseline)
Sample log 1: <raw log line>
Sample log 2: <raw log line>
Sample log 3: <raw log line>
Extracted fields: <key>=<value>, <key>=<value>
Simultaneous log burst in: <service1>, <service2>
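A small formatter shows how these lines fit together. It is a sketch of the layout above, not the engine's code; `format_evidence` and its parameters are hypothetical.

```python
def format_evidence(kind, template_id, severity, pattern, samples=(),
                    fields=None, correlated=(), burst_ratio=None):
    """Build evidence lines for a 'Novel' or 'Burst' log pattern."""
    if kind not in ("Novel", "Burst"):
        raise ValueError("kind must be 'Novel' or 'Burst'")
    head = f"{kind} log pattern: {template_id} [{severity}]: {pattern}"
    if kind == "Burst" and burst_ratio is not None:
        head += f" ({burst_ratio:g}x baseline)"
    lines = [head]
    # At most three raw sample lines are included.
    lines += [f"Sample log {i}: {s}" for i, s in enumerate(samples[:3], start=1)]
    if fields:
        lines.append("Extracted fields: " + ", ".join(f"{k}={v}" for k, v in fields.items()))
    if correlated:
        lines.append("Simultaneous log burst in: " + ", ".join(correlated))
    return lines


for line in format_evidence("Novel", "a1b2c3", "fatal",
                            "<*> failed to connect to <*>:5432 after <*> retries",
                            samples=["payment-worker failed to connect to db-primary:5432 after 5 retries"],
                            correlated=["auth-service", "order-processor"]):
    print(line)
```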
Configuration Reference
All variables apply to the AIops Engine service, except LOG_MAX_RAW_PER_SECOND, which applies to the Ingestion Gateway.
| Variable | Default | Description |
|---|---|---|
| LOG_GRACE_PERIOD_MINUTES | 60 | Minutes of template history required before novel-template alerts fire. Prevents false-positive floods on new deployments. |
| LOG_BURST_ZSCORE_THRESHOLD | 10.0 | Multiplier above 24-hour average before a template is flagged as bursting. Lower = more sensitive. |
| LOG_SEMANTIC_AUTO_DETECT | true | Enable heuristic keyword + state-transition detection from raw log lines. Requires ANTHROPIC_API_KEY to be set for the AIops Engine. |
| LOG_SEMANTIC_ENRICH | true | Enable Claude Haiku enrichment of novel templates (severity classification, summary, duplicate detection). Requires ANTHROPIC_API_KEY. |
| LOG_SEMANTIC_SUPPRESS_THRESHOLD | 7 | Similarity score (0–10) at or above which a novel template is suppressed as a semantic duplicate of a recent pattern. |
| LOG_MAX_RAW_PER_SECOND | 1000 | Per-service cap on raw_firehose writes at the Ingestion Gateway. Set to 0 to disable. |
ClickHouse Tables
| Table | TTL | Contents |
|---|---|---|
| infrasage_log_templates | 30 days (pruned daily) | Template ID, pattern, severity, window counts per service |
| infrasage_raw_firehose | 1 day | Full raw log lines per service; source for drill-down samples |
| infrasage_log_template_semantics | 30 days | Claude enrichment results: severity class, summary, similarity score |
Related
- Anomaly Detection — parallel metric-based watchdog
- Causal Anomaly Detection — pre-fault structural detection
- Root Cause Analysis — triggered by confirmed incidents from any detection system
- Telemetry Quality Scoring — data quality system that surfaces log coverage gaps