Log Anomaly Detection

InfraSage analyses raw log streams to surface three classes of log-based anomaly independently of any metric or trace signal:

Type	What it catches	Lead time
Novel template	A new ERROR/FATAL log pattern with no 7-day history	At occurrence
Burst pattern	An existing template whose rate spikes above its 24-hour baseline	At occurrence
Semantic anomaly	Critical keywords or invalid state transitions in any log line	At occurrence

All three fire independently of the metric Z-score watchdog, meaning a service can be anomaly-free on metrics but flagged on logs — or vice versa.

How It Works

Raw log events (ingestion gateway)
           │
           ▼
  Drain template extraction
  (log_body → template_id + pattern with <*> wildcards)
           │
           ├──────────────────────────────┐
           ▼                              ▼
  infrasage_log_templates         Per-service rate limiter
  (aggregated counts, 30-day TTL)  (drops surplus raw-line writes)
           │                              │
           │                              ▼
           │                      infrasage_raw_firehose
           │                      (raw lines, 1-day TTL)
           │
  Every 60 s (watchdog cycle):
           │
           ├── Novel template detection  (ERROR/FATAL with no 7-day history)
           ├── Burst pattern detection   (count > N× 24-hour average)
           ├── Heuristic semantic scan   (critical keywords + state transitions)
           │
           ▼
  Claude semantic enrichment (async, 3 s timeout)
  → classify severity, summarise, score similarity vs recent templates
  → suppress if "benign" or similarity ≥ threshold
           │
           ▼
  Alert + evidence (patterns, sample lines, extracted fields, correlated services)

Template Extraction — Drain Algorithm

Log lines are clustered into templates using the Drain algorithm (IEEE ICWS 2017). Variable tokens (IDs, timestamps, values) are replaced with <*> wildcards, producing stable template signatures:

"User 42 logged in from 10.0.0.1"   → "User <*> logged in from <*>"
"DB query took 234ms for user 99"   → "DB query took <*> for user <*>"

Drain parameters: similarity threshold 0.5, max 10,000 clusters, depth 4. Each template is identified by a SHA-256 hash of its normalised token sequence.
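
As a rough illustration of the matching step, the sketch below compares a tokenised log line against an existing template position by position, merges the two when the similarity reaches the 0.5 threshold, and derives an ID by hashing the result. It is not the full Drain algorithm (the fixed-depth parse tree and the cluster cap are omitted), and the helper names and the space-joined hash input are assumptions made for illustration.

```python
import hashlib

WILDCARD = "<*>"
SIM_THRESHOLD = 0.5  # Drain similarity threshold quoted above


def similarity(template: list[str], tokens: list[str]) -> float:
    """Fraction of positions where the line's token equals the template's token."""
    if not template or len(template) != len(tokens):
        return 0.0
    same = sum(1 for t, x in zip(template, tokens) if t == x)
    return same / len(template)


def merge(template: list[str], tokens: list[str]) -> list[str]:
    """Keep tokens that agree; replace differing positions with the wildcard."""
    return [t if t == x else WILDCARD for t, x in zip(template, tokens)]


def template_id(template: list[str]) -> str:
    """SHA-256 over the space-joined tokens (the join format is an assumption)."""
    return hashlib.sha256(" ".join(template).encode("utf-8")).hexdigest()


first = "User 42 logged in from 10.0.0.1".split()
second = "User 99 logged in from 10.0.0.7".split()

merged = first
if similarity(first, second) >= SIM_THRESHOLD:  # 4 of 6 tokens agree
    merged = merge(first, second)

print(" ".join(merged))  # User <*> logged in from <*>
```

Lines with a different token count score 0.0 here, so they always start a new template.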

Novel Template Detection

A template is novel when it:

Appeared in the last 5 minutes
Carries severity error, err, fatal, or critical
Has no record in the 7-day baseline window

-- The detection query (simplified)
SELECT template_id, template_pattern, severity, sum(count) AS total_count
FROM infrasage_log_templates
WHERE service_id = ?
  AND window_start >= now() - INTERVAL 5 MINUTE
  AND severity IN ('error', 'err', 'fatal', 'critical')
  -- Exclude templates already present in the 7-day baseline window
  AND template_id NOT IN (
    SELECT template_id
    FROM infrasage_log_templates
    WHERE service_id = ?
      AND window_start >= now() - INTERVAL 7 DAY
      AND window_start < now() - INTERVAL 5 MINUTE
  )
GROUP BY template_id, template_pattern, severity

Grace Period

New services have no baseline history. To avoid a flood of false-positive alerts during initial deployment, novel-template detection is suppressed until the service has at least LOG_GRACE_PERIOD_MINUTES minutes of template history.

LOG_GRACE_PERIOD_MINUTES=60   # default
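
The three novelty conditions and the grace period can be combined into one decision, sketched below. The function name, its parameters, and the idea of tracking the timestamp of the service's earliest template are assumptions for illustration; the source does not describe how the history length is measured.

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(minutes=60)  # LOG_GRACE_PERIOD_MINUTES default
ALERT_SEVERITIES = {"error", "err", "fatal", "critical"}


def should_alert_novel(template_id, severity, baseline_ids, first_template_seen, now):
    """Decide whether a template seen in the last 5 minutes should raise a novel-template alert.

    baseline_ids: template IDs present in the 7-day baseline window.
    first_template_seen: timestamp of the service's earliest template (hypothetical input).
    """
    if now - first_template_seen < GRACE_PERIOD:
        return False  # service is still inside its grace period
    if severity.lower() not in ALERT_SEVERITIES:
        return False  # only error-class severities qualify
    return template_id not in baseline_ids


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(should_alert_novel("a1b2c3", "fatal", {"ffee00"}, now - timedelta(hours=3), now))
```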

Burst Pattern Detection

A template is bursting when its count in the last 5 minutes exceeds LOG_BURST_ZSCORE_THRESHOLD × its average rate over the preceding 24 hours. Despite its name, the variable is a plain multiplier, not a Z-score. This catches scenarios like a known connection-refused template that goes from 2/hour to 2,000/hour.

LOG_BURST_ZSCORE_THRESHOLD=10.0   # default: 10× baseline average

Lowering this to 5.0 makes burst detection more sensitive; raising it to 20.0 flags only extreme spikes.
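
A minimal sketch of the comparison follows. It assumes the 24-hour baseline is normalised to the same 5-minute window before the multiplier is applied, which the source does not state explicitly, and it assumes a template with no baseline is left to novel-template detection. Function names are illustrative.

```python
WINDOWS_PER_DAY = 24 * 60 // 5  # 288 five-minute windows in 24 hours


def burst_ratio(recent_5m_count, baseline_24h_count):
    """Recent 5-minute count divided by the average 5-minute count over 24 hours."""
    if baseline_24h_count <= 0:
        return None  # no baseline: assumed to be handled as a novel template instead
    avg_per_window = baseline_24h_count / WINDOWS_PER_DAY
    return recent_5m_count / avg_per_window


def is_bursting(recent_5m_count, baseline_24h_count, threshold=10.0):
    """True when the ratio exceeds the multiplier (LOG_BURST_ZSCORE_THRESHOLD)."""
    ratio = burst_ratio(recent_5m_count, baseline_24h_count)
    return ratio is not None and ratio > threshold


# 2/hour baseline is 48 events in 24 hours; 2,000/hour is about 167 events per 5 minutes.
print(is_bursting(167, 48))
```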

Heuristic Semantic Detection

The existing semantic package required applications to emit event_category="semantic" explicitly — making it inactive for most services. Heuristic semantic detection removes this requirement.

Every watchdog cycle, InfraSage queries infrasage_raw_firehose for log lines matching:

Critical keywords (case-insensitive substring match):

DEGRADED, UNHEALTHY, CIRCUIT_OPEN, CIRCUIT_BREAKER_OPEN,
TIMEOUT_BUDGET_EXCEEDED, LATENCY_SLO_BREACH, OOM_KILLED,
CONNECTION_POOL_EXHAUSTED, QUOTA_EXCEEDED, FAILOVER_TRIGGERED

State transition pattern (ClickHouse match() regex):

(?i)(state|status|mode)\s+(changed?|transitioned?|moved?)\s+(?:from\s+\S+\s+)?to\s+\S+

When either fires, the anomaly is appended to the service's NovelLogPatterns and included in the alert evidence. No application code changes are required.

LOG_SEMANTIC_AUTO_DETECT=true   # default; set false to disable
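
To see what the two heuristics match, the keyword list and the state-transition regex above can be tried locally. The sketch uses Python's re module rather than ClickHouse match(), so edge-case behaviour may differ slightly; scan_line and its return format are illustrative names, not part of InfraSage.

```python
import re

CRITICAL_KEYWORDS = (
    "DEGRADED", "UNHEALTHY", "CIRCUIT_OPEN", "CIRCUIT_BREAKER_OPEN",
    "TIMEOUT_BUDGET_EXCEEDED", "LATENCY_SLO_BREACH", "OOM_KILLED",
    "CONNECTION_POOL_EXHAUSTED", "QUOTA_EXCEEDED", "FAILOVER_TRIGGERED",
)

STATE_TRANSITION = re.compile(
    r"(?i)(state|status|mode)\s+(changed?|transitioned?|moved?)"
    r"\s+(?:from\s+\S+\s+)?to\s+\S+"
)


def scan_line(line: str) -> list[str]:
    """Return the heuristics that fire for one raw log line."""
    upper = line.upper()  # case-insensitive substring match
    hits = [f"keyword:{kw}" for kw in CRITICAL_KEYWORDS if kw in upper]
    if STATE_TRANSITION.search(line):
        hits.append("state_transition")
    return hits


print(scan_line("circuit state changed from CLOSED to OPEN"))
print(scan_line("upstream pool marked unhealthy"))
```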

Claude Semantic Enrichment

When a novel template is detected, InfraSage calls Claude Haiku (asynchronously, 3 s timeout) to:

Classify severity — benign, warning, or critical
Summarise the pattern in one sentence for the alert evidence
Score similarity 0–10 against the 10 most recent templates for this service

If the similarity score meets or exceeds LOG_SEMANTIC_SUPPRESS_THRESHOLD, the alert is suppressed as a semantic duplicate of a pattern the team has already seen. If Claude classifies the pattern as benign, the alert is suppressed entirely.

Detection never blocks on enrichment — if the Claude call times out or fails, the alert fires with structural evidence only.

LOG_SEMANTIC_ENRICH=true               # default; set false to disable
LOG_SEMANTIC_SUPPRESS_THRESHOLD=7      # 0–10; ≥7 → suppress as duplicate
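
The suppression rule and the fail-open timeout can be sketched as follows. The Enrichment type, the function names, and the use of asyncio are assumptions for illustration; the actual Claude call is represented by any awaitable passed in.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Optional


@dataclass
class Enrichment:
    severity_class: str    # "benign", "warning", or "critical"
    summary: str
    similarity_score: int  # 0-10


def should_suppress(result: Optional[Enrichment], threshold: int = 7) -> bool:
    """Suppress only when enrichment succeeded and says benign or duplicate."""
    if result is None:
        return False  # enrichment failed or timed out: the alert still fires
    return result.severity_class == "benign" or result.similarity_score >= threshold


async def enrich_with_timeout(call: Awaitable[Enrichment],
                              timeout_s: float = 3.0) -> Optional[Enrichment]:
    """Wait for the enrichment call; return None on timeout or any error."""
    try:
        return await asyncio.wait_for(call, timeout=timeout_s)
    except Exception:
        return None
```

Returning None instead of raising keeps the detection path independent of the enrichment call, matching the fail-open behaviour described above.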

Enrichment results are persisted to infrasage_log_template_semantics (30-day TTL):

SELECT service_id, template_id, severity_class, summary, similarity_score
FROM infrasage_log_template_semantics
WHERE service_id = 'payment-api'
ORDER BY created_at DESC
LIMIT 10;

:::tip Cost
Claude Haiku is billed at ~$0.25/M input tokens. At 5 novel templates/minute across the fleet, daily cost is under $0.50. Enrichment fires only on novel templates, not on every log line.
:::
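
The arithmetic behind that estimate can be checked directly. The sketch below uses only the figures quoted in the tip and counts input tokens only; output-token charges are ignored, so treat the result as an approximate upper bound on prompt size rather than a billing forecast.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 0.25  # USD, figure quoted in the tip
NOVEL_TEMPLATES_PER_MINUTE = 5         # fleet-wide rate quoted in the tip
DAILY_BUDGET_USD = 0.50

calls_per_day = NOVEL_TEMPLATES_PER_MINUTE * 60 * 24
budget_tokens = DAILY_BUDGET_USD / PRICE_PER_MILLION_INPUT_TOKENS * 1_000_000
max_input_tokens_per_call = budget_tokens / calls_per_day

print(calls_per_day)                     # 7200
print(round(max_input_tokens_per_call))  # 278
```

In other words, the $0.50/day figure holds as long as the average enrichment prompt stays under roughly 278 input tokens.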

Log Drill-Down in Alerts

Previously, alert evidence for a novel or burst pattern contained only the template string. Now, up to 3 raw log lines matching the pattern are fetched from infrasage_raw_firehose and included directly in the alert evidence.

Example alert evidence:

Novel log pattern: a1b2c3 [fatal]: <*> failed to connect to <*>:5432 after <*> retries
Sample log 1: payment-worker failed to connect to db-primary:5432 after 5 retries
Sample log 2: payment-worker failed to connect to db-primary:5432 after 5 retries
Sample log 3: payment-api failed to connect to db-replica:5432 after 5 retries
Extracted fields: http_status=500, request_id=req-abc123, user_id=user-789
Simultaneous log burst in: auth-service, order-processor

Automatic Field Extraction

InfraSage extracts the following structured fields from raw log lines using pre-compiled regexes:

Field	Pattern matched
`request_id`	`request_id=`, `requestId:`, etc.
`user_id`	`user_id=`, `userId:`, etc.
`tenant_id`	`tenant_id=`, `tenantId:`, etc.
`error_code`	`error_code=`, `errorCode:`, etc.
`http_status`	Any 3-digit HTTP status code
`trace_id`	`trace_id=`, `traceId:`, 16–36 hex chars
`span_id`	`span_id=`, `spanId:`, 8–36 hex chars
`duration_ms`	`duration=`, `latency=`, `elapsed=` with a numeric value

Extracted fields appear in the alert evidence under "Extracted fields:" and help identify the blast radius (which users, tenants, or requests were affected).
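
The regexes themselves are not listed in this page, so the patterns below are hypothetical approximations of the table above, written to show the general approach: compile once, search each sample line, keep the first match per field. The bare 3-digit `http_status` pattern in particular can produce false positives on unrelated numbers.

```python
import re

# Hypothetical patterns approximating the table above; compiled once at import time.
FIELD_PATTERNS = {
    "request_id": re.compile(r"\brequest_?id\s*[=:]\s*([\w-]+)", re.IGNORECASE),
    "user_id": re.compile(r"\buser_?id\s*[=:]\s*([\w-]+)", re.IGNORECASE),
    "tenant_id": re.compile(r"\btenant_?id\s*[=:]\s*([\w-]+)", re.IGNORECASE),
    "error_code": re.compile(r"\berror_?code\s*[=:]\s*([\w-]+)", re.IGNORECASE),
    "http_status": re.compile(r"\b([1-5]\d{2})\b"),
    "trace_id": re.compile(r"\btrace_?id\s*[=:]\s*([0-9a-f-]{16,36})\b", re.IGNORECASE),
    "span_id": re.compile(r"\bspan_?id\s*[=:]\s*([0-9a-f-]{8,36})\b", re.IGNORECASE),
    "duration_ms": re.compile(
        r"\b(?:duration|latency|elapsed)\s*[=:]\s*(\d+(?:\.\d+)?)", re.IGNORECASE
    ),
}


def extract_fields(line: str) -> dict[str, str]:
    """Return the first match for each field found in a raw log line."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(line)
        if match:
            found[name] = match.group(1)
    return found


print(extract_fields("request_id=req-abc123 userId: user-789 http_status=500 latency=234ms"))
```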

Cross-Service Log Correlation

When a novel or burst pattern fires for service A, InfraSage queries infrasage_log_templates for other services that had a simultaneous log burst (>5× their 24-hour average) in the same ±10 minute window. Correlated services are listed in the alert evidence as "Simultaneous log burst in: …".

This surfaces cascade scenarios — for example, a database connection pool exhaustion in db-proxy causing simultaneous burst patterns in payment-api, order-service, and auth-service.
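
Once per-service burst ratios for the ±10 minute window are available, the correlation step reduces to a filter. The sketch below assumes those ratios have already been computed (for example from infrasage_log_templates); the function name and input shape are illustrative.

```python
CORRELATION_MIN_RATIO = 5.0  # ">5x their 24-hour average"


def correlated_services(trigger_service, burst_ratios, min_ratio=CORRELATION_MIN_RATIO):
    """Other services whose burst ratio in the same window exceeds min_ratio."""
    return sorted(
        service
        for service, ratio in burst_ratios.items()
        if service != trigger_service and ratio > min_ratio
    )


ratios = {"db-proxy": 40.0, "payment-api": 12.5, "auth-service": 6.1, "search": 1.2}
others = correlated_services("db-proxy", ratios)
print("Simultaneous log burst in: " + ", ".join(others))
```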

Log Rate Limiting

A chatty service logging at 100k events/s can fill the disk backing infrasage_raw_firehose before the 1-day TTL removes old rows. InfraSage applies a per-service token-bucket rate limiter at the ingestion gateway:

LOG_MAX_RAW_PER_SECOND=1000   # per-service cap on raw_firehose writes

When the limit is exceeded, the write is silently dropped (the log_writes_dropped_total Prometheus counter increments). Template clustering is unaffected — it runs on the full incoming stream before rate limiting is applied.

Set to 0 to disable rate limiting entirely.
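
For readers unfamiliar with token buckets, here is a minimal sketch of the mechanism. It assumes the bucket capacity equals the per-second rate and takes the clock as an argument so the behaviour is easy to test; the class and attribute names are illustrative, and `dropped` stands in for the log_writes_dropped_total counter.

```python
class TokenBucket:
    """Per-service limiter: one token per raw_firehose write, refilled continuously."""

    def __init__(self, rate_per_second: float, now: float = 0.0):
        self.rate = rate_per_second
        self.capacity = rate_per_second  # assumption: burst size equals the rate
        self.tokens = rate_per_second
        self.last = now
        self.dropped = 0  # stand-in for log_writes_dropped_total

    def allow(self, now: float) -> bool:
        if self.rate <= 0:
            return True  # 0 disables rate limiting
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.dropped += 1
        return False


bucket = TokenBucket(rate_per_second=3)
print([bucket.allow(now=0.0) for _ in range(4)])  # [True, True, True, False]
print(bucket.allow(now=1.0))                      # True
```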

Template Lifecycle

Left unpruned, infrasage_log_templates would grow indefinitely. InfraSage therefore runs a daily pruning job at 02:00 UTC that:

Deletes templates older than 30 days (window_start < now() - INTERVAL 30 DAY)
Deletes orphan templates for services that have produced no telemetry in the past 7 days

This prevents stale templates from decommissioned services from polluting burst baselines and Drain similarity calculations.
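
The two pruning rules amount to a simple predicate per template row. The sketch below assumes the job knows each service's last telemetry timestamp; how InfraSage actually tracks that, and how the deletes are issued, is not described here.

```python
from datetime import datetime, timedelta, timezone

TEMPLATE_MAX_AGE = timedelta(days=30)
ORPHAN_AFTER = timedelta(days=7)


def should_prune(window_start, service_last_seen, now):
    """True when a template row is expired or belongs to a silent service."""
    too_old = window_start < now - TEMPLATE_MAX_AGE
    orphaned = service_last_seen < now - ORPHAN_AFTER
    return too_old or orphaned


now = datetime(2024, 3, 1, 2, 0, tzinfo=timezone.utc)
print(should_prune(now - timedelta(days=31), now, now))                       # True
print(should_prune(now - timedelta(days=2), now - timedelta(days=8), now))    # True
print(should_prune(now - timedelta(days=2), now - timedelta(hours=1), now))   # False
```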

Alert Evidence Schema

Every log-based anomaly produces the following evidence fields in the infrasage_alerts record:

Novel log pattern: <templateID> [<severity>]: <pattern>
Burst log pattern: <templateID> [<severity>]: <pattern> (<N>x baseline)
Sample log 1: <raw log line>
Sample log 2: <raw log line>
Sample log 3: <raw log line>
Extracted fields: <key>=<value>, <key>=<value>
Simultaneous log burst in: <service1>, <service2>
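
A small formatter makes the schema concrete. This is an illustrative sketch only: the function name and argument shapes are assumptions, and the line layout is copied from the schema above.

```python
def build_evidence(kind, template_id, severity, pattern,
                   samples=(), fields=None, correlated=(), ratio=None):
    """Render evidence lines; kind is "Novel" or "Burst"."""
    head = f"{kind} log pattern: {template_id} [{severity}]: {pattern}"
    if kind == "Burst" and ratio is not None:
        head += f" ({ratio:g}x baseline)"
    lines = [head]
    for i, sample in enumerate(list(samples)[:3], start=1):  # at most 3 raw lines
        lines.append(f"Sample log {i}: {sample}")
    if fields:
        pairs = ", ".join(f"{k}={v}" for k, v in fields.items())
        lines.append(f"Extracted fields: {pairs}")
    if correlated:
        lines.append("Simultaneous log burst in: " + ", ".join(correlated))
    return lines


evidence = build_evidence(
    "Novel", "a1b2c3", "fatal", "<*> failed to connect to <*>:5432 after <*> retries",
    samples=["payment-worker failed to connect to db-primary:5432 after 5 retries"],
    fields={"http_status": "500", "request_id": "req-abc123"},
    correlated=["auth-service", "order-processor"],
)
print("\n".join(evidence))
```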

Configuration Reference

All variables apply to the AIops Engine service, except LOG_MAX_RAW_PER_SECOND, which applies to the Ingestion Gateway.

Variable	Default	Description
`LOG_GRACE_PERIOD_MINUTES`	`60`	Minutes of template history required before novel-template alerts fire. Prevents false-positive floods on new deployments.
`LOG_BURST_ZSCORE_THRESHOLD`	`10.0`	Multiplier above 24-hour average before a template is flagged as bursting. Lower = more sensitive.
`LOG_SEMANTIC_AUTO_DETECT`	`true`	Enable heuristic keyword + state-transition detection from raw log lines. Requires `ANTHROPIC_API_KEY` to be set for the AIops Engine.
`LOG_SEMANTIC_ENRICH`	`true`	Enable Claude Haiku enrichment of novel templates (severity classification, summary, duplicate detection). Requires `ANTHROPIC_API_KEY`.
`LOG_SEMANTIC_SUPPRESS_THRESHOLD`	`7`	Similarity score (0–10) at or above which a novel template is suppressed as a semantic duplicate of a recent pattern.
`LOG_MAX_RAW_PER_SECOND`	`1000`	Per-service cap on raw_firehose writes at the Ingestion Gateway. Set to `0` to disable.
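
For reference, the variables and defaults above could be read from the environment along these lines. This is a sketch, not InfraSage's loader: the dataclass, the boolean parsing rule, and grouping all six variables in one structure (even though LOG_MAX_RAW_PER_SECOND belongs to a different service) are simplifying assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LogAnomalyConfig:
    grace_period_minutes: int = 60
    burst_threshold: float = 10.0
    semantic_auto_detect: bool = True
    semantic_enrich: bool = True
    semantic_suppress_threshold: int = 7
    max_raw_per_second: int = 1000


def _as_bool(value: str) -> bool:
    # Assumed parsing rule: common truthy spellings, case-insensitive.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_config(env=None) -> LogAnomalyConfig:
    """Build the config from environment variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return LogAnomalyConfig(
        grace_period_minutes=int(env.get("LOG_GRACE_PERIOD_MINUTES", "60")),
        burst_threshold=float(env.get("LOG_BURST_ZSCORE_THRESHOLD", "10.0")),
        semantic_auto_detect=_as_bool(env.get("LOG_SEMANTIC_AUTO_DETECT", "true")),
        semantic_enrich=_as_bool(env.get("LOG_SEMANTIC_ENRICH", "true")),
        semantic_suppress_threshold=int(env.get("LOG_SEMANTIC_SUPPRESS_THRESHOLD", "7")),
        max_raw_per_second=int(env.get("LOG_MAX_RAW_PER_SECOND", "1000")),
    )


print(load_config({"LOG_BURST_ZSCORE_THRESHOLD": "5.0", "LOG_SEMANTIC_ENRICH": "false"}))
```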

ClickHouse Tables

Table	TTL	Contents
`infrasage_log_templates`	30 days (pruned daily)	Template ID, pattern, severity, window counts per service
`infrasage_raw_firehose`	1 day	Full raw log lines per service — source for drill-down samples
`infrasage_log_template_semantics`	30 days	Claude enrichment results: severity class, summary, similarity score

Related

Anomaly Detection — parallel metric-based watchdog
Causal Anomaly Detection — pre-fault structural detection
Root Cause Analysis — triggered by confirmed incidents from any detection system
Telemetry Quality Scoring — data quality system that surfaces log coverage gaps
