Environment Variables

All InfraSage services are configured through environment variables. Set them in a `.env` file (Docker Compose), a Kubernetes Secret, or whatever environment configuration your deployment platform provides.

Core / Shared

Variable	Default	Required	Description
`ENVIRONMENT`	`development`	No	Deployment environment: `development`, `staging`, `production`
`LOG_LEVEL`	`info`	No	Log verbosity: `debug`, `info`, `warn`, `error`
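
As a pre-flight check, the shared variables can be validated against the values listed in the table before a service starts. The sketch below is a standalone helper, not part of InfraSage; the function name `load_core_settings` is hypothetical, and how the services themselves react to an unrecognized value is not documented here.

```python
import os

# Allowed values and defaults, copied from the Core / Shared table.
ALLOWED = {
    "ENVIRONMENT": ("development", "staging", "production"),
    "LOG_LEVEL": ("debug", "info", "warn", "error"),
}
DEFAULTS = {"ENVIRONMENT": "development", "LOG_LEVEL": "info"}


def load_core_settings(env=None):
    """Return validated core settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    settings = {}
    for name, default in DEFAULTS.items():
        value = env.get(name, default)
        if value not in ALLOWED[name]:
            raise ValueError(f"{name}={value!r} is not one of {ALLOWED[name]}")
        settings[name] = value
    return settings
```

Calling `load_core_settings()` with no argument reads the current process environment; passing a dict makes the check easy to run in CI against a rendered configuration.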

ClickHouse

Variable	Default	Required	Description
`CLICKHOUSE_ADDR`	`localhost:9000`	Yes	ClickHouse native protocol address
`CLICKHOUSE_DB`	`infrasage`	Yes	Database name
`CLICKHOUSE_USER`	`infrasage`	Yes	Database username
`CLICKHOUSE_PASSWORD`	`infrasage-dev`	Yes	Database password. Change in production.

Kafka / Redpanda

Variable	Default	Required	Description
`REDPANDA_BROKERS`	`localhost:9092`	Yes	Comma-separated broker addresses
`KAFKA_TOPIC`	`raw-telemetry`	No	Telemetry topic name
`KAFKA_PARTITIONS`	`3`	No	Number of topic partitions
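
`REDPANDA_BROKERS` holds a comma-separated list of `host:port` pairs. A minimal sketch of splitting and checking that value is shown below; `parse_brokers` is a hypothetical helper, and tolerating whitespace around the commas is an assumption rather than documented InfraSage behavior.

```python
def parse_brokers(value):
    """Split a comma-separated broker list into (host, port) tuples."""
    brokers = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty entries such as a trailing comma
        host, sep, port = item.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {item!r}")
        brokers.append((host, int(port)))
    if not brokers:
        raise ValueError("no broker addresses given")
    return brokers
```

For example, `parse_brokers("redpanda-0:9092, redpanda-1:9092")` yields two `(host, port)` tuples, while a value with a missing port raises `ValueError`.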

Ingestion Gateway

Variable	Default	Required	Description
`GATEWAY_HTTP_PORT`	`8080`	No	HTTP listener port
`GATEWAY_METRICS_PORT`	`9090`	No	Prometheus metrics port
`INGESTION_WORKER_COUNT`	`4`	No	Parallel Kafka publish workers
`BATCH_FIREHOSE_SIZE`	`10000`	No	Max records per ClickHouse batch write
`BATCH_FIREHOSE_TIMEOUT_MS`	`5000`	No	Max wait time before flushing a batch (ms)
`BATCH_EXEMPLAR_SIZE`	`2000`	No	Max exemplar records per batch
`BATCH_EXEMPLAR_TIMEOUT_MS`	`10000`	No	Max wait time for exemplar batch flush (ms)
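
The size and timeout settings interact: a batch is written when it fills up or when it has waited long enough, whichever comes first. The sketch below illustrates that pattern only. The `Batcher` class is hypothetical, and measuring the timeout from the first record in the batch is an assumption; the gateway's actual implementation may differ.

```python
import time


class Batcher:
    """Flush when the batch is full or its oldest record has waited too long."""

    def __init__(self, max_size, timeout_ms, flush, clock=time.monotonic):
        self.max_size = max_size            # e.g. BATCH_FIREHOSE_SIZE
        self.timeout_s = timeout_ms / 1000  # e.g. BATCH_FIREHOSE_TIMEOUT_MS
        self.flush = flush                  # callback that receives a list of records
        self.clock = clock
        self.items = []
        self.first_at = None

    def add(self, record):
        if not self.items:
            self.first_at = self.clock()    # start the timer on the first record
        self.items.append(record)
        if len(self.items) >= self.max_size:
            self._flush()

    def tick(self):
        """Call periodically; flushes a partial batch once the timeout has elapsed."""
        if self.items and self.clock() - self.first_at >= self.timeout_s:
            self._flush()

    def _flush(self):
        batch, self.items, self.first_at = self.items, [], None
        self.flush(batch)
```

Under this model, raising the size without raising the timeout only helps when traffic is high enough to fill a batch before the timer expires; at low volume the timeout bounds write latency.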

Telemetry Operator

Variable	Default	Required	Description
`OPERATOR_HTTP_PORT`	`8081`	No	HTTP listener port
`OPERATOR_METRICS_PORT`	`9091`	No	Prometheus metrics port
`OPERATOR_WORKER_COUNT`	`2`	No	Number of Kafka consumer workers

AIops Engine

Variable	Default	Required	Description
`AIOPS_HTTP_PORT`	`8080`	No	HTTP listener port
`AIOPS_METRICS_PORT`	`9092`	No	Prometheus metrics port
`ALERTMANAGER_WEBHOOK_PORT`	`9093`	No	Port for Prometheus Alertmanager webhook
`WATCHDOG_INTERVAL_SECONDS`	`60`	No	How often the anomaly watchdog polls ClickHouse
`WATCHDOG_Z_SCORE_THRESHOLD`	`3.0`	No	Z-score threshold for anomaly declaration. Lower = more sensitive.
`WATCHDOG_RCA_COOLDOWN_MINUTES`	`15`	No	Minimum minutes between RCA runs for the same service/metric
`VECTORIZER_INTERVAL_SECONDS`	`60`	No	How often to rebuild the HNSW vector index
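
To make the watchdog settings concrete, here is a rough sketch of a z-score test combined with a per-service/metric cooldown. It is illustrative only: the names `is_anomalous` and `RcaCooldown` are hypothetical, and the use of the sample standard deviation and an absolute z-score are assumptions about how the threshold is applied.

```python
import statistics


def is_anomalous(history, latest, threshold=3.0):
    """Return True if `latest` deviates from `history` by more than `threshold` sigmas."""
    if len(history) < 2:
        return False                      # not enough data for a standard deviation
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)     # sample standard deviation (assumption)
    if stdev == 0:
        return False                      # flat series: z-score is undefined
    return abs(latest - mean) / stdev > threshold


class RcaCooldown:
    """Allow at most one RCA run per (service, metric) within the cooldown window."""

    def __init__(self, cooldown_minutes=15):
        self.cooldown_s = cooldown_minutes * 60
        self.last_run = {}

    def should_run(self, service, metric, now_s):
        key = (service, metric)
        last = self.last_run.get(key)
        if last is not None and now_s - last < self.cooldown_s:
            return False
        self.last_run[key] = now_s
        return True
```

Lowering `WATCHDOG_Z_SCORE_THRESHOLD` makes `is_anomalous` fire on smaller deviations, while `WATCHDOG_RCA_COOLDOWN_MINUTES` limits how often a repeatedly anomalous metric triggers a new RCA run.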

Log Anomaly Detection (AIops Engine)

Variable	Default	Description
`LOG_GRACE_PERIOD_MINUTES`	`60`	Minutes of template history required before novel-template alerts fire for a service. Prevents false-positive floods on new deployments.
`LOG_BURST_ZSCORE_THRESHOLD`	`10.0`	Multiple of the 24-hour average that an existing template must exceed before it is flagged as bursting. Lower values = more sensitive.
`LOG_SEMANTIC_AUTO_DETECT`	`true`	Enable heuristic detection of critical keywords and state transitions from raw log lines, without requiring application instrumentation.
`LOG_SEMANTIC_ENRICH`	`true`	Enable Claude Haiku enrichment of novel templates: classifies severity, writes a summary, and detects semantic duplicates. Requires `ANTHROPIC_API_KEY`.
`LOG_SEMANTIC_SUPPRESS_THRESHOLD`	`7`	Similarity score (0–10) returned by Claude above which a novel template is suppressed as a duplicate of a recent pattern.
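
The three numeric thresholds above can be read as three simple predicates. The sketch below restates them under stated assumptions: the function names are hypothetical, the inputs (`history_minutes`, `current_rate`, `avg_rate_24h`, `similarity`) stand in for values the engine computes internally, and the strict "greater than" comparisons follow the wording "above" in the table.

```python
def novel_alert_allowed(history_minutes, grace_minutes=60):
    """Novel-template alerts fire only once a service has enough template history."""
    return history_minutes >= grace_minutes


def is_bursting(current_rate, avg_rate_24h, threshold=10.0):
    """Flag an existing template whose rate exceeds `threshold` times its 24-hour average."""
    if avg_rate_24h <= 0:
        return False  # no baseline to compare against
    return current_rate > threshold * avg_rate_24h


def should_suppress(similarity, threshold=7):
    """Suppress a novel template whose similarity score (0-10) is above the threshold."""
    return similarity > threshold
```

With the defaults, a template averaging 5 events per minute over 24 hours would need to exceed 50 events per minute to count as a burst under this reading.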

Log Rate Limiting (Ingestion Gateway)

Variable	Default	Description
`LOG_MAX_RAW_PER_SECOND`	`1000`	Per-service cap on raw log writes to `infrasage_raw_firehose`. Template clustering is unaffected. Set to `0` to disable.
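
A per-service cap of this kind can be pictured as a counter that resets every second. The following sketch uses a fixed one-second window; `RawLogLimiter` is a hypothetical name, the windowing strategy is an assumption, and `0` is interpreted as "no cap" (every raw log line is written).

```python
class RawLogLimiter:
    """Per-service cap on raw log writes, using a fixed one-second window."""

    def __init__(self, max_per_second=1000):
        self.max_per_second = max_per_second
        self.windows = {}  # service -> (window start second, count in that window)

    def allow(self, service, now_s):
        if self.max_per_second == 0:
            return True                    # 0 disables the limit (assumption)
        second = int(now_s)
        start, count = self.windows.get(service, (second, 0))
        if start != second:
            start, count = second, 0       # a new window begins
        if count >= self.max_per_second:
            self.windows[service] = (start, count)
            return False                   # over the cap: skip the raw write
        self.windows[service] = (start, count + 1)
        return True
```

Because each service has its own counter, one noisy service hitting the cap does not reduce the budget available to the others.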

LLM / AI

Variable	Default	Required	Description
`LLM_PROVIDER`	`anthropic`	No	LLM backend. Currently only `anthropic` is supported.
`ANTHROPIC_API_KEY`	—	Yes (for RCA)	Anthropic API key. Get one at console.anthropic.com.
`ANTHROPIC_MODEL`	`claude-opus-4-6`	No	Claude model to use for RCA analysis

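Tip: Without `ANTHROPIC_API_KEY`, anomaly detection and alerting still work. Only the features that call Claude are disabled: AI-generated RCA summaries and the novel-template enrichment controlled by `LOG_SEMANTIC_ENRICH`.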

Vector Index (HNSW)

Variable	Default	Required	Description
`VECTOR_HNSW_M`	`16`	No	HNSW graph connectivity. Higher = better recall, more memory.
`VECTOR_HNSW_EF_CONSTRUCTION`	`200`	No	Build-time search width. Higher = better index quality, slower build.
`VECTOR_HNSW_EF_SEARCH`	`50`	No	Query-time search width. Higher = better recall, slower queries.

For large-scale deployments (500+ services), set `VECTOR_HNSW_M=24` and `VECTOR_HNSW_EF_CONSTRUCTION=400`.
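
If the configuration is generated by a script, the sizing guidance above can be encoded directly. This is a small sketch; `hnsw_env_lines` is a hypothetical helper, and leaving `VECTOR_HNSW_EF_SEARCH` at its default for large deployments is an assumption, since the guidance does not mention it.

```python
def hnsw_env_lines(service_count):
    """Return .env lines for the HNSW settings, scaled by the number of services."""
    settings = {
        "VECTOR_HNSW_M": 16,
        "VECTOR_HNSW_EF_CONSTRUCTION": 200,
        "VECTOR_HNSW_EF_SEARCH": 50,
    }
    if service_count >= 500:
        # Large-scale recommendation from this page.
        settings["VECTOR_HNSW_M"] = 24
        settings["VECTOR_HNSW_EF_CONSTRUCTION"] = 400
    return [f"{name}={value}" for name, value in settings.items()]
```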

Integrations

Slack

Variable	Default	Description
`SLACK_WEBHOOK_URL`	—	Incoming webhook URL for alert notifications
`SLACK_BOT_TOKEN`	—	Bot token for interactive approval flows (optional)
`SLACK_CHANNEL`	`#alerts`	Default alert channel

PagerDuty

Variable	Default	Description
`PAGERDUTY_API_TOKEN`	—	PagerDuty API token
`PAGERDUTY_SERVICE_KEY`	—	Integration key for incident creation

Jira

Variable	Default	Description
`JIRA_API_TOKEN`	—	Jira API token
`JIRA_DOMAIN`	—	Your Jira domain (e.g. `mycompany.atlassian.net`)
`JIRA_PROJECT_KEY`	`OPS`	Project key for auto-created tickets
`JIRA_USERNAME`	—	Jira account email/username

Microsoft Teams

Variable	Default	Description
`TEAMS_WEBHOOK_URL`	—	Teams incoming webhook URL

AWS CloudWatch

Variable	Default	Description
`AWS_REGION`	—	AWS region (e.g. `us-east-1`)
`AWS_ACCESS_KEY_ID`	—	AWS access key (or use IAM role)
`AWS_SECRET_ACCESS_KEY`	—	AWS secret key (or use IAM role)
`CLOUDWATCH_POLL_INTERVAL_SECONDS`	`60`	How often to poll CloudWatch metrics

Grafana

Variable	Default	Description
`GF_SECURITY_ADMIN_PASSWORD`	`admin`	Grafana admin password. Change in production.
`GF_SECURITY_ADMIN_USER`	`admin`	Grafana admin username

Complete `.env` Example

# Core
ENVIRONMENT=production
LOG_LEVEL=info

# ClickHouse
CLICKHOUSE_ADDR=clickhouse:9000
CLICKHOUSE_DB=infrasage
CLICKHOUSE_USER=infrasage
CLICKHOUSE_PASSWORD=CHANGE_ME_SECURE_PASSWORD

# Kafka
REDPANDA_BROKERS=redpanda:29092

# Ingestion tuning (medium scale: 50-500 services)
INGESTION_WORKER_COUNT=16
BATCH_FIREHOSE_SIZE=50000
BATCH_FIREHOSE_TIMEOUT_MS=5000
BATCH_EXEMPLAR_SIZE=10000

# AIops Engine
WATCHDOG_INTERVAL_SECONDS=60
WATCHDOG_Z_SCORE_THRESHOLD=3.0
WATCHDOG_RCA_COOLDOWN_MINUTES=15

# AI
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-YOUR_KEY_HERE

# Integrations
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK
PAGERDUTY_API_TOKEN=your-pd-token
JIRA_API_TOKEN=your-jira-token
JIRA_DOMAIN=mycompany.atlassian.net
JIRA_USERNAME=your-jira-email@example.com

# Grafana
GF_SECURITY_ADMIN_PASSWORD=CHANGE_ME_SECURE_PASSWORD
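
Before deploying a file like the one above, it is worth checking that no placeholder or development default survived. The sketch below is a standalone check, not an InfraSage tool: `parse_dotenv` and `find_unsafe_values` are hypothetical names, and the parser is deliberately simplified (no quoting, `export` prefixes, or multi-line values).

```python
PLACEHOLDER_MARKERS = ("CHANGE_ME", "YOUR_KEY_HERE")
# Development defaults that this page says to change in production.
DEV_DEFAULTS = {
    "CLICKHOUSE_PASSWORD": "infrasage-dev",
    "GF_SECURITY_ADMIN_PASSWORD": "admin",
}


def parse_dotenv(text):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


def find_unsafe_values(values):
    """Return the sorted names of variables still set to a placeholder or dev default."""
    flagged = []
    for key, value in values.items():
        is_placeholder = any(marker in value for marker in PLACEHOLDER_MARKERS)
        if is_placeholder or DEV_DEFAULTS.get(key) == value:
            flagged.append(key)
    return sorted(flagged)
```

Running `find_unsafe_values(parse_dotenv(open(".env").read()))` in a deployment pipeline and failing on a non-empty result is one way to catch an unedited template.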
