Environment Variables

All InfraSage services are configured via environment variables. Set them in a .env file (Docker Compose), a Kubernetes Secret, or your deployment's environment configuration.


Core / Shared

| Variable | Default | Required | Description |
|---|---|---|---|
| ENVIRONMENT | development | No | Deployment environment: development, staging, production |
| LOG_LEVEL | info | No | Log verbosity: debug, info, warn, error |
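
As an illustration of how a service might consume these two settings, the following minimal sketch reads them with the documented defaults and rejects values outside the documented sets. The function name `load_core_settings` is hypothetical, and the trimming, lowercasing, and strict validation are assumptions; InfraSage itself may handle unknown values differently.

```python
import os

# Allowed values as listed in the table above.
ALLOWED_ENVIRONMENTS = {"development", "staging", "production"}
ALLOWED_LOG_LEVELS = {"debug", "info", "warn", "error"}


def load_core_settings(env=None):
    """Read ENVIRONMENT and LOG_LEVEL, applying the documented defaults.

    `env` is any mapping of variable names to strings; it defaults to the
    process environment. Normalising case is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    environment = env.get("ENVIRONMENT", "development").strip().lower()
    log_level = env.get("LOG_LEVEL", "info").strip().lower()

    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"ENVIRONMENT has unsupported value {environment!r}")
    if log_level not in ALLOWED_LOG_LEVELS:
        raise ValueError(f"LOG_LEVEL has unsupported value {log_level!r}")
    return {"environment": environment, "log_level": log_level}
```

Calling `load_core_settings({})` yields the defaults, which makes the function easy to exercise without touching the real environment.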

ClickHouse

| Variable | Default | Required | Description |
|---|---|---|---|
| CLICKHOUSE_ADDR | localhost:9000 | Yes | ClickHouse native protocol address |
| CLICKHOUSE_DB | infrasage | Yes | Database name |
| CLICKHOUSE_USER | infrasage | Yes | Database username |
| CLICKHOUSE_PASSWORD | infrasage-dev | Yes | Database password. Change in production. |
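
The "Change in production" note can be enforced with a startup check. The sketch below is one possible guard, not InfraSage behaviour: the function name `clickhouse_settings` is hypothetical, and failing fast when the development password is used with ENVIRONMENT=production is an assumption made for illustration.

```python
import os

# The development default documented in the table above.
DEV_CLICKHOUSE_PASSWORD = "infrasage-dev"


def clickhouse_settings(env=None):
    """Collect the ClickHouse variables and refuse the dev password in production."""
    env = os.environ if env is None else env
    settings = {
        "addr": env.get("CLICKHOUSE_ADDR", "localhost:9000"),
        "db": env.get("CLICKHOUSE_DB", "infrasage"),
        "user": env.get("CLICKHOUSE_USER", "infrasage"),
        "password": env.get("CLICKHOUSE_PASSWORD", DEV_CLICKHOUSE_PASSWORD),
    }
    in_production = env.get("ENVIRONMENT", "development") == "production"
    if in_production and settings["password"] == DEV_CLICKHOUSE_PASSWORD:
        raise ValueError("CLICKHOUSE_PASSWORD still has the development default")
    return settings
```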

Kafka / Redpanda

| Variable | Default | Required | Description |
|---|---|---|---|
| REDPANDA_BROKERS | localhost:9092 | Yes | Comma-separated broker addresses |
| KAFKA_TOPIC | raw-telemetry | No | Telemetry topic name |
| KAFKA_PARTITIONS | 3 | No | Number of topic partitions |
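
Because REDPANDA_BROKERS is a comma-separated list, a typo in one entry is easy to miss. The following sketch shows one way to split and validate the value before handing it to a Kafka client. The helper name `parse_brokers` and the strict host:port requirement are assumptions of this example.

```python
def parse_brokers(value):
    """Split a comma-separated broker list into (host, port) tuples.

    Whitespace around entries is ignored and empty entries are skipped.
    Every remaining entry must look like host:port with a numeric port.
    """
    brokers = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"invalid broker address {item!r}, expected host:port")
        brokers.append((host, int(port)))
    if not brokers:
        raise ValueError("REDPANDA_BROKERS contains no broker addresses")
    return brokers
```

For example, `parse_brokers("redpanda-0:9092, redpanda-1:9092")` returns two tuples, while a value such as `"localhost"` (no port) raises `ValueError`.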

Ingestion Gateway

| Variable | Default | Required | Description |
|---|---|---|---|
| GATEWAY_HTTP_PORT | 8080 | No | HTTP listener port |
| GATEWAY_METRICS_PORT | 9090 | No | Prometheus metrics port |
| INGESTION_WORKER_COUNT | 4 | No | Parallel Kafka publish workers |
| BATCH_FIREHOSE_SIZE | 10000 | No | Max records per ClickHouse batch write |
| BATCH_FIREHOSE_TIMEOUT_MS | 5000 | No | Max wait time before flushing a batch (ms) |
| BATCH_EXEMPLAR_SIZE | 2000 | No | Max exemplar records per batch |
| BATCH_EXEMPLAR_TIMEOUT_MS | 10000 | No | Max wait time for exemplar batch flush (ms) |
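
The size and timeout pairs describe a common "flush when full or when stale" batching pattern. The sketch below illustrates how the two limits interact; it is not the gateway's actual implementation. The `Batcher` class, its `tick` method, and the choice to measure the timeout from the first record in the batch are all assumptions.

```python
import time


class Batcher:
    """Buffer records and flush on whichever limit is reached first."""

    def __init__(self, max_size, timeout_ms, flush, clock=time.monotonic):
        self.max_size = max_size            # e.g. BATCH_FIREHOSE_SIZE
        self.timeout_s = timeout_ms / 1000  # e.g. BATCH_FIREHOSE_TIMEOUT_MS
        self.flush = flush                  # callable that receives a list of records
        self.clock = clock                  # injectable for testing
        self.records = []
        self.first_at = None                # arrival time of the oldest buffered record

    def add(self, record):
        if not self.records:
            self.first_at = self.clock()
        self.records.append(record)
        if len(self.records) >= self.max_size:
            self._flush()

    def tick(self):
        """Call periodically; flushes a non-empty batch that has waited too long."""
        if self.records and self.clock() - self.first_at >= self.timeout_s:
            self._flush()

    def _flush(self):
        batch, self.records = self.records, []
        self.first_at = None
        self.flush(batch)
```

Under this model, raising the size limit produces fewer, larger writes under load, while the timeout bounds how long a record can sit in the buffer when traffic is light.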

Telemetry Operator

| Variable | Default | Required | Description |
|---|---|---|---|
| OPERATOR_HTTP_PORT | 8081 | No | HTTP listener port |
| OPERATOR_METRICS_PORT | 9091 | No | Prometheus metrics port |
| OPERATOR_WORKER_COUNT | 2 | No | Number of Kafka consumer workers |

AIops Engine

| Variable | Default | Required | Description |
|---|---|---|---|
| AIOPS_HTTP_PORT | 8080 | No | HTTP listener port |
| AIOPS_METRICS_PORT | 9092 | No | Prometheus metrics port |
| ALERTMANAGER_WEBHOOK_PORT | 9093 | No | Port for Prometheus Alertmanager webhook |
| WATCHDOG_INTERVAL_SECONDS | 60 | No | How often the anomaly watchdog polls ClickHouse |
| WATCHDOG_Z_SCORE_THRESHOLD | 3.0 | No | Z-score threshold for anomaly declaration. Lower = more sensitive. |
| WATCHDOG_RCA_COOLDOWN_MINUTES | 15 | No | Minimum minutes between RCA runs for the same service/metric |
| VECTORIZER_INTERVAL_SECONDS | 60 | No | How often to rebuild the HNSW vector index |
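
To make the threshold and cooldown settings concrete, here is a minimal sketch of a z-score check and a per-service/metric cooldown. It shows the general technique only. The names `z_score`, `is_anomaly`, and `RcaCooldown` are hypothetical, and the use of the population standard deviation, the absolute value, and a `>=` comparison are assumptions; the watchdog's real statistics may differ.

```python
import statistics


def z_score(history, latest):
    """Number of standard deviations `latest` lies from the mean of `history`."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return 0.0  # flat history: this sketch reports no deviation
    return (latest - mean) / stdev


def is_anomaly(history, latest, threshold=3.0):
    """threshold plays the role of WATCHDOG_Z_SCORE_THRESHOLD."""
    return abs(z_score(history, latest)) >= threshold


class RcaCooldown:
    """Allow at most one RCA run per (service, metric) within the cooldown."""

    def __init__(self, cooldown_minutes=15):
        self.cooldown_s = cooldown_minutes * 60
        self.last_run = {}

    def allow(self, service, metric, now_s):
        key = (service, metric)
        last = self.last_run.get(key)
        if last is not None and now_s - last < self.cooldown_s:
            return False
        self.last_run[key] = now_s
        return True
```

Lowering the threshold makes `is_anomaly` fire on smaller deviations, which is why a lower value is more sensitive; the cooldown then limits how often a noisy metric can trigger a new RCA run.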

LLM / AI

| Variable | Default | Required | Description |
|---|---|---|---|
| LLM_PROVIDER | anthropic | No | LLM backend. Currently only anthropic is supported. |
| ANTHROPIC_API_KEY |  | Yes (for RCA) | Anthropic API key. Get one at console.anthropic.com. |
| ANTHROPIC_MODEL | claude-opus-4-6 | No | Claude model to use for RCA analysis |

Tip: Without ANTHROPIC_API_KEY, anomaly detection and alerting still work. Only AI-generated RCA summaries are disabled.
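
The optional nature of the API key can be modelled as a feature flag derived from configuration. The sketch below is a guess at the shape of such a check, not InfraSage code: `rca_config` and the `rca_enabled` field are hypothetical names.

```python
import os


def rca_config(env=None):
    """Derive the RCA settings; RCA is enabled only when an API key is present."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "anthropic")
    if provider != "anthropic":
        # The table above states that anthropic is the only supported backend.
        raise ValueError(f"unsupported LLM_PROVIDER {provider!r}")
    api_key = env.get("ANTHROPIC_API_KEY", "").strip()
    return {
        "provider": provider,
        "model": env.get("ANTHROPIC_MODEL", "claude-opus-4-6"),
        "rca_enabled": bool(api_key),  # detection and alerting do not depend on this
    }
```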


Vector Index (HNSW)

| Variable | Default | Required | Description |
|---|---|---|---|
| VECTOR_HNSW_M | 16 | No | HNSW graph connectivity. Higher = better recall, more memory. |
| VECTOR_HNSW_EF_CONSTRUCTION | 200 | No | Build-time search width. Higher = better index quality, slower build. |
| VECTOR_HNSW_EF_SEARCH | 50 | No | Query-time search width. Higher = better recall, slower queries. |

For large-scale deployments (500+ services), set VECTOR_HNSW_M=24 and VECTOR_HNSW_EF_CONSTRUCTION=400.
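
The sizing guidance above can be captured in a small helper, for example in a deployment script that renders the environment. This is a sketch only: `suggested_hnsw_settings` is a hypothetical name, and letting explicit overrides win over the scale-based suggestion is an assumption of this example.

```python
# Defaults from the table above.
HNSW_DEFAULTS = {
    "VECTOR_HNSW_M": 16,
    "VECTOR_HNSW_EF_CONSTRUCTION": 200,
    "VECTOR_HNSW_EF_SEARCH": 50,
}

# Values recommended in this section for 500+ services.
HNSW_LARGE_SCALE = {
    "VECTOR_HNSW_M": 24,
    "VECTOR_HNSW_EF_CONSTRUCTION": 400,
}


def suggested_hnsw_settings(service_count, overrides=None):
    """Return HNSW settings for a deployment of `service_count` services."""
    settings = dict(HNSW_DEFAULTS)
    if service_count >= 500:
        settings.update(HNSW_LARGE_SCALE)
    # Explicitly configured values take precedence over the suggestion.
    for name, value in (overrides or {}).items():
        if name in settings:
            settings[name] = int(value)
    return settings
```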


Integrations

Slack

| Variable | Default | Description |
|---|---|---|
| SLACK_WEBHOOK_URL |  | Incoming webhook URL for alert notifications |
| SLACK_BOT_TOKEN |  | Bot token for interactive approval flows (optional) |
| SLACK_CHANNEL | #alerts | Default alert channel |

PagerDuty

| Variable | Default | Description |
|---|---|---|
| PAGERDUTY_API_TOKEN |  | PagerDuty API token |
| PAGERDUTY_SERVICE_KEY |  | Integration key for incident creation |

Jira

| Variable | Default | Description |
|---|---|---|
| JIRA_API_TOKEN |  | Jira API token |
| JIRA_DOMAIN |  | Your Jira domain (e.g. mycompany.atlassian.net) |
| JIRA_PROJECT_KEY | OPS | Project key for auto-created tickets |
| JIRA_USERNAME |  | Jira account email/username |

Microsoft Teams

| Variable | Default | Description |
|---|---|---|
| TEAMS_WEBHOOK_URL |  | Teams incoming webhook URL |

AWS CloudWatch

| Variable | Default | Description |
|---|---|---|
| AWS_REGION |  | AWS region (e.g. us-east-1) |
| AWS_ACCESS_KEY_ID |  | AWS access key (or use IAM role) |
| AWS_SECRET_ACCESS_KEY |  | AWS secret key (or use IAM role) |
| CLOUDWATCH_POLL_INTERVAL_SECONDS | 60 | How often to poll CloudWatch metrics |

Grafana

| Variable | Default | Description |
|---|---|---|
| GF_SECURITY_ADMIN_PASSWORD | admin | Grafana admin password. Change in production. |
| GF_SECURITY_ADMIN_USER | admin | Grafana admin username |

Complete .env Example

# Core
ENVIRONMENT=production
LOG_LEVEL=info

# ClickHouse
CLICKHOUSE_ADDR=clickhouse:9000
CLICKHOUSE_DB=infrasage
CLICKHOUSE_USER=infrasage
CLICKHOUSE_PASSWORD=CHANGE_ME_SECURE_PASSWORD

# Kafka
REDPANDA_BROKERS=redpanda:29092

# Ingestion tuning (medium scale: 50-500 services)
INGESTION_WORKER_COUNT=16
BATCH_FIREHOSE_SIZE=50000
BATCH_FIREHOSE_TIMEOUT_MS=5000
BATCH_EXEMPLAR_SIZE=10000

# AIops Engine
WATCHDOG_INTERVAL_SECONDS=60
WATCHDOG_Z_SCORE_THRESHOLD=3.0
WATCHDOG_RCA_COOLDOWN_MINUTES=15

# AI
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-YOUR_KEY_HERE

# Integrations
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK
PAGERDUTY_API_TOKEN=your-pd-token
JIRA_API_TOKEN=your-jira-token
JIRA_DOMAIN=mycompany.atlassian.net
JIRA_USERNAME=ops@mycompany.com

# Grafana
GF_SECURITY_ADMIN_PASSWORD=CHANGE_ME_SECURE_PASSWORD
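
Before deploying a file like the one above, it can help to check that the required variables are present and that no placeholder values remain. The following pre-flight sketch does this with a deliberately simple parser. It assumes plain NAME=value lines with `#` comments and does not handle quoting or `export` prefixes the way a full dotenv parser would. The function names and the list of placeholder markers are hypothetical.

```python
# Variables marked "Yes" in the Required column of the tables above.
REQUIRED = (
    "CLICKHOUSE_ADDR",
    "CLICKHOUSE_DB",
    "CLICKHOUSE_USER",
    "CLICKHOUSE_PASSWORD",
    "REDPANDA_BROKERS",
)

# Substrings used as placeholders in the example file.
PLACEHOLDER_MARKERS = ("CHANGE_ME", "YOUR_KEY_HERE", "YOUR/WEBHOOK")


def parse_env(text):
    """Parse NAME=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected NAME=value, got {raw!r}")
        values[name.strip()] = value.strip()
    return values


def check_env(values):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"{name} is missing" for name in REQUIRED if not values.get(name)]
    for name, value in values.items():
        if any(marker in value for marker in PLACEHOLDER_MARKERS):
            problems.append(f"{name} still contains a placeholder")
    return problems
```

Run against the example file unchanged, `check_env` would report the placeholder passwords, API key, and webhook URL, which is the intended reminder to replace them.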