Environment Variables
All InfraSage services are configured via environment variables. Set them in your .env file (Docker Compose), Kubernetes Secret, or your deployment's environment configuration.
Core / Shared
| Variable | Default | Required | Description |
|---|---|---|---|
| ENVIRONMENT | development | No | Deployment environment: development, staging, production |
| LOG_LEVEL | info | No | Log verbosity: debug, info, warn, error |
ClickHouse
| Variable | Default | Required | Description |
|---|---|---|---|
| CLICKHOUSE_ADDR | localhost:9000 | Yes | ClickHouse native protocol address |
| CLICKHOUSE_DB | infrasage | Yes | Database name |
| CLICKHOUSE_USER | infrasage | Yes | Database username |
| CLICKHOUSE_PASSWORD | infrasage-dev | Yes | Database password. Change in production. |
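As an illustration of how these four variables fit together, the sketch below reads them with the documented defaults and refuses to continue if the development password is still in place in production. The function name `load_clickhouse_settings` and the production guard are assumptions made for this example, not documented InfraSage behavior.

```python
import os
from typing import Mapping

DEV_PASSWORD = "infrasage-dev"  # documented default, intended for local use only


def load_clickhouse_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the ClickHouse variables, falling back to the documented defaults."""
    settings = {
        "addr": env.get("CLICKHOUSE_ADDR", "localhost:9000"),
        "database": env.get("CLICKHOUSE_DB", "infrasage"),
        "user": env.get("CLICKHOUSE_USER", "infrasage"),
        "password": env.get("CLICKHOUSE_PASSWORD", DEV_PASSWORD),
    }
    # Hypothetical guard (an assumption of this sketch): block the default
    # password whenever ENVIRONMENT is set to production.
    in_production = env.get("ENVIRONMENT", "development") == "production"
    if in_production and settings["password"] == DEV_PASSWORD:
        raise ValueError("CLICKHOUSE_PASSWORD must be changed in production")
    # Split host:port so a client library can take them separately.
    host, sep, port = settings["addr"].rpartition(":")
    if not sep or not host:
        raise ValueError("CLICKHOUSE_ADDR must be in host:port form")
    settings["host"], settings["port"] = host, int(port)
    return settings
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.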
Kafka / Redpanda
| Variable | Default | Required | Description |
|---|---|---|---|
| REDPANDA_BROKERS | localhost:9092 | Yes | Comma-separated broker addresses |
| KAFKA_TOPIC | raw-telemetry | No | Telemetry topic name |
| KAFKA_PARTITIONS | 3 | No | Number of topic partitions |
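Because REDPANDA_BROKERS is a comma-separated list, a consumer of this variable has to split and validate it. The following is a minimal sketch of that step; `parse_brokers` is a hypothetical helper, and the `rp-0` and `rp-1` hostnames in the usage note are made up for illustration.

```python
from typing import List, Tuple


def parse_brokers(value: str) -> List[Tuple[str, int]]:
    """Split a comma-separated broker list into (host, port) pairs."""
    brokers = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and stray whitespace
        host, sep, port = item.rpartition(":")
        if not sep or not host:
            raise ValueError(f"broker {item!r} is not in host:port form")
        brokers.append((host, int(port)))
    if not brokers:
        raise ValueError("REDPANDA_BROKERS must list at least one broker")
    return brokers
```

For example, `parse_brokers("rp-0:9092, rp-1:9092")` yields one `(host, port)` pair per broker, while an empty string raises `ValueError`.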
Ingestion Gateway
| Variable | Default | Required | Description |
|---|---|---|---|
| GATEWAY_HTTP_PORT | 8080 | No | HTTP listener port |
| GATEWAY_METRICS_PORT | 9090 | No | Prometheus metrics port |
| INGESTION_WORKER_COUNT | 4 | No | Parallel Kafka publish workers |
| BATCH_FIREHOSE_SIZE | 10000 | No | Max records per ClickHouse batch write |
| BATCH_FIREHOSE_TIMEOUT_MS | 5000 | No | Max wait time before flushing a batch (ms) |
| BATCH_EXEMPLAR_SIZE | 2000 | No | Max exemplar records per batch |
| BATCH_EXEMPLAR_TIMEOUT_MS | 10000 | No | Max wait time for exemplar batch flush (ms) |
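The size and timeout pairs above describe a common batching policy: flush when the batch is full, or when the oldest buffered record has waited for the timeout, whichever happens first. The sketch below shows that policy in isolation. The `Batcher` class is hypothetical and is not the gateway's actual implementation; the injectable `clock` exists only to make the behavior testable.

```python
import time
from typing import Callable, List, Optional


class Batcher:
    """Buffer records and flush on size or age, whichever comes first."""

    def __init__(self, max_size: int, timeout_ms: int,
                 flush: Callable[[List[dict]], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_size = max_size            # e.g. BATCH_FIREHOSE_SIZE
        self.timeout_s = timeout_ms / 1000  # e.g. BATCH_FIREHOSE_TIMEOUT_MS
        self.flush = flush                  # callback that writes one batch
        self.clock = clock
        self.buffer: List[dict] = []
        self.first_at: Optional[float] = None  # arrival time of the oldest record

    def add(self, record: dict) -> None:
        if not self.buffer:
            self.first_at = self.clock()
        self.buffer.append(record)
        if len(self.buffer) >= self.max_size:
            self._flush()

    def tick(self) -> None:
        """Call periodically; flushes a partial batch once it reaches the timeout."""
        if self.buffer and self.clock() - self.first_at >= self.timeout_s:
            self._flush()

    def _flush(self) -> None:
        batch, self.buffer, self.first_at = self.buffer, [], None
        self.flush(batch)
```

Under this policy, raising the batch size favors throughput under load, while the timeout bounds how long a record can sit in memory when traffic is light.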
Telemetry Operator
| Variable | Default | Required | Description |
|---|---|---|---|
| OPERATOR_HTTP_PORT | 8081 | No | HTTP listener port |
| OPERATOR_METRICS_PORT | 9091 | No | Prometheus metrics port |
| OPERATOR_WORKER_COUNT | 2 | No | Number of Kafka consumer workers |
AIops Engine
| Variable | Default | Required | Description |
|---|---|---|---|
| AIOPS_HTTP_PORT | 8080 | No | HTTP listener port |
| AIOPS_METRICS_PORT | 9092 | No | Prometheus metrics port |
| ALERTMANAGER_WEBHOOK_PORT | 9093 | No | Port for Prometheus Alertmanager webhook |
| WATCHDOG_INTERVAL_SECONDS | 60 | No | How often the anomaly watchdog polls ClickHouse (seconds) |
| WATCHDOG_Z_SCORE_THRESHOLD | 3.0 | No | Z-score threshold for anomaly declaration. Lower = more sensitive. |
| WATCHDOG_RCA_COOLDOWN_MINUTES | 15 | No | Minimum minutes between RCA runs for the same service/metric |
| VECTORIZER_INTERVAL_SECONDS | 60 | No | How often to rebuild the HNSW vector index (seconds) |
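To make the watchdog settings concrete, the sketch below shows one way a z-score threshold and a per-service/metric cooldown could interact. The baseline statistics (population standard deviation over a list of recent values), the `>=` comparison, and the names `z_score`, `is_anomaly`, and `RcaCooldown` are all assumptions for illustration; the engine's actual statistics are not described on this page.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev
from typing import Dict, Sequence, Tuple


def z_score(history: Sequence[float], value: float) -> float:
    """Standard score of value against the historical baseline."""
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0  # flat baseline: treated as not anomalous in this sketch
    return (value - mean(history)) / sigma


def is_anomaly(history: Sequence[float], value: float, threshold: float = 3.0) -> bool:
    """True when the value deviates by at least `threshold` standard deviations."""
    return abs(z_score(history, value)) >= threshold


class RcaCooldown:
    """Allow at most one RCA run per (service, metric) within the cooldown window."""

    def __init__(self, minutes: int = 15) -> None:
        self.window = timedelta(minutes=minutes)
        self.last_run: Dict[Tuple[str, str], datetime] = {}

    def should_run(self, service: str, metric: str, now: datetime) -> bool:
        key = (service, metric)
        last = self.last_run.get(key)
        if last is not None and now - last < self.window:
            return False  # still cooling down for this service/metric
        self.last_run[key] = now
        return True
```

Lowering the threshold flags smaller deviations, and the cooldown keeps a persistently anomalous metric from triggering an RCA run on every poll.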
LLM / AI
| Variable | Default | Required | Description |
|---|---|---|---|
| LLM_PROVIDER | anthropic | No | LLM backend. Currently only anthropic is supported. |
| ANTHROPIC_API_KEY | — | Yes (for RCA) | Anthropic API key. Get one at console.anthropic.com. |
| ANTHROPIC_MODEL | claude-opus-4-6 | No | Claude model to use for RCA analysis |
Tip: Without ANTHROPIC_API_KEY, anomaly detection and alerting still work. Only AI-generated RCA summaries are disabled.
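The degradation described here can be expressed as a small feature check. This is a sketch only: `enabled_features` is a hypothetical helper, and treating a blank key as missing is an assumption of the example.

```python
import os
from typing import Dict, Mapping


def enabled_features(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Report which capabilities are available for the given environment."""
    provider = env.get("LLM_PROVIDER", "anthropic")
    api_key = env.get("ANTHROPIC_API_KEY", "").strip()
    return {
        # Detection and alerting do not depend on the LLM settings.
        "anomaly_detection": True,
        "alerting": True,
        # RCA summaries need the supported provider and a non-empty key.
        "rca_summaries": provider == "anthropic" and bool(api_key),
    }
```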
Vector Index (HNSW)
| Variable | Default | Required | Description |
|---|---|---|---|
| VECTOR_HNSW_M | 16 | No | HNSW graph connectivity. Higher = better recall, more memory. |
| VECTOR_HNSW_EF_CONSTRUCTION | 200 | No | Build-time search width. Higher = better index quality, slower build. |
| VECTOR_HNSW_EF_SEARCH | 50 | No | Query-time search width. Higher = better recall, slower queries. |
For large-scale deployments (500+ services), set VECTOR_HNSW_M=24 and VECTOR_HNSW_EF_CONSTRUCTION=400.
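The sizing guidance can be captured as a small helper that starts from the documented defaults and applies the large-scale values at 500 or more services. Both function names are hypothetical; only the numbers come from this page.

```python
from typing import Dict


def hnsw_settings(service_count: int) -> Dict[str, int]:
    """Return the HNSW variables: documented defaults, raised for 500+ services."""
    settings = {
        "VECTOR_HNSW_M": 16,
        "VECTOR_HNSW_EF_CONSTRUCTION": 200,
        "VECTOR_HNSW_EF_SEARCH": 50,
    }
    if service_count >= 500:
        settings["VECTOR_HNSW_M"] = 24
        settings["VECTOR_HNSW_EF_CONSTRUCTION"] = 400
    return settings


def to_env_lines(settings: Dict[str, int]) -> str:
    """Render the settings as KEY=VALUE lines suitable for a .env file."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())
```

Calling `to_env_lines(hnsw_settings(800))` produces three lines that can be pasted into a .env file, with VECTOR_HNSW_EF_SEARCH left at its default.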
Integrations
Slack
| Variable | Default | Description |
|---|---|---|
| SLACK_WEBHOOK_URL | — | Incoming webhook URL for alert notifications |
| SLACK_BOT_TOKEN | — | Bot token for interactive approval flows (optional) |
| SLACK_CHANNEL | #alerts | Default alert channel |
PagerDuty
| Variable | Default | Description |
|---|---|---|
| PAGERDUTY_API_TOKEN | — | PagerDuty API token |
| PAGERDUTY_SERVICE_KEY | — | Integration key for incident creation |
Jira
| Variable | Default | Description |
|---|---|---|
| JIRA_API_TOKEN | — | Jira API token |
| JIRA_DOMAIN | — | Your Jira domain (e.g. mycompany.atlassian.net) |
| JIRA_PROJECT_KEY | OPS | Project key for auto-created tickets |
| JIRA_USERNAME | — | Jira account email/username |
Microsoft Teams
| Variable | Default | Description |
|---|---|---|
| TEAMS_WEBHOOK_URL | — | Teams incoming webhook URL |
AWS CloudWatch
| Variable | Default | Description |
|---|---|---|
| AWS_REGION | — | AWS region (e.g. us-east-1) |
| AWS_ACCESS_KEY_ID | — | AWS access key (or use IAM role) |
| AWS_SECRET_ACCESS_KEY | — | AWS secret key (or use IAM role) |
| CLOUDWATCH_POLL_INTERVAL_SECONDS | 60 | How often to poll CloudWatch metrics (seconds) |
Grafana
| Variable | Default | Description |
|---|---|---|
| GF_SECURITY_ADMIN_PASSWORD | admin | Grafana admin password. Change in production. |
| GF_SECURITY_ADMIN_USER | admin | Grafana admin username |
Complete .env Example
```bash
# Core
ENVIRONMENT=production
LOG_LEVEL=info

# ClickHouse
CLICKHOUSE_ADDR=clickhouse:9000
CLICKHOUSE_DB=infrasage
CLICKHOUSE_USER=infrasage
CLICKHOUSE_PASSWORD=CHANGE_ME_SECURE_PASSWORD

# Kafka
REDPANDA_BROKERS=redpanda:29092

# Ingestion tuning (medium scale: 50-500 services)
INGESTION_WORKER_COUNT=16
BATCH_FIREHOSE_SIZE=50000
BATCH_FIREHOSE_TIMEOUT_MS=5000
BATCH_EXEMPLAR_SIZE=10000

# AIops Engine
WATCHDOG_INTERVAL_SECONDS=60
WATCHDOG_Z_SCORE_THRESHOLD=3.0
WATCHDOG_RCA_COOLDOWN_MINUTES=15

# AI
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-YOUR_KEY_HERE

# Integrations
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK
PAGERDUTY_API_TOKEN=your-pd-token
JIRA_API_TOKEN=your-jira-token
JIRA_DOMAIN=mycompany.atlassian.net
JIRA_USERNAME=ops@mycompany.com

# Grafana
GF_SECURITY_ADMIN_PASSWORD=CHANGE_ME_SECURE_PASSWORD
```
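Before deploying a file like this one, it can help to check that every variable marked Required is set and that no placeholder was left in. The sketch below is one way to do that. `parse_env` and `check_env` are hypothetical helpers, the parser handles only plain KEY=VALUE lines (no quoting or `export`), and the placeholder test is a heuristic matching the `CHANGE_ME` and `YOUR` markers used in this example.

```python
from typing import Dict, List

# Variables marked "Yes" in the Required column of the tables on this page.
REQUIRED = (
    "CLICKHOUSE_ADDR", "CLICKHOUSE_DB", "CLICKHOUSE_USER",
    "CLICKHOUSE_PASSWORD", "REDPANDA_BROKERS",
)


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_env(values: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the file looks usable."""
    problems = [f"{name} is required but not set"
                for name in REQUIRED if not values.get(name)]
    problems += [f"{name} still holds a placeholder value"
                 for name, value in values.items()
                 if value.startswith("CHANGE_ME") or "YOUR" in value]
    return problems
```

ANTHROPIC_API_KEY is left out of `REQUIRED` on purpose, since it is only needed for RCA summaries.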