Scale Profiles

InfraSage can handle anywhere from a handful of services to thousands. Use these configuration profiles as starting points and tune based on your observed throughput and latency.


Choosing a Profile

Profile   Services    Events/sec       RAM Needed   Typical Use
Small     10–50       Up to 10K/sec    8 GB         Staging, small teams, dev
Medium    50–500      Up to 100K/sec   32 GB        Mid-size production
Large     500–5,000   Up to 1M/sec     128 GB+      Enterprise, multi-region
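As a rough illustration, the table can be read as "pick the smallest profile whose limits cover both your service count and your event rate". The sketch below encodes that reading. The `Profile` class and `choose_profile` function are hypothetical helpers, not part of InfraSage, and the assumption that a value on a shared boundary (for example exactly 50 services) falls into the smaller profile is ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    max_services: int        # upper bound of the "Services" column
    max_events_per_sec: int  # upper bound of the "Events/sec" column


# Upper bounds copied from the table above, ordered smallest first.
PROFILES = [
    Profile("small", 50, 10_000),
    Profile("medium", 500, 100_000),
    Profile("large", 5_000, 1_000_000),
]


def choose_profile(services: int, events_per_sec: int) -> str:
    """Return the smallest profile whose limits cover both inputs."""
    for p in PROFILES:
        if services <= p.max_services and events_per_sec <= p.max_events_per_sec:
            return p.name
    raise ValueError("workload exceeds the Large profile; size it manually")


print(choose_profile(40, 5_000))       # small
print(choose_profile(40, 60_000))      # medium: the event rate drives the choice
print(choose_profile(1_200, 80_000))   # large: the service count drives the choice
```

Whichever dimension is larger decides the profile, so a small fleet with a very high event rate still lands in a bigger profile.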

Small (10–50 Services)

Single Kubernetes node. Suitable for development, staging, and small production environments.

INGESTION_WORKER_COUNT=4
BATCH_FIREHOSE_SIZE=10000
BATCH_FIREHOSE_TIMEOUT_MS=5000
BATCH_EXEMPLAR_SIZE=2000
BATCH_EXEMPLAR_TIMEOUT_MS=10000
WATCHDOG_INTERVAL_SECONDS=60
WATCHDOG_Z_SCORE_THRESHOLD=3.0
WATCHDOG_RCA_COOLDOWN_MINUTES=15
VECTOR_HNSW_M=16
VECTOR_HNSW_EF_CONSTRUCTION=200
OPERATOR_WORKER_COUNT=2

Recommended hardware:

  • t3.xlarge (4 vCPU, 16 GB RAM) — runs all components on one host
  • 50 GB gp3 EBS for ClickHouse data

Medium (50–500 Services)

Kubernetes deployment. Separate ClickHouse and Redpanda instances.

INGESTION_WORKER_COUNT=16
BATCH_FIREHOSE_SIZE=50000
BATCH_FIREHOSE_TIMEOUT_MS=5000
BATCH_EXEMPLAR_SIZE=10000
BATCH_EXEMPLAR_TIMEOUT_MS=5000
WATCHDOG_INTERVAL_SECONDS=60
WATCHDOG_Z_SCORE_THRESHOLD=3.0
WATCHDOG_RCA_COOLDOWN_MINUTES=15
VECTOR_HNSW_M=16
VECTOR_HNSW_EF_CONSTRUCTION=200
OPERATOR_WORKER_COUNT=4

Recommended hardware:

  • Ingestion Gateway: 3× m5.large (2 vCPU, 8 GB) with HPA
  • Telemetry Operator: 2× m5.large
  • AIops Engine: 1× m5.xlarge (4 vCPU, 16 GB)
  • ClickHouse: r5.2xlarge (8 vCPU, 64 GB RAM) with 500 GB gp3
  • Redpanda: 3-node cluster on m5.large

Large (500–5,000 Services)

Kubernetes with dedicated node pools. ClickHouse cluster.

INGESTION_WORKER_COUNT=32
BATCH_FIREHOSE_SIZE=100000
BATCH_FIREHOSE_TIMEOUT_MS=2000
BATCH_EXEMPLAR_SIZE=20000
BATCH_EXEMPLAR_TIMEOUT_MS=2000
WATCHDOG_INTERVAL_SECONDS=60
WATCHDOG_Z_SCORE_THRESHOLD=3.0
WATCHDOG_RCA_COOLDOWN_MINUTES=15
VECTOR_HNSW_M=24
VECTOR_HNSW_EF_CONSTRUCTION=400
VECTOR_HNSW_EF_SEARCH=100
OPERATOR_WORKER_COUNT=8

Recommended hardware:

  • Ingestion Gateway: 5–10× m5.2xlarge (8 vCPU, 32 GB) with HPA
  • Telemetry Operator: 4× m5.xlarge
  • AIops Engine: 2× m5.2xlarge
  • ClickHouse: 3-node r5.4xlarge cluster (16 vCPU, 128 GB each)
  • Redpanda: 3–5 node cluster on m5.2xlarge
  • Dedicated Kafka partition count: 9–12
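To see at a glance which settings actually change between profiles, the three listings above can be kept as data and compared. This is only a sketch: the `BASE`, `PROFILES`, `render_env`, and `diff` names are hypothetical, and the values are copied from the listings rather than read from InfraSage itself.

```python
# Settings that are identical in all three profile listings.
BASE = {
    "WATCHDOG_INTERVAL_SECONDS": 60,
    "WATCHDOG_Z_SCORE_THRESHOLD": 3.0,
    "WATCHDOG_RCA_COOLDOWN_MINUTES": 15,
}

# Settings that vary by profile (values copied from the listings above).
PROFILES = {
    "small": {
        "INGESTION_WORKER_COUNT": 4,
        "BATCH_FIREHOSE_SIZE": 10000,
        "BATCH_FIREHOSE_TIMEOUT_MS": 5000,
        "BATCH_EXEMPLAR_SIZE": 2000,
        "BATCH_EXEMPLAR_TIMEOUT_MS": 10000,
        "VECTOR_HNSW_M": 16,
        "VECTOR_HNSW_EF_CONSTRUCTION": 200,
        "OPERATOR_WORKER_COUNT": 2,
    },
    "medium": {
        "INGESTION_WORKER_COUNT": 16,
        "BATCH_FIREHOSE_SIZE": 50000,
        "BATCH_FIREHOSE_TIMEOUT_MS": 5000,
        "BATCH_EXEMPLAR_SIZE": 10000,
        "BATCH_EXEMPLAR_TIMEOUT_MS": 5000,
        "VECTOR_HNSW_M": 16,
        "VECTOR_HNSW_EF_CONSTRUCTION": 200,
        "OPERATOR_WORKER_COUNT": 4,
    },
    "large": {
        "INGESTION_WORKER_COUNT": 32,
        "BATCH_FIREHOSE_SIZE": 100000,
        "BATCH_FIREHOSE_TIMEOUT_MS": 2000,
        "BATCH_EXEMPLAR_SIZE": 20000,
        "BATCH_EXEMPLAR_TIMEOUT_MS": 2000,
        "VECTOR_HNSW_M": 24,
        "VECTOR_HNSW_EF_CONSTRUCTION": 400,
        "VECTOR_HNSW_EF_SEARCH": 100,  # only listed for the Large profile
        "OPERATOR_WORKER_COUNT": 8,
    },
}


def render_env(profile: str) -> str:
    """Render one profile as KEY=VALUE lines, e.g. for an env file."""
    settings = {**BASE, **PROFILES[profile]}
    return "\n".join(f"{key}={value}" for key, value in settings.items())


def diff(a: str, b: str) -> dict:
    """Return {key: (value_in_a, value_in_b)} for settings that differ."""
    keys = PROFILES[a].keys() | PROFILES[b].keys()
    return {
        k: (PROFILES[a].get(k), PROFILES[b].get(k))
        for k in sorted(keys)
        if PROFILES[a].get(k) != PROFILES[b].get(k)
    }


print(render_env("small"))
print(diff("small", "medium"))
```

Moving from Small to Medium changes only the worker counts and batch settings. The HNSW parameters change only at the Large profile, and the watchdog settings are the same in all three.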

Tuning Tips

High ingestion throughput

  • Increase INGESTION_WORKER_COUNT up to the number of available CPU cores
  • Increase BATCH_FIREHOSE_SIZE to reduce ClickHouse write overhead (fewer, larger inserts)
  • Decrease BATCH_FIREHOSE_TIMEOUT_MS if latency matters more than throughput
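The interaction between batch size and timeout can be estimated with a little arithmetic. The sketch below assumes that each ingestion worker keeps its own batch, that events are spread evenly across workers, and that a batch is written when it reaches the size limit or when the timeout elapses, whichever comes first. Those semantics are inferred from the setting names and are not confirmed here. `flush_behavior` is a hypothetical helper.

```python
def flush_behavior(events_per_sec: float, workers: int,
                   batch_size: int, timeout_ms: int) -> dict:
    """Estimate how one worker's batch is flushed under a steady load.

    Assumes an even spread of events across workers and a flush on
    "size reached OR timeout elapsed", whichever happens first.
    """
    if events_per_sec <= 0 or workers <= 0:
        raise ValueError("events_per_sec and workers must be positive")
    per_worker_rate = events_per_sec / workers
    fill_ms = batch_size / per_worker_rate * 1000   # time to fill one batch
    flush_ms = min(fill_ms, timeout_ms)             # actual flush interval
    return {
        "trigger": "size" if fill_ms <= timeout_ms else "timeout",
        "flush_interval_ms": flush_ms,
        "rows_per_insert": per_worker_rate * flush_ms / 1000,
        "inserts_per_sec": workers * 1000 / flush_ms,
    }


# Small profile at its 10K events/sec ceiling: batches fill before the timeout.
print(flush_behavior(10_000, workers=4, batch_size=10_000, timeout_ms=5_000))

# Medium profile at 100K events/sec: the timeout fires first, so raising
# BATCH_FIREHOSE_SIZE alone would not produce larger inserts.
print(flush_behavior(100_000, workers=16, batch_size=50_000, timeout_ms=5_000))
```

Under these assumptions, increasing BATCH_FIREHOSE_SIZE only helps while batches are size-triggered. Once the timeout is the trigger, the timeout value sets both the insert size and the worst-case buffering delay.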

Reduce false-positive anomalies

  • Increase WATCHDOG_Z_SCORE_THRESHOLD (e.g., 4.0 or 5.0)
  • Increase WATCHDOG_RCA_COOLDOWN_MINUTES to avoid RCA spam
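To get a feel for how much the threshold matters, the sketch below estimates the expected number of spurious triggers per metric series per day. It assumes the watchdog performs one two-sided z-score test per series per interval, that the metric is normally distributed, and that checks are independent. Real telemetry is usually heavier-tailed, so treat the results as a lower bound rather than a prediction. Both function names are hypothetical.

```python
import math


def false_alarm_probability(z_threshold: float) -> float:
    """Two-sided tail probability of a standard normal beyond +/- z."""
    return math.erfc(z_threshold / math.sqrt(2))


def expected_false_alarms_per_day(z_threshold: float,
                                  interval_seconds: int) -> float:
    """Expected spurious triggers per metric series per day."""
    checks_per_day = 86_400 / interval_seconds
    return checks_per_day * false_alarm_probability(z_threshold)


# Compare thresholds at the default 60-second interval.
for z in (3.0, 4.0, 5.0):
    rate = expected_false_alarms_per_day(z, interval_seconds=60)
    print(f"z={z}: {rate:.4f} per series per day")
```

Under these idealized assumptions, a threshold of 3.0 at a 60-second interval gives roughly 3.9 spurious triggers per series per day, while 4.0 gives roughly 0.09. Shortening the interval multiplies the number of checks, and therefore the expected false alarms, by the same factor.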

Faster anomaly detection

  • Decrease WATCHDOG_INTERVAL_SECONDS (practical minimum: 10)
  • Keep WATCHDOG_Z_SCORE_THRESHOLD at 3.0

Large incident memory / better RCA similarity

  • Increase VECTOR_HNSW_M to 24 or 32
  • Increase VECTOR_HNSW_EF_CONSTRUCTION to 400
  • Note: higher values increase memory usage and index build time
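The memory cost of a larger M can be approximated with the rule of thumb commonly quoted for hnswlib-style indexes: 4 bytes per float32 dimension, plus about 2 × M neighbor ids of 4 bytes each on the base layer. Whether InfraSage's vector store follows this layout is an assumption. The embedding dimension and incident count below are hypothetical placeholders.

```python
def hnsw_bytes_per_vector(dim: int, m: int) -> int:
    """Rough per-vector footprint: float32 vector plus base-layer links.

    Uses the common hnswlib-style estimate (dim * 4 + M * 2 * 4 bytes).
    Upper graph layers and allocator overhead are ignored.
    """
    return dim * 4 + m * 2 * 4


def index_size_mib(num_vectors: int, dim: int, m: int) -> float:
    """Approximate total index size in MiB."""
    return num_vectors * hnsw_bytes_per_vector(dim, m) / (1024 ** 2)


DIM = 768              # hypothetical embedding dimension
INCIDENTS = 1_000_000  # hypothetical number of stored incident vectors

for m in (16, 24, 32):
    size = index_size_mib(INCIDENTS, DIM, m)
    print(f"M={m}: ~{size:.0f} MiB")
```

In this estimate the link overhead grows linearly with M, so its relative impact is larger for low-dimensional embeddings than for high-dimensional ones. Build time is driven mainly by VECTOR_HNSW_EF_CONSTRUCTION and is not modeled here.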

Estimated monthly costs (AWS)

Profile   ClickHouse   Kafka    Compute   LLM API   Total
Small     $200         $100     $300      $50       ~$650
Medium    $1,000       $500     $2,000    $500      ~$4,000
Large     $5,000       $2,000   $5,000    $2,000    ~$14,000