Scale Profiles

InfraSage scales from about ten services to several thousand. Use the configuration profiles below as starting points, then tune them against your observed throughput and latency.

Choosing a Profile

Profile	Services	Events/sec	RAM Needed	Typical Use
Small	10–50	Up to 10K/sec	8 GB	Staging, small teams, dev
Medium	50–500	Up to 100K/sec	32 GB	Mid-size production
Large	500–5,000	Up to 1M/sec	128 GB+	Enterprise, multi-region
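The selection rule in the table can be sketched as a small helper. This is only an illustration: `pick_profile` is a hypothetical function, not part of InfraSage, and it assumes you want the smallest profile whose service count and events/sec limits both cover your workload.

```python
# Hypothetical helper; the limits are copied from the table above.
PROFILES = [
    # (name, max services, max events/sec)
    ("small", 50, 10_000),
    ("medium", 500, 100_000),
    ("large", 5_000, 1_000_000),
]


def pick_profile(services: int, events_per_sec: int) -> str:
    """Return the smallest profile that covers both limits."""
    for name, max_services, max_eps in PROFILES:
        if services <= max_services and events_per_sec <= max_eps:
            return name
    raise ValueError("workload exceeds the Large profile; size it separately")


print(pick_profile(40, 8_000))   # small
print(pick_profile(40, 60_000))  # medium: the event rate exceeds Small
```

Note that a low service count does not guarantee the Small profile: whichever limit is exceeded first decides the result.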

Small (10–50 Services)

Single Kubernetes node. Suitable for development, staging, and small production environments.

INGESTION_WORKER_COUNT=4
BATCH_FIREHOSE_SIZE=10000
BATCH_FIREHOSE_TIMEOUT_MS=5000
BATCH_EXEMPLAR_SIZE=2000
BATCH_EXEMPLAR_TIMEOUT_MS=10000
WATCHDOG_INTERVAL_SECONDS=60
WATCHDOG_Z_SCORE_THRESHOLD=3.0
WATCHDOG_RCA_COOLDOWN_MINUTES=15
VECTOR_HNSW_M=16
VECTOR_HNSW_EF_CONSTRUCTION=200
OPERATOR_WORKER_COUNT=2
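As a rough sketch of how a service might consume these variables, the snippet below reads them from the environment and falls back to the Small values when a variable is unset. The `load_settings` function and the `SMALL_DEFAULTS` dictionary are hypothetical names used for illustration; InfraSage's own configuration loading may work differently.

```python
import os

# Small profile values from the block above (hypothetical container name).
SMALL_DEFAULTS = {
    "INGESTION_WORKER_COUNT": 4,
    "BATCH_FIREHOSE_SIZE": 10000,
    "BATCH_FIREHOSE_TIMEOUT_MS": 5000,
    "BATCH_EXEMPLAR_SIZE": 2000,
    "BATCH_EXEMPLAR_TIMEOUT_MS": 10000,
    "WATCHDOG_INTERVAL_SECONDS": 60,
    "WATCHDOG_Z_SCORE_THRESHOLD": 3.0,
    "WATCHDOG_RCA_COOLDOWN_MINUTES": 15,
    "VECTOR_HNSW_M": 16,
    "VECTOR_HNSW_EF_CONSTRUCTION": 200,
    "OPERATOR_WORKER_COUNT": 2,
}


def load_settings(env=None, defaults=SMALL_DEFAULTS):
    """Read each setting from env, parsing it with the type of its default."""
    env = os.environ if env is None else env
    settings = {}
    for key, default in defaults.items():
        raw = env.get(key)
        # int defaults parse as int, float defaults as float
        settings[key] = default if raw is None else type(default)(raw)
    return settings


settings = load_settings({"INGESTION_WORKER_COUNT": "8"})
print(settings["INGESTION_WORKER_COUNT"])      # 8 (overridden)
print(settings["WATCHDOG_Z_SCORE_THRESHOLD"])  # 3.0 (default)
```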

Recommended hardware:

t3.xlarge (4 vCPU, 16 GB RAM) — runs all components on one host
50 GB gp3 EBS for ClickHouse data

Medium (50–500 Services)

Kubernetes deployment. Separate ClickHouse and Redpanda instances.

INGESTION_WORKER_COUNT=16
BATCH_FIREHOSE_SIZE=50000
BATCH_FIREHOSE_TIMEOUT_MS=5000
BATCH_EXEMPLAR_SIZE=10000
BATCH_EXEMPLAR_TIMEOUT_MS=5000
WATCHDOG_INTERVAL_SECONDS=60
WATCHDOG_Z_SCORE_THRESHOLD=3.0
WATCHDOG_RCA_COOLDOWN_MINUTES=15
VECTOR_HNSW_M=16
VECTOR_HNSW_EF_CONSTRUCTION=200
OPERATOR_WORKER_COUNT=4

Recommended hardware:

Ingestion Gateway: 3× m5.large (2 vCPU, 8 GB) with HPA
Telemetry Operator: 2× m5.large
AIops Engine: 1× m5.xlarge (4 vCPU, 16 GB)
ClickHouse: r5.2xlarge (8 vCPU, 64 GB RAM) with 500 GB gp3
Redpanda: 3-node cluster on m5.large

Large (500–5,000 Services)

Kubernetes with dedicated node pools. ClickHouse cluster.

INGESTION_WORKER_COUNT=32
BATCH_FIREHOSE_SIZE=100000
BATCH_FIREHOSE_TIMEOUT_MS=2000
BATCH_EXEMPLAR_SIZE=20000
BATCH_EXEMPLAR_TIMEOUT_MS=2000
WATCHDOG_INTERVAL_SECONDS=60
WATCHDOG_Z_SCORE_THRESHOLD=3.0
WATCHDOG_RCA_COOLDOWN_MINUTES=15
VECTOR_HNSW_M=24
VECTOR_HNSW_EF_CONSTRUCTION=400
VECTOR_HNSW_EF_SEARCH=100
OPERATOR_WORKER_COUNT=8

Recommended hardware:

Ingestion Gateway: 5–10× m5.2xlarge (8 vCPU, 32 GB) with HPA
Telemetry Operator: 4× m5.xlarge
AIops Engine: 2× m5.2xlarge
ClickHouse: 3-node r5.4xlarge cluster (16 vCPU, 128 GB each)
Redpanda: 3–5 node cluster on m5.2xlarge
Topic partition count on Redpanda (Kafka API): 9–12

Tuning Tips

High ingestion throughput

Increase INGESTION_WORKER_COUNT to match the available CPU cores
Increase BATCH_FIREHOSE_SIZE to reduce ClickHouse write overhead
Decrease BATCH_FIREHOSE_TIMEOUT_MS if latency matters more than throughput
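The interaction between batch size and timeout can be worked through with a little arithmetic. The sketch below assumes a batch is flushed when it reaches BATCH_FIREHOSE_SIZE events or when BATCH_FIREHOSE_TIMEOUT_MS elapses, whichever comes first, and that events arrive at a steady rate. That flush rule is an assumption about how the batcher behaves, and `flush_behavior` is a hypothetical helper.

```python
def flush_behavior(batch_size: int, timeout_ms: int, events_per_sec: float):
    """Return (seconds between flushes, events per flush) at a steady rate."""
    timeout_s = timeout_ms / 1000
    fill_s = batch_size / events_per_sec  # time needed to fill one batch
    if fill_s <= timeout_s:
        return fill_s, batch_size  # the size limit is reached first
    return timeout_s, int(events_per_sec * timeout_s)  # the timeout fires first


# Small profile values (size 10000, timeout 5000 ms) at two steady rates
print(flush_behavior(10_000, 5_000, 10_000))  # (1.0, 10000)
print(flush_behavior(10_000, 5_000, 1_000))   # (5.0, 5000)
```

Under these assumptions the timeout only matters at low event rates. At high rates the batch size decides how often ClickHouse receives a write, which is why a larger batch reduces write overhead.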

Reduce false-positive anomalies

Increase WATCHDOG_Z_SCORE_THRESHOLD (e.g., 4.0 or 5.0)
Increase WATCHDOG_RCA_COOLDOWN_MINUTES to avoid RCA spam
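To see why a higher threshold and a longer cooldown reduce noise, consider the minimal sketch below. It assumes the watchdog compares a standard z-score (distance from the mean of recent samples, in sample standard deviations) against WATCHDOG_Z_SCORE_THRESHOLD and skips RCA while the cooldown is active. The actual detector may differ, and both function names are hypothetical.

```python
from statistics import mean, stdev


def z_score(history, value):
    """Standard z-score of value against a window of recent samples."""
    sigma = stdev(history)
    if sigma == 0:
        return 0.0  # flat history: treat as no deviation
    return (value - mean(history)) / sigma


def should_trigger_rca(z, threshold, minutes_since_last_rca, cooldown_minutes):
    """Trigger only for a large deviation outside the cooldown window."""
    return abs(z) >= threshold and minutes_since_last_rca >= cooldown_minutes


latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97]  # mean 100, stdev 2
z = z_score(latencies_ms, 107)
print(z)                                   # 3.5
print(should_trigger_rca(z, 3.0, 20, 15))  # True
print(should_trigger_rca(z, 4.0, 20, 15))  # False: threshold raised
print(should_trigger_rca(z, 3.0, 5, 15))   # False: still in cooldown
```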

Faster anomaly detection

Decrease WATCHDOG_INTERVAL_SECONDS (practical minimum: 10)
Keep WATCHDOG_Z_SCORE_THRESHOLD at 3.0

Large incident memory / better RCA similarity

Increase VECTOR_HNSW_M to 24 or 32
Increase VECTOR_HNSW_EF_CONSTRUCTION to 400
Note: higher values increase memory usage and index build time
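The memory cost of a larger M can be estimated roughly. The sketch below uses a rule of thumb often quoted for hnswlib-style indexes with float32 vectors, about dim * 4 + M * 2 * 4 bytes per stored vector. Whether InfraSage's vector store follows this formula is an assumption, and the embedding dimension of 384 is a hypothetical value chosen for illustration.

```python
def hnsw_bytes_per_vector(dim: int, m: int) -> int:
    """Rough estimate: float32 vector data plus base-layer graph links."""
    return dim * 4 + m * 2 * 4


DIM = 384  # hypothetical embedding dimension
for m in (16, 24, 32):
    print(m, hnsw_bytes_per_vector(DIM, m))
# 16 1664
# 24 1728
# 32 1792
```

Under this estimate, the relative memory increase from raising M is larger for low-dimensional embeddings, where the graph links make up a bigger share of each entry.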

Estimated monthly costs (AWS)

Profile	ClickHouse	Kafka	Compute	LLM API	Total
Small	$200	$100	$300	$50	~$650
Medium	$1,000	$500	$2,000	$500	~$4,000
Large	$5,000	$2,000	$5,000	$2,000	~$14,000
