Scale Profiles
InfraSage can handle anywhere from a handful of services to thousands. Use these configuration profiles as starting points and tune based on your observed throughput and latency.
Choosing a Profile
| Profile | Services | Events/sec | RAM Needed | Typical Use |
|---|---|---|---|---|
| Small | 10–50 | Up to 10K/sec | 8 GB | Staging, small teams, dev |
| Medium | 50–500 | Up to 100K/sec | 32 GB | Mid-size production |
| Large | 500–5,000 | Up to 1M/sec | 128 GB+ | Enterprise, multi-region |
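The table can be read as a simple lookup: pick the smallest profile whose limits cover both your service count and your peak event rate. The sketch below encodes the upper bounds from the table; the `Profile` class and `choose_profile` function are hypothetical helpers for illustration, not part of InfraSage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    max_services: int
    max_events_per_sec: int


# Upper bounds copied from the profile table.
PROFILES = [
    Profile("Small", 50, 10_000),
    Profile("Medium", 500, 100_000),
    Profile("Large", 5_000, 1_000_000),
]


def choose_profile(services: int, events_per_sec: int) -> Profile:
    """Return the smallest profile whose limits cover both inputs."""
    for profile in PROFILES:
        if (services <= profile.max_services
                and events_per_sec <= profile.max_events_per_sec):
            return profile
    raise ValueError("workload exceeds the Large profile; size it individually")


print(choose_profile(40, 8_000).name)   # Small
print(choose_profile(40, 60_000).name)  # Medium (the event rate decides)
```

Whichever dimension is larger decides the profile, so a small fleet with chatty services can still need Medium.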
Small (10–50 Services)
Single Kubernetes node. Suitable for development, staging, and small production environments.
INGESTION_WORKER_COUNT=4
BATCH_FIREHOSE_SIZE=10000
BATCH_FIREHOSE_TIMEOUT_MS=5000
BATCH_EXEMPLAR_SIZE=2000
BATCH_EXEMPLAR_TIMEOUT_MS=10000
WATCHDOG_INTERVAL_SECONDS=60
WATCHDOG_Z_SCORE_THRESHOLD=3.0
WATCHDOG_RCA_COOLDOWN_MINUTES=15
VECTOR_HNSW_M=16
VECTOR_HNSW_EF_CONSTRUCTION=200
OPERATOR_WORKER_COUNT=2
Recommended hardware:
- `t3.xlarge` (4 vCPU, 16 GB RAM), runs all components on one host
- 50 GB gp3 EBS for ClickHouse data
Medium (50–500 Services)
Kubernetes deployment. Separate ClickHouse and Redpanda instances.
INGESTION_WORKER_COUNT=16
BATCH_FIREHOSE_SIZE=50000
BATCH_FIREHOSE_TIMEOUT_MS=5000
BATCH_EXEMPLAR_SIZE=10000
BATCH_EXEMPLAR_TIMEOUT_MS=5000
WATCHDOG_INTERVAL_SECONDS=60
WATCHDOG_Z_SCORE_THRESHOLD=3.0
WATCHDOG_RCA_COOLDOWN_MINUTES=15
VECTOR_HNSW_M=16
VECTOR_HNSW_EF_CONSTRUCTION=200
OPERATOR_WORKER_COUNT=4
Recommended hardware:
- Ingestion Gateway: 3× `m5.large` (2 vCPU, 8 GB) with HPA
- Telemetry Operator: 2× `m5.large`
- AIops Engine: 1× `m5.xlarge` (4 vCPU, 16 GB)
- ClickHouse: `r5.2xlarge` (8 vCPU, 64 GB RAM) with 500 GB gp3
- Redpanda: 3-node cluster on `m5.large`
Large (500–5,000 Services)
Kubernetes with dedicated node pools. ClickHouse cluster.
INGESTION_WORKER_COUNT=32
BATCH_FIREHOSE_SIZE=100000
BATCH_FIREHOSE_TIMEOUT_MS=2000
BATCH_EXEMPLAR_SIZE=20000
BATCH_EXEMPLAR_TIMEOUT_MS=2000
WATCHDOG_INTERVAL_SECONDS=60
WATCHDOG_Z_SCORE_THRESHOLD=3.0
WATCHDOG_RCA_COOLDOWN_MINUTES=15
VECTOR_HNSW_M=24
VECTOR_HNSW_EF_CONSTRUCTION=400
VECTOR_HNSW_EF_SEARCH=100
OPERATOR_WORKER_COUNT=8
Recommended hardware:
- Ingestion Gateway: 5–10× `m5.2xlarge` (8 vCPU, 32 GB) with HPA
- Telemetry Operator: 4× `m5.xlarge`
- AIops Engine: 2× `m5.2xlarge`
- ClickHouse: 3-node `r5.4xlarge` cluster (16 vCPU, 128 GB each)
- Redpanda: 3–5 node cluster on `m5.2xlarge`
- Dedicated Kafka partition count: 9–12
Tuning Tips
High ingestion throughput
- Increase `INGESTION_WORKER_COUNT` to match the available CPU cores
- Increase `BATCH_FIREHOSE_SIZE` to reduce ClickHouse write overhead
- Decrease `BATCH_FIREHOSE_TIMEOUT_MS` if latency matters more than throughput
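To see how batch size and timeout interact, the sketch below estimates how often a batch flushes and how many rows it carries. It assumes a batch is written when it reaches `BATCH_FIREHOSE_SIZE` rows or when `BATCH_FIREHOSE_TIMEOUT_MS` elapses, whichever comes first, and that a single batcher sees the full event rate. Both are assumptions about InfraSage's behavior, and `flush_stats` is a hypothetical helper.

```python
def flush_stats(events_per_sec: float, batch_size: int, timeout_ms: int):
    """Estimate (seconds between flushes, rows per flush).

    Assumes size-or-timeout flushing, whichever triggers first.
    """
    timeout_s = timeout_ms / 1000
    fill_s = batch_size / events_per_sec          # time to fill one batch
    interval_s = min(fill_s, timeout_s)
    rows = min(batch_size, int(events_per_sec * timeout_s))
    return interval_s, rows


# Small profile at its 10K events/sec ceiling: the size limit triggers first.
print(flush_stats(10_000, 10_000, 5_000))   # (1.0, 10000)

# Same settings at 500 events/sec: the timeout triggers first.
print(flush_stats(500, 10_000, 5_000))      # (5.0, 2500)
```

Under these assumptions, lowering the timeout only changes behavior when traffic is too light to fill a batch; at high rates the batch size dominates.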
Reduce false-positive anomalies
- Increase `WATCHDOG_Z_SCORE_THRESHOLD` (e.g., `4.0` or `5.0`)
- Increase `WATCHDOG_RCA_COOLDOWN_MINUTES` to avoid RCA spam
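The effect of the threshold can be illustrated with the textbook z-score, which measures how many standard deviations a value lies from its baseline mean. How InfraSage builds its baseline is not described here, so the window and the `z_score` and `is_anomaly` helpers below are illustrative assumptions only.

```python
from statistics import mean, stdev


def z_score(value: float, baseline: list) -> float:
    """Standard deviations between value and the baseline mean."""
    sigma = stdev(baseline)
    if sigma == 0:
        return 0.0  # flat baseline: no meaningful score
    return (value - mean(baseline)) / sigma


def is_anomaly(value: float, baseline: list, threshold: float) -> bool:
    return abs(z_score(value, baseline)) > threshold


# Hypothetical latency samples in ms: mean 100, sample standard deviation 2.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]

print(z_score(107, baseline))            # 3.5
print(is_anomaly(107, baseline, 3.0))    # True
print(is_anomaly(107, baseline, 4.0))    # False
```

The same reading is flagged at `3.0` but ignored at `4.0`, which is the trade-off between sensitivity and noise that the threshold controls.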
Faster anomaly detection
- Decrease `WATCHDOG_INTERVAL_SECONDS` (practical minimum: `10`)
- Keep `WATCHDOG_Z_SCORE_THRESHOLD` at `3.0`
Large incident memory / better RCA similarity
- Increase `VECTOR_HNSW_M` to `24` or `32`
- Increase `VECTOR_HNSW_EF_CONSTRUCTION` to `400`
- Note: higher values increase memory usage and index build time
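For a rough sense of the memory cost of a larger `VECTOR_HNSW_M`, the sketch below uses a common rule of thumb for HNSW indexes: each element stores its float32 vector plus up to 2×M 4-byte neighbor ids on the base layer. The embedding dimension (768) is an assumption, the estimate ignores upper layers and allocator overhead, and the actual footprint depends on the vector store InfraSage uses.

```python
def hnsw_bytes_per_vector(dim: int, m: int) -> int:
    """Rule-of-thumb size of one indexed element, in bytes."""
    vector_bytes = dim * 4    # float32 components
    link_bytes = m * 2 * 4    # up to 2*M neighbor ids, 4 bytes each
    return vector_bytes + link_bytes


def hnsw_index_mib(num_vectors: int, dim: int, m: int) -> float:
    """Approximate index size in MiB for num_vectors elements."""
    return num_vectors * hnsw_bytes_per_vector(dim, m) / (1024 ** 2)


DIM = 768  # assumed embedding dimension, not an InfraSage default

print(hnsw_bytes_per_vector(DIM, 16))   # 3200
print(hnsw_bytes_per_vector(DIM, 32))   # 3328
print(round(hnsw_index_mib(1_000_000, DIM, 32) - hnsw_index_mib(1_000_000, DIM, 16), 1))
```

With high-dimensional embeddings the vectors themselves dominate, so under this estimate raising M mainly costs build time and a modest amount of extra memory.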
Estimated monthly costs (AWS)
| Profile | ClickHouse | Kafka | Compute | LLM API | Total |
|---|---|---|---|---|---|
| Small | $200 | $100 | $300 | $50 | ~$650 |
| Medium | $1,000 | $500 | $2,000 | $500 | ~$4,000 |
| Large | $5,000 | $2,000 | $5,000 | $2,000 | ~$14,000 |
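The totals in the table are the sum of the four components. The sketch below reproduces them and adds a per-service figure at each profile's upper service bound; the `COSTS` dictionary and helper names are hypothetical, and the dollar amounts are copied from the table.

```python
# Monthly estimates in USD, copied from the cost table.
COSTS = {
    "Small":  {"ClickHouse": 200,   "Kafka": 100,   "Compute": 300,   "LLM API": 50},
    "Medium": {"ClickHouse": 1_000, "Kafka": 500,   "Compute": 2_000, "LLM API": 500},
    "Large":  {"ClickHouse": 5_000, "Kafka": 2_000, "Compute": 5_000, "LLM API": 2_000},
}

# Upper service bounds from the profile table.
MAX_SERVICES = {"Small": 50, "Medium": 500, "Large": 5_000}


def monthly_total(profile: str) -> int:
    return sum(COSTS[profile].values())


def cost_per_service(profile: str) -> float:
    """Monthly cost per service when the profile is fully used."""
    return monthly_total(profile) / MAX_SERVICES[profile]


for name in COSTS:
    print(name, monthly_total(name), cost_per_service(name))
# Small 650 13.0
# Medium 4000 8.0
# Large 14000 2.8
```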