Cost Optimization
This page covers strategies to reduce InfraSage event volume, infrastructure costs, and LLM API spend — without degrading detection quality.
Understand Your Event Volume
Before optimizing, measure what you're ingesting. Query your top event producers:
-- Top services by event count (last 24h)
SELECT
service_id,
type,
count() AS event_count,
round(event_count / sum(event_count) OVER () * 100, 1) AS pct
FROM telemetry
WHERE timestamp >= now() - INTERVAL 24 HOUR
GROUP BY service_id, type
ORDER BY event_count DESC
LIMIT 20;
-- Top metric names by event volume, with the number of services emitting each
SELECT
metric_name,
uniqExact(service_id) AS services,
count() AS total_events
FROM telemetry
WHERE type = 'metric'
AND timestamp >= now() - INTERVAL 24 HOUR
GROUP BY metric_name
ORDER BY total_events DESC
LIMIT 20;
Reduce Event Volume
1. Sample High-Frequency Metrics
Not all metrics need to be reported every second. Metrics that change slowly (CPU, memory) don't benefit from 1-second resolution.
// Report slowly-changing metrics only when they move by a meaningful amount.
import "math"

type AdaptiveSampler struct {
	lastValues map[string]float64 // must be initialized with make() before use
	minChange  float64            // minimum relative change, e.g. 0.05 for 5%
}

func (s *AdaptiveSampler) ShouldReport(metric string, value float64) bool {
	last, ok := s.lastValues[metric]
	// Compare against |last| so zero and negative baselines behave correctly.
	if !ok || math.Abs(value-last) > s.minChange*math.Abs(last) {
		s.lastValues[metric] = value
		return true
	}
	return false
}
Or configure a reporting interval per metric type:
| Metric type | Recommended interval |
|---|---|
| Request latency (P99) | 15 seconds |
| Error rate | 15 seconds |
| CPU / memory | 60 seconds |
| Disk / network I/O | 60 seconds |
| Queue depth | 30 seconds |
| Business events (payments, signups) | Real-time |
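The interval table above can be enforced at the source with a small throttle. The following is a minimal sketch, not an InfraSage API: the `IntervalThrottle` class, the metric-type keys, and the injectable `clock` parameter are all hypothetical names chosen for illustration.

```python
import time

# Intervals in seconds, taken from the table above.
# The keys are hypothetical labels, not InfraSage identifiers.
REPORT_INTERVALS_S = {
    "latency_p99": 15,
    "error_rate": 15,
    "cpu": 60,
    "memory": 60,
    "disk_io": 60,
    "network_io": 60,
    "queue_depth": 30,
    "business_event": 0,  # real-time: never throttled
}


class IntervalThrottle:
    """Allow at most one report per (service, metric type) per interval."""

    def __init__(self, intervals, clock=time.monotonic):
        self.intervals = intervals
        self.clock = clock
        self.last_sent = {}  # (service_id, metric_type) -> time of last report

    def should_report(self, service_id, metric_type):
        # Metric types missing from the table pass through unthrottled.
        interval = self.intervals.get(metric_type, 0)
        now = self.clock()
        key = (service_id, metric_type)
        last = self.last_sent.get(key)
        if last is None or now - last >= interval:
            self.last_sent[key] = now
            return True
        return False
```

Call `should_report()` before emitting each sample and skip the send when it returns `False`. Passing a custom `clock` makes the throttle easy to exercise in tests without sleeping.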
2. Drop Redundant Logs
Logs are typically the largest event category. Filter at the source:
# OTEL Collector: drop debug logs before sending to InfraSage
processors:
  filter/drop_debug:
    logs:
      exclude:
        match_type: strict
        severity_texts: ["DEBUG", "TRACE"]
service:
  pipelines:
    logs:
      processors: [filter/drop_debug]
Or via InfraSage's ingestion config:
# Drop log levels below this severity
LOG_MIN_SEVERITY=info
# Truncate log bodies over this length
LOG_MAX_BODY_BYTES=2048
3. Exclude High-Volume, Low-Signal Endpoints
Health check endpoints typically generate thousands of log lines with no anomaly signal. Exclude them:
LOG_EXCLUDE_PATHS=/health,/ready,/metrics
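If you ship logs from your own code rather than through a collector, the same three rules (minimum severity, body truncation, path exclusion) can be applied before sending. This is a rough client-side approximation, not InfraSage's ingestion logic: the `filter_log` function and the record fields `severity`, `path`, and `body` are assumptions made for the example.

```python
# Lowest to highest; an assumed ordering for illustration.
SEVERITY_ORDER = ["trace", "debug", "info", "warn", "error", "fatal"]


def _rank(severity):
    """Position in SEVERITY_ORDER; unknown levels rank highest so they are kept."""
    try:
        return SEVERITY_ORDER.index(severity.lower())
    except ValueError:
        return len(SEVERITY_ORDER)


def filter_log(record, min_severity="info", max_body_bytes=2048,
               exclude_paths=("/health", "/ready", "/metrics")):
    """Return the record to send, or None if it should be dropped."""
    if _rank(record.get("severity", "info")) < _rank(min_severity):
        return None
    if record.get("path") in exclude_paths:
        return None
    body = record.get("body", "").encode("utf-8")
    if len(body) > max_body_bytes:
        # Cut on a byte boundary and discard any partial trailing character.
        truncated = body[:max_body_bytes].decode("utf-8", errors="ignore")
        record = {**record, "body": truncated}
    return record
```

The defaults mirror the values shown above (`info`, 2048 bytes, and the three health-check paths), so a dropped record never leaves the process.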
4. Aggregate Before Ingestion
Instead of sending every HTTP request as an event, aggregate requests into per-interval summaries (every 15 seconds in the example below):
# Instead of: 1 event per request
# Send: 1 event per 15 seconds with count + latency stats
import statistics
import time


class MetricAggregator:
    def __init__(self, flush_interval_s=15):
        self.buckets = {}
        self.interval = flush_interval_s  # the caller invokes flush() on this cadence

    def record(self, service_id, metric, value):
        key = (service_id, metric)
        self.buckets.setdefault(key, []).append(value)

    def flush(self):
        events = []
        now_ms = int(time.time() * 1000)  # one timestamp shared by the whole flush
        for (service_id, metric), values in self.buckets.items():
            # statistics.quantiles() needs at least two data points
            p99 = statistics.quantiles(values, n=100)[98] if len(values) > 1 else values[0]
            stats = {"p50": statistics.median(values), "p99": p99, "count": len(values)}
            events.extend(
                {"type": "metric", "service_id": service_id, "metric_name": f"{metric}.{name}", "value": stat, "timestamp": now_ms}
                for name, stat in stats.items()
            )
        self.buckets.clear()
        return events
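To gauge how much aggregation saves, compare the two event counts directly. The aggregator above emits three events (p50, p99, count) per (service, metric) key on each flush, so the aggregated volume depends on the number of keys, not on request rate. The helper names and the 500 requests/second figure below are illustrative assumptions, not measurements.

```python
SECONDS_PER_DAY = 86_400


def daily_events_raw(requests_per_second):
    """One event per request."""
    return requests_per_second * SECONDS_PER_DAY


def daily_events_aggregated(series_count, flush_interval_s=15, stats_per_series=3):
    """Upper bound: every (service, metric) key is flushed on every interval."""
    flushes_per_day = SECONDS_PER_DAY // flush_interval_s
    return series_count * stats_per_series * flushes_per_day


raw = daily_events_raw(500)              # hypothetical service at 500 req/s
aggregated = daily_events_aggregated(1)  # one latency series for that service
print(raw, aggregated, raw // aggregated)
```

For this hypothetical service, 43,200,000 raw events per day shrink to 17,280 aggregated events, a 2,500-fold reduction. Keys that receive no values during an interval are not flushed, so real volume can be lower still.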
Choose the Right Scale Profile
The scale profile determines resource allocation. Over-provisioning is the most common cost driver.
| Profile | Recommended for | Monthly infra cost (est.) |
|---|---|---|
| small | < 1M events/day, < 10 services | ~$50–150/mo |
| medium | 1M–50M events/day, 10–50 services | ~$300–800/mo |
| large | 50M+ events/day, 50+ services | ~$1,500+/mo |
Check your current throughput:
curl http://infrasage:8080/metrics | grep ingestion_events_total
If you're on large but processing < 10M events/day, downgrade:
helm upgrade infrasage ./infrasage-chart \
--set global.scaleProfile=medium
See Scale Profiles for full resource specs.
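Because `ingestion_events_total` is a cumulative counter, a daily rate has to be derived from two samples taken some time apart. The sketch below extrapolates events/day from such a pair and maps the result onto the event-volume thresholds in the table above. Both function names are hypothetical, the service-count column is ignored, and counter resets between samples are not handled.

```python
SECONDS_PER_DAY = 86_400


def events_per_day(count_start, count_end, elapsed_s):
    """Extrapolate a daily rate from two samples of a cumulative counter."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (count_end - count_start) / elapsed_s * SECONDS_PER_DAY


def recommend_profile(daily_events):
    """Map daily event volume onto the profile table's event thresholds."""
    if daily_events < 1_000_000:
        return "small"
    if daily_events < 50_000_000:
        return "medium"
    return "large"


# Hypothetical samples: the counter grew by 360,000 over one hour.
daily = events_per_day(1_000_000, 1_360_000, 3600)
print(daily, recommend_profile(daily))
```

Sample over a window that includes peak traffic; an off-hours sample can understate the daily volume and suggest a profile that is too small.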
Tune Retention Policy
Storing 1 year of raw logs is expensive and rarely useful. Set per-type retention:
# Aggressive retention for high-volume types
RETENTION_DAYS_LOG=14 # logs are searchable for 2 weeks
RETENTION_DAYS_METRIC=90 # metrics kept for 3 months
RETENTION_DAYS_TRACE=7 # traces for 1 week
RETENTION_DAYS_EVENT=180 # events for 6 months
RETENTION_DAYS_SLO=365 # SLO data for compliance (1 year)
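Once retention is in effect, steady-state storage for each type is roughly its daily volume multiplied by its retention period. A small sketch of that arithmetic, using the retention values above; the function name and the 20 GiB/day log volume are made-up inputs for illustration.

```python
# Retention periods from the settings above, in days.
RETENTION_DAYS = {"log": 14, "metric": 90, "trace": 7, "event": 180, "slo": 365}


def steady_state_size(daily_size_by_type, retention_days=RETENTION_DAYS):
    """Approximate stored size per type: daily size x days retained.

    The result is in whatever unit the daily sizes use (GiB here).
    """
    return {t: daily * retention_days[t] for t, daily in daily_size_by_type.items()}


# Hypothetical workload: 20 GiB of compressed logs per day.
with_policy = steady_state_size({"log": 20})
one_year = steady_state_size({"log": 20}, {"log": 365})
print(with_policy["log"], one_year["log"])
```

In this example, 14-day retention holds about 280 GiB of logs, compared with about 7,300 GiB if the same logs were kept for a year.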
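To check current storage usage, query system.parts. It has no type column, so group by table and partition; if telemetry is partitioned by type, the partition value gives the per-type breakdown:
SELECT
table,
partition,
formatReadableSize(sum(bytes_on_disk)) AS disk_usage,
sum(rows) AS row_count
FROM system.parts
WHERE database = 'infrasage'
AND active
GROUP BY table, partition
-- Sort by raw bytes; sorting by the formatted string would order alphabetically
ORDER BY sum(bytes_on_disk) DESC;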
Reduce ClickHouse Storage
Compression
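ClickHouse uses LZ4 compression by default. For log bodies (high compressibility), ZSTD gives better ratios at the cost of more CPU during compression. The new codec applies to newly written parts; existing parts are recompressed as they merge: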
ALTER TABLE telemetry
MODIFY COLUMN log_body String CODEC(ZSTD(3));
Tiered Storage
Move old data to cheaper object storage (S3, GCS) automatically:
<!-- ClickHouse storage policy -->
<storage_policy>
<volumes>
<hot>
<disk>default</disk>
<max_data_part_size_bytes>1073741824</max_data_part_size_bytes>
</hot>
<cold>
<disk>s3_cold</disk>
</cold>
</volumes>
<move_factor>0.2</move_factor>
</storage_policy>
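With this policy, parts larger than 1 GiB are written directly to the cold volume, and ClickHouse starts moving parts to S3 once free space on the hot volume drops below 20% (move_factor). To move data by age instead, for example after 30 days, add a TTL rule to the table such as TTL timestamp + INTERVAL 30 DAY TO VOLUME 'cold'. Cold queries are slower, but the data remains queryable.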
Reduce LLM API Costs
RCA uses the Anthropic API when an anomaly is declared. At high anomaly volume, this adds up.
1. Raise the Anomaly Threshold
Fewer anomalies = fewer RCA calls. If you're seeing too many low-severity anomalies triggering RCA:
WATCHDOG_Z_SCORE_THRESHOLD=3.5 # default is 3.0
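To get a feel for what moving the threshold from 3.0 to 3.5 does, it helps to look at how often a healthy metric would cross each threshold by chance. The sketch below assumes a two-sided z-score test on normally distributed values; neither assumption is confirmed for InfraSage's watchdog, and real telemetry is often heavier-tailed, so treat the numbers as a rough guide only.

```python
import math


def two_sided_tail(z):
    """P(|Z| > z) for a standard normal variable."""
    return math.erfc(z / math.sqrt(2))


def expected_chance_anomalies(z, points_per_day):
    """Expected number of purely random threshold crossings per day."""
    return two_sided_tail(z) * points_per_day


for threshold in (3.0, 3.5):
    print(threshold, two_sided_tail(threshold))
```

Under these assumptions, about 0.27% of points exceed a threshold of 3.0 and about 0.047% exceed 3.5, so chance crossings (and the RCA calls they trigger) drop by a factor of roughly 5.8. The trade-off is that genuine anomalies between 3.0 and 3.5 standard deviations no longer trigger.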
2. Limit RCA to High-Severity Anomalies
RCA_TRIGGER_MIN_SEVERITY=high # skip RCA for 'low' and 'medium' anomalies
3. Use a Cheaper Model for Low-Severity RCA
LLM_MODEL_HIGH_SEVERITY=claude-opus-4-7
LLM_MODEL_LOW_SEVERITY=claude-haiku-4-5-20251001
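The severity gate from step 2 and the model split from step 3 combine into a single routing decision. The function below is a sketch of that decision, not InfraSage's implementation: the `select_model` name is hypothetical, and sending `medium` anomalies to the cheaper model is an assumption, since the settings above name only a high-severity and a low-severity model.

```python
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}

# Model identifiers as they appear in the settings above.
MODEL_HIGH_SEVERITY = "claude-opus-4-7"
MODEL_LOW_SEVERITY = "claude-haiku-4-5-20251001"


def select_model(severity, min_severity="low"):
    """Return the model to use for RCA, or None to skip RCA entirely."""
    if SEVERITY_RANK[severity] < SEVERITY_RANK[min_severity]:
        return None  # below the trigger threshold: no LLM call
    if severity == "high":
        return MODEL_HIGH_SEVERITY
    # Assumption: everything below 'high' uses the cheaper model.
    return MODEL_LOW_SEVERITY
```

With `min_severity="high"`, the cheaper model is never selected, so steps 2 and 3 are best treated as alternatives: either skip low-severity RCA or route it to a cheaper model.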
4. Cache RCA Results for Repeated Patterns
RCA_CACHE_TTL_MINUTES=30
RCA_CACHE_SIMILARITY_THRESHOLD=0.85
If two anomalies have similar causal graphs within 30 minutes, InfraSage reuses the previous RCA result instead of calling the LLM again.
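The caching behavior described above can be pictured as a time-bounded lookup keyed on graph similarity. The sketch below is one possible reading, not InfraSage's actual algorithm: it represents a causal graph as a set of edges and uses Jaccard similarity, both of which are assumptions, as are the `RcaCache` class and its method names.

```python
import time


def jaccard(a, b):
    """Similarity of two edge sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class RcaCache:
    def __init__(self, ttl_s=30 * 60, threshold=0.85, clock=time.monotonic):
        self.ttl_s = ttl_s          # mirrors RCA_CACHE_TTL_MINUTES=30
        self.threshold = threshold  # mirrors RCA_CACHE_SIMILARITY_THRESHOLD=0.85
        self.clock = clock
        self.entries = []           # (stored_at, edge_set, rca_result)

    def lookup(self, edges):
        """Return a cached result for a similar graph, or None on a miss."""
        now = self.clock()
        # Drop entries whose TTL has elapsed.
        self.entries = [e for e in self.entries if now - e[0] < self.ttl_s]
        for _, cached_edges, result in self.entries:
            if jaccard(edges, cached_edges) >= self.threshold:
                return result
        return None

    def store(self, edges, result):
        self.entries.append((self.clock(), frozenset(edges), result))
```

A caller would try `lookup()` first and only invoke the LLM, followed by `store()`, on a miss. A lower similarity threshold or longer TTL saves more calls but raises the risk of reusing an RCA that no longer fits.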
5. Use a Self-Hosted LLM
A self-hosted model eliminates per-call API cost, typically at the expense of RCA quality, and you still pay for the hardware that runs it. See Data Residency.
Cost Monitoring
Track InfraSage's own resource consumption in Grafana:
# ClickHouse disk growth rate (bytes/hour); rate() returns a per-second value
rate(clickhouse_disk_bytes_total[1h]) * 3600
# LLM API calls per hour
increase(rca_llm_calls_total[1h])
# Ingestion rate (events/second)
rate(ingestion_events_total[1m])
Set an alert if ingestion rate grows unexpectedly (could indicate a misconfigured service flooding logs):
# Prometheus alerting rule: fires when ingestion stays above 50,000 events/second for 5 minutes
- alert: InfraSageIngestionSpike
  expr: rate(ingestion_events_total[5m]) > 50000
  for: 5m
  annotations:
    summary: "Ingestion rate spike: possible log flood"