Cost Optimization

This page covers strategies to reduce InfraSage event volume, infrastructure costs, and LLM API spend without degrading detection quality.

Understand Your Event Volume

Before optimizing, measure what you're ingesting. Query your top event producers:

-- Top services by event count (last 24h)
SELECT
    service_id,
    type,
    count() AS event_count,
    round(event_count / sum(event_count) OVER () * 100, 1) AS pct
FROM telemetry
WHERE timestamp >= now() - INTERVAL 24 HOUR
GROUP BY service_id, type
ORDER BY event_count DESC
LIMIT 20;

-- Top metric names by event count, with the number of services emitting each
SELECT
    metric_name,
    uniqExact(service_id) AS services,
    count() AS total_events
FROM telemetry
WHERE type = 'metric'
  AND timestamp >= now() - INTERVAL 24 HOUR
GROUP BY metric_name
ORDER BY total_events DESC
LIMIT 20;

Reduce Event Volume

1. Sample High-Frequency Metrics

Not all metrics need to be reported every second. Metrics that change slowly (CPU, memory) don't benefit from 1-second resolution.

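// Report slowly-changing metrics less frequently
import "math"

type AdaptiveSampler struct {
    lastValues map[string]float64 // last reported value per metric
    minChange  float64            // minimum relative change (e.g. 0.05 = 5%) that triggers a report
}

// NewAdaptiveSampler initializes the map; writing to a nil map panics.
func NewAdaptiveSampler(minChange float64) *AdaptiveSampler {
    return &AdaptiveSampler{lastValues: make(map[string]float64), minChange: minChange}
}

func (s *AdaptiveSampler) ShouldReport(metric string, value float64) bool {
    last, ok := s.lastValues[metric]
    changed := value != last // fallback when last == 0, where a relative change is undefined
    if last != 0 {
        changed = math.Abs(value-last)/math.Abs(last) > s.minChange
    }
    if !ok || changed {
        s.lastValues[metric] = value
        return true
    }
    return false
}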

Or configure a reporting interval per metric type:

Metric type	Recommended interval
Request latency (P99)	15 seconds
Error rate	15 seconds
CPU / memory	60 seconds
Disk / network I/O	60 seconds
Queue depth	30 seconds
Business events (payments, signups)	Real-time
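
The per-type intervals in the table can be enforced at the source with a small throttle. The following is a minimal sketch, not part of any InfraSage SDK: the metric-type keys, the `IntervalThrottle` class, and the 60-second default for unlisted types are all assumptions made for illustration.

```python
import time

# Intervals in seconds, taken from the table above; 0 means "report immediately".
# The metric-type keys are illustrative names, not InfraSage identifiers.
REPORT_INTERVALS_S = {
    "latency_p99": 15,
    "error_rate": 15,
    "cpu": 60,
    "memory": 60,
    "disk_io": 60,
    "network_io": 60,
    "queue_depth": 30,
    "business_event": 0,
}


class IntervalThrottle:
    """Allows at most one report per (service, metric type) per configured interval."""

    def __init__(self, intervals=REPORT_INTERVALS_S, default_interval_s=60,
                 clock=time.monotonic):
        self.intervals = intervals
        self.default_interval_s = default_interval_s  # assumed default for unlisted types
        self.clock = clock
        self.last_sent = {}  # (service_id, metric_type) -> time of last report

    def should_report(self, service_id, metric_type):
        interval = self.intervals.get(metric_type, self.default_interval_s)
        key = (service_id, metric_type)
        now = self.clock()
        last = self.last_sent.get(key)
        if last is None or now - last >= interval:
            self.last_sent[key] = now
            return True
        return False
```

Call `should_report()` before emitting each sample and skip the send when it returns `False`. The clock is injectable so the behavior can be checked without waiting for real time to pass.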

2. Drop Redundant Logs

Logs are typically the largest event category. Filter at the source:

# OTEL Collector: drop debug logs before sending to InfraSage
processors:
  filter/drop_debug:
    logs:
      exclude:
        match_type: strict
        severity_texts: ["DEBUG", "TRACE"]

service:
  pipelines:
    logs:
      processors: [filter/drop_debug]

Or via InfraSage's ingestion config:

# Drop log levels below this severity
LOG_MIN_SEVERITY=info

# Truncate log bodies over this length
LOG_MAX_BODY_BYTES=2048

3. Exclude High-Volume, Low-Signal Metrics

Health check endpoints typically generate thousands of log lines with no anomaly signal. Exclude them:

LOG_EXCLUDE_PATHS=/health,/ready,/metrics
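
If you prefer to filter inside the application rather than in the collector or InfraSage, the three rules above (minimum severity, body truncation, path exclusion) can be approximated in a few lines. This is a sketch under stated assumptions: the record shape (`severity`, `body`, `path`), the severity ordering, and exact-match path comparison are illustrative and may differ from how InfraSage applies these settings.

```python
# Assumed ordering of severities, lowest first.
SEVERITY_ORDER = ["trace", "debug", "info", "warn", "error", "fatal"]


def filter_log(record, min_severity="info", max_body_bytes=2048,
               exclude_paths=("/health", "/ready", "/metrics")):
    """Return the record to send, or None if it should be dropped.

    `record` is assumed to be a dict with "severity", "body" and an optional "path".
    """
    severity = record.get("severity", "info").lower()
    if severity not in SEVERITY_ORDER:
        severity = "info"  # unknown levels are kept rather than silently dropped
    if SEVERITY_ORDER.index(severity) < SEVERITY_ORDER.index(min_severity):
        return None
    if record.get("path") in exclude_paths:
        return None
    body = record.get("body", "").encode("utf-8")
    if len(body) > max_body_bytes:
        # Truncate on a byte budget; errors="ignore" drops a split trailing character.
        truncated = body[:max_body_bytes].decode("utf-8", errors="ignore")
        record = {**record, "body": truncated}
    return record
```

Dropping at the source saves network and ingestion cost as well as storage, which is why it is usually worth doing even when a server-side filter exists.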

4. Aggregate Before Ingestion

Instead of sending every HTTP request as an event, aggregate requests into per-interval summaries (for example, one summary every 15 seconds):

# Instead of: 1 event per request
# Send: 1 event per 15 seconds with count + latency stats
import statistics
import time

class MetricAggregator:
    def __init__(self, flush_interval_s=15):
        self.buckets = {}
        self.interval = flush_interval_s  # the caller is expected to invoke flush() at this interval

    def record(self, service_id, metric, value):
        key = (service_id, metric)
        self.buckets.setdefault(key, []).append(value)

    def flush(self):
        events = []
        now_ms = int(time.time() * 1000)  # one timestamp for the whole flush
        for (service_id, metric), values in self.buckets.items():
            # statistics.quantiles() needs at least two data points
            p99 = statistics.quantiles(values, n=100)[98] if len(values) > 1 else values[0]
            for suffix, value in (("p50", statistics.median(values)), ("p99", p99), ("count", len(values))):
                events.append({"type": "metric", "service_id": service_id,
                               "metric_name": f"{metric}.{suffix}", "value": value, "timestamp": now_ms})
        self.buckets.clear()
        return events

Choose the Right Scale Profile

The scale profile determines resource allocation. Over-provisioning is the most common cost driver.

Profile	Recommended for	Monthly infra cost (est.)
`small`	< 1M events/day, < 10 services	~$50–150/mo
`medium`	1M–50M events/day, 10–50 services	~$300–800/mo
`large`	50M+ events/day, 50+ services	~$1,500+/mo

Check your current throughput. The counter is cumulative, so compare two readings taken a known time apart:

curl http://infrasage:8080/metrics | grep ingestion_events_total

If you're on `large` but processing < 10M events/day, which is well inside the `medium` range, downgrade:

helm upgrade infrasage ./infrasage-chart \
  --set global.scaleProfile=medium

See Scale Profiles for full resource specs.
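
Because `ingestion_events_total` is a running counter, a daily volume has to be derived from two readings. The sketch below extrapolates events/day from two samples and maps the result onto the event-volume boundaries in the table above. The function names are hypothetical, and the mapping ignores the service-count column, so treat the output as a starting point rather than a sizing rule.

```python
SECONDS_PER_DAY = 86_400


def events_per_day(count_start, count_end, elapsed_s):
    """Extrapolate a daily rate from two readings of a cumulative counter."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    if count_end < count_start:
        raise ValueError("counter went backwards (process restart?); take new samples")
    return (count_end - count_start) / elapsed_s * SECONDS_PER_DAY


def suggest_profile(daily_events):
    """Map daily volume to the profile boundaries in the table above."""
    if daily_events < 1_000_000:
        return "small"
    if daily_events < 50_000_000:
        return "medium"
    return "large"


# Example: the counter grew by 360,000 over one hour, i.e. 100 events/second.
daily = events_per_day(0, 360_000, 3600)
profile = suggest_profile(daily)
```

A one-hour window can miss daily peaks, so sample across a representative period (or several) before changing the profile.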

Tune Retention Policy

Storing 1 year of raw logs is expensive and rarely useful. Set per-type retention:

# Aggressive retention for high-volume types
RETENTION_DAYS_LOG=14       # logs are searchable for 2 weeks
RETENTION_DAYS_METRIC=90    # metrics kept for 3 months
RETENTION_DAYS_TRACE=7      # traces for 1 week
RETENTION_DAYS_EVENT=180    # events for 6 months
RETENTION_DAYS_SLO=365      # SLO data for compliance (1 year)

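To check current storage usage, query `system.parts`. This system table has no per-event `type` column, so the breakdown is by table and partition; a per-type view is only available if the table is partitioned by type:

SELECT
    table,
    partition,
    formatReadableSize(sum(bytes_on_disk)) AS disk_usage,
    sum(rows) AS row_count
FROM system.parts
WHERE database = 'infrasage'
  AND active                      -- ignore parts already replaced by merges
GROUP BY table, partition
ORDER BY sum(bytes_on_disk) DESC; -- sort by raw bytes, not the formatted string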
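
The effect of a retention setting on disk usage can be estimated with simple arithmetic: steady-state storage is roughly daily volume times average stored size times retention days. The sketch below uses a hypothetical workload (5M log events/day at about 200 compressed bytes each); substitute figures measured on your own deployment.

```python
def estimate_storage_gb(events_per_day, avg_compressed_bytes, retention_days):
    """Steady-state disk usage in GB (10^9 bytes) once the retention window is full."""
    return events_per_day * avg_compressed_bytes * retention_days / 1e9


# Hypothetical workload: 5M log events/day at ~200 compressed bytes each (1 GB/day).
one_year = estimate_storage_gb(5_000_000, 200, 365)
two_weeks = estimate_storage_gb(5_000_000, 200, 14)
```

Under these assumed numbers, cutting log retention from 365 days to 14 reduces steady-state log storage from about 365 GB to about 14 GB.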

Reduce ClickHouse Storage

Compression

ClickHouse uses LZ4 compression by default. For log bodies (high compressibility), ZSTD gives better ratios:

ALTER TABLE telemetry
MODIFY COLUMN log_body String CODEC(ZSTD(3));

The new codec applies to parts written after the change. Existing parts keep their old compression until they are rewritten by a merge.

Tiered Storage

Move old data to cheaper object storage (S3, GCS) automatically:

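<!-- ClickHouse storage policy. The policy name "tiered" is illustrative;
     the s3_cold disk must be defined separately under <disks>. -->
<storage_configuration>
  <policies>
    <tiered>
      <volumes>
        <hot>
          <disk>default</disk>
          <max_data_part_size_bytes>1073741824</max_data_part_size_bytes>
        </hot>
        <cold>
          <disk>s3_cold</disk>
        </cold>
      </volumes>
      <move_factor>0.2</move_factor>
    </tiered>
  </policies>
</storage_configuration>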

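With these settings, parts move to the cold volume when free space on the hot volume falls below 20% (`move_factor`), and parts larger than 1 GiB (`max_data_part_size_bytes`) are written to the cold volume directly. Moving data by age, for example after 30 days, additionally requires a table TTL clause of the form `TTL <time column> + INTERVAL 30 DAY TO VOLUME 'cold'`. Cold queries are slower, but the data remains queryable.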

Reduce LLM API Costs

RCA uses the Anthropic API when an anomaly is declared. At high anomaly volume, this adds up.

1. Raise the Anomaly Threshold

Fewer anomalies = fewer RCA calls. If you're seeing too many low-severity anomalies triggering RCA:

WATCHDOG_Z_SCORE_THRESHOLD=3.5   # default is 3.0
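
To get a feel for how much a higher threshold cuts RCA volume, the sketch below computes the expected number of purely random threshold crossings per day. It assumes the watchdog flags points whose two-sided z-score exceeds the threshold and that metric noise is roughly Gaussian; neither is confirmed here, and real metrics often have heavier tails, so read the numbers as a lower bound on noise.

```python
import math


def two_sided_tail(z):
    """P(|Z| > z) for a standard normal variable."""
    return math.erfc(z / math.sqrt(2))


def expected_false_alarms(z_threshold, points_per_day):
    """Expected number of purely random threshold crossings per day."""
    return two_sided_tail(z_threshold) * points_per_day


# One series sampled every 15 seconds yields 5,760 points per day.
at_3_0 = expected_false_alarms(3.0, 5760)
at_3_5 = expected_false_alarms(3.5, 5760)
```

Under these assumptions, raising the threshold from 3.0 to 3.5 cuts random crossings from roughly 15.6 to roughly 2.7 per series per day, a reduction of more than 5x. The trade-off is that genuine anomalies with z-scores between 3.0 and 3.5 are no longer flagged.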

2. Limit RCA to High-Severity Anomalies

RCA_TRIGGER_MIN_SEVERITY=high   # skip RCA for 'low' and 'medium' anomalies

3. Use a Cheaper Model for Low-Severity RCA

LLM_MODEL_HIGH_SEVERITY=claude-opus-4-7
LLM_MODEL_LOW_SEVERITY=claude-haiku-4-5-20251001
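
The two settings above interact with `RCA_TRIGGER_MIN_SEVERITY`, and the combined behavior can be sketched as a single routing function. This illustrates the logic only and is not InfraSage's implementation: the three-level severity scale and the choice to route `medium` anomalies to the low-severity model are assumptions.

```python
# Assumed severity scale, lowest first.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}


def select_rca_model(severity, min_severity="low",
                     high_model="claude-opus-4-7",
                     low_model="claude-haiku-4-5-20251001"):
    """Return the model to use for this anomaly, or None to skip RCA."""
    if SEVERITY_RANK[severity] < SEVERITY_RANK[min_severity]:
        return None  # below the trigger threshold: no LLM call at all
    # Assumption: everything below "high" uses the cheaper model.
    return high_model if severity == "high" else low_model
```

Note that with `min_severity="high"` the cheaper model is never selected, so combining options 2 and 3 only makes sense when the trigger threshold is `low` or `medium`.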

4. Cache RCA Results for Repeated Patterns

RCA_CACHE_TTL_MINUTES=30
RCA_CACHE_SIMILARITY_THRESHOLD=0.85

If two anomalies have similar causal graphs within 30 minutes, InfraSage reuses the previous RCA result instead of calling the LLM again.
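
The caching behavior can be pictured with the following sketch. How InfraSage actually compares causal graphs is not described here, so the example substitutes Jaccard similarity over edge sets as a stand-in; the `RcaCache` class and its methods are hypothetical.

```python
import time


def jaccard(a, b):
    """Similarity of two edge sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class RcaCache:
    def __init__(self, ttl_minutes=30, similarity_threshold=0.85, clock=time.monotonic):
        self.ttl_s = ttl_minutes * 60
        self.threshold = similarity_threshold
        self.clock = clock
        self.entries = []  # list of (stored_at, edge_set, rca_result)

    def lookup(self, edges):
        """Return a cached RCA result for a sufficiently similar graph, or None."""
        now = self.clock()
        # Drop expired entries first so they can never be matched.
        self.entries = [e for e in self.entries if now - e[0] < self.ttl_s]
        edges = frozenset(edges)
        for _, cached_edges, result in self.entries:
            if jaccard(edges, cached_edges) >= self.threshold:
                return result
        return None

    def store(self, edges, result):
        self.entries.append((self.clock(), frozenset(edges), result))
```

A caller would run `lookup()` before each LLM request and `store()` after each fresh RCA. A lower similarity threshold or a longer TTL saves more calls but raises the risk of reusing an explanation that no longer fits.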

5. Use a Self-Hosted LLM

No per-call API cost, at the expense of model quality and the compute needed to host the model. See Data Residency.

Cost Monitoring

Track InfraSage's own resource consumption in Grafana:

# ClickHouse disk growth rate (bytes/hour); rate() is per second, so scale by 3600
rate(clickhouse_disk_bytes_total[1h]) * 3600

# LLM API calls per hour
increase(rca_llm_calls_total[1h])

# Ingestion rate (events/second)
rate(ingestion_events_total[1m])

Set an alert if ingestion rate grows unexpectedly (could indicate a misconfigured service flooding logs):

# Prometheus alerting rule (Alertmanager handles routing and notification)
- alert: InfraSageIngestionSpike
  expr: rate(ingestion_events_total[5m]) > 50000
  for: 5m
  annotations:
    summary: "Ingestion rate spike — possible log flood"
