Frequently Asked Questions


General

What is InfraSage?

InfraSage is a self-hosted AIOps platform. It ingests metrics, logs, traces, and events from your Kubernetes infrastructure, runs ML-based anomaly detection, performs AI-powered root cause analysis, and can execute automated runbooks in response to incidents.

Is InfraSage an observability tool like Grafana or Prometheus?

Not exactly. Observability tools collect and display metrics, logs, and traces; that is the data layer. InfraSage sits on top of that layer: it processes the data with ML and AI to tell you what's wrong and why, rather than just displaying it. You can use Grafana alongside InfraSage; they complement each other.

What data does InfraSage send externally?

When RCA is triggered, InfraSage sends a structured analytical prompt (service IDs, metric summaries, causal graph) to the Anthropic API. No raw logs, traces, or user data are included. You can disable this entirely by setting LLM_PROVIDER=none or pointing to a self-hosted LLM. See Data Residency.


Deployment & Setup

How long does it take to deploy?

The Helm chart installs in under 5 minutes. Initial anomaly baselines build after 24–48 hours of data collection. See Quick Start.

What are the minimum infrastructure requirements?

For the small profile (up to 1M events/day): 4 vCPUs, 8 GB RAM, 100 GB storage. For production workloads, see Scale Profiles.

Does InfraSage require Kubernetes?

For self-hosted, yes — the platform is Kubernetes-native and deployed via Helm. InfraSage Cloud has no infrastructure requirements.

Can I run InfraSage on a single node?

Yes, for development and evaluation. For production, distribute ClickHouse and Kafka across at least 2 nodes for resilience.

Can I use an external ClickHouse or Kafka?

Yes. Set CLICKHOUSE_ADDR, CLICKHOUSE_USER, and CLICKHOUSE_PASSWORD to point to your existing ClickHouse instance. Kafka can be pointed at an existing cluster in the same way. This is useful if you already operate these systems as shared infrastructure.


Data & Compliance

Is InfraSage GDPR-compliant?

When self-hosted in an EU VPC, yes — all data stays within your environment. InfraSage, Inc. does not process your telemetry. If you use the Anthropic API for RCA, you should confirm Anthropic's data processing terms satisfy your DPA requirements, or disable the LLM integration entirely.

Do I need a BAA with InfraSage for HIPAA?

No. Because InfraSage is self-hosted and InfraSage, Inc. never processes your data, there is no business associate relationship requiring a BAA.

How long is data retained?

By default, 90 days for metrics and 30 days for logs. Both are configurable via environment variables. See Retention Policy docs.

Can I delete a specific tenant's data?

Yes, via the admin API:

curl -X DELETE http://infrasage:8080/api/v1/tenants/my-tenant \
-H "Authorization: Bearer $SUPER_ADMIN_JWT"

This purges all telemetry, anomalies, and RCA data for that tenant.


Integrations

Can I use InfraSage with my existing Prometheus setup?

Yes. InfraSage exposes a Prometheus remote-write endpoint at /api/v1/prometheus/remote_write. Add it as a remote_write target in your Prometheus config to forward metrics automatically.

Does InfraSage replace Prometheus + Grafana?

Not necessarily. InfraSage focuses on anomaly detection, RCA, and runbook automation. Many teams run InfraSage alongside Prometheus and Grafana — Grafana for dashboards and ad-hoc querying, InfraSage for AI-driven alerting and incident response.

Does InfraSage support OpenTelemetry?

Yes. The Ingestion Gateway accepts OTLP over HTTP and gRPC. See OpenTelemetry Integration.

Can I ingest from AWS CloudWatch?

Yes. InfraSage's integrationPoller polls CloudWatch metrics from EC2, RDS, Lambda, ALB, DynamoDB, S3, and SNS. See AWS CloudWatch.


Pricing

What is the difference between Free, Pro, and Enterprise?

Feature       Free    Pro     Enterprise
Events/day    100K    10M     Unlimited
Users         3       25      Unlimited
Tenants       1       5       Unlimited
RCA
Runbooks
SLA           None    99.9%   99.99%

See Billing Plans for the full feature comparison.

What counts as an "event"?

Each individual telemetry record counts as one event: a metric data point, a log line, a trace span, or a custom event. Batch submissions count each item in the batch individually.
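
To make the counting rule concrete, here is a minimal sketch in Python. The batch shape (a dict with metrics, logs, spans, and events lists) and the function name count_events are hypothetical, chosen only for illustration; the actual ingestion payload format is not described in this FAQ.

```python
from typing import Any, Mapping, Sequence

# Hypothetical record kinds; the real payload schema may differ.
RECORD_KINDS = ("metrics", "logs", "spans", "events")


def count_events(batch: Mapping[str, Sequence[Any]]) -> int:
    """Count events in one batch: every item counts individually."""
    return sum(len(batch.get(kind, ())) for kind in RECORD_KINDS)


batch = {
    "metrics": [{"name": "cpu", "value": 0.71}, {"name": "mem", "value": 0.42}],
    "logs": [{"msg": "started"}, {"msg": "ready"}, {"msg": "request served"}],
    "spans": [{"trace_id": "t1", "span_id": "s1"}],
}
print(count_events(batch))  # 6
```

A single submission of this batch would count as 6 events, not 1.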

What does the Anthropic API cost?

Anthropic API costs for RCA are separate from InfraSage subscription fees and billed directly by Anthropic. Typical RCA calls use 2,000–5,000 tokens. At current Sonnet pricing ($3/MTok input), a team seeing 100 RCA events/day spends approximately $1–2/day on the Anthropic API. See Cost Optimization for strategies to reduce this.
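
As a rough check on the estimate above, the sketch below works through the input-token arithmetic in Python. It assumes every token in a call is billed at the $3/MTok input rate; output tokens are priced separately and are not modeled, so treat the result as a lower bound. The function name daily_input_cost_usd is illustrative only.

```python
def daily_input_cost_usd(
    rca_calls_per_day: int,
    tokens_per_call: int,
    usd_per_million_input_tokens: float = 3.0,
) -> float:
    """Input-token cost per day; output tokens are not modeled here."""
    total_tokens = rca_calls_per_day * tokens_per_call
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


# 100 RCA events/day at 2,000 and 5,000 tokens per call
low = daily_input_cost_usd(100, 2_000)
high = daily_input_cost_usd(100, 5_000)
print(f"${low:.2f} to ${high:.2f} per day")  # $0.60 to $1.50 per day
```

Input tokens alone come to roughly $0.60 to $1.50 per day for this workload; output tokens add to that.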


Anomaly Detection & RCA

Why am I getting too many false-positive anomalies?

The most common causes are a Z-score threshold that's too low or insufficient baseline data. Try:

  1. Raise WATCHDOG_Z_SCORE_THRESHOLD to 3.5 or 4.0
  2. Wait 48–72 hours for adaptive thresholds to build a proper baseline
  3. Check if the affected service has irregular traffic patterns (cron jobs, batch processes)

See Anomaly Detection.
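
To illustrate why raising the threshold reduces false positives, here is a minimal Python sketch of a plain Z-score check against a fixed baseline. It is not InfraSage's detector (which uses adaptive thresholds); the sample values, the lower threshold of 3.0, and the function names are assumptions for illustration.

```python
from statistics import mean, stdev


def z_score(value: float, baseline: list[float]) -> float:
    """Distance of value from the baseline mean, in sample standard deviations."""
    return (value - mean(baseline)) / stdev(baseline)


def is_anomaly(value: float, baseline: list[float], threshold: float) -> bool:
    """Flag the value when its absolute Z-score exceeds the threshold."""
    return abs(z_score(value, baseline)) > threshold


# Hypothetical baseline with mean 100 and sample standard deviation 2
baseline = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 100.0]
spike = 106.4  # Z-score of about 3.2

print(is_anomaly(spike, baseline, threshold=3.0))  # True
print(is_anomaly(spike, baseline, threshold=3.5))  # False
```

The same moderate spike is flagged at a threshold of 3.0 but ignored at 3.5. A short or unrepresentative baseline has a similar effect, because it distorts the mean and standard deviation the score is computed from.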

How long does RCA take?

Typically 15–30 seconds from anomaly declaration to a completed RCA. This includes causal graph construction (local and fast) and the LLM call (about 10–20 seconds, depending on Anthropic API latency).

Can I disable RCA for specific services?

Yes:

curl -X PUT http://infrasage:8080/api/v1/tenants/my-tenant/services/my-service/config \
-H "Authorization: Bearer $JWT" \
-H "Content-Type: application/json" \
-d '{"rca_enabled": false}'

Does RCA work without the Anthropic API?

Yes. With LLM_PROVIDER=none, InfraSage still builds the causal graph, scores blast radius, and identifies affected services. The only thing skipped is the natural-language explanation.


Operations

How do I upgrade InfraSage?

helm repo update
helm upgrade infrasage infrasage/infrasage -n infrasage -f values.yaml

Check the Changelog for breaking changes before upgrading major versions.

What happens if ClickHouse goes down?

Ingested telemetry buffers in Kafka (default 24-hour retention). Once ClickHouse recovers, the Telemetry Operator drains the backlog automatically. No telemetry is lost as long as ClickHouse recovers within the Kafka retention window.

How do I scale InfraSage horizontally?

The Ingestion Gateway and AIOps Engine are stateless, so you can scale them by increasing replica counts. ClickHouse scales vertically (larger nodes) or horizontally via sharding. See Scale Profiles.