Multi-Region Deployment

InfraSage can be deployed across multiple regions or clouds. This page covers three patterns: independent per-region deployments, a hub-and-spoke federation, and active-active with cross-region replication.

When You Need Multi-Region

Data residency: Different regions have different regulatory constraints (EU, India, US). You need telemetry to stay within each region's legal boundary.
Latency: Ingestion latency to a central region is unacceptable for services in distant regions.
High availability: You want InfraSage to survive a full region failure.
Scale: A single region cannot handle your total event volume.

Pattern 1: Independent Per-Region Deployments (Recommended)

Each region runs a fully independent InfraSage stack. No telemetry crosses region boundaries. Best for data residency requirements.

  EU-WEST-1 VPC                    AP-SOUTH-1 VPC
┌──────────────────────┐         ┌──────────────────────┐
│ InfraSage (EU)        │         │ InfraSage (India)     │
│  - Ingestion Gateway  │         │  - Ingestion Gateway  │
│  - Telemetry Operator │         │  - Telemetry Operator │
│  - AIops Engine       │         │  - AIops Engine       │
│  - ClickHouse (EU)    │         │  - ClickHouse (IN)    │
│  - Kafka              │         │  - Kafka              │
└──────────────────────┘         └──────────────────────┘
         │                                  │
         ▼                                  ▼
  EU services only                  India services only

Setup

Deploy the Helm chart once per region. Use separate values.yaml files:

# values-eu.yaml
ingestionGateway:
  replicaCount: 3
  env:
    TENANT_ID: acme-eu
    REGION: eu-west-1

clickhouse:
  persistence:
    storageClass: gp3-eu-west-1
    size: 500Gi

helm install infrasage ./infrasage-chart \
  -n infrasage \
  -f values-eu.yaml \
  --kube-context eu-west-1-cluster

helm install infrasage ./infrasage-chart \
  -n infrasage \
  -f values-india.yaml \
  --kube-context ap-south-1-cluster

Each deployment gets its own API keys and tenant configuration. Your services in each region point to their local Ingestion Gateway.
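Because the two installs above differ only in the values file and kube context, they can be driven from a single loop. The following is a minimal sketch, not part of the InfraSage tooling: the function names deploy_region and deploy_all_regions are hypothetical, and it uses helm upgrade --install rather than plain helm install so that re-running it is safe.

```shell
#!/bin/sh
# Sketch: install or upgrade one InfraSage release per region.
# deploy_region / deploy_all_regions are hypothetical helper names; the chart
# path, namespace, values files, and contexts are taken from the commands above.
deploy_region() {
  values_file="$1"   # e.g. values-eu.yaml
  kube_context="$2"  # e.g. eu-west-1-cluster
  # stdin is redirected so helm cannot consume the region list read below.
  helm upgrade --install infrasage ./infrasage-chart \
    -n infrasage \
    -f "$values_file" \
    --kube-context "$kube_context" </dev/null
}

deploy_all_regions() {
  # One "values-file context" pair per line; add a line per new region.
  while read -r values_file kube_context; do
    deploy_region "$values_file" "$kube_context" || return 1
  done <<EOF
values-eu.yaml eu-west-1-cluster
values-india.yaml ap-south-1-cluster
EOF
}
```

Calling deploy_all_regions stops at the first region that fails, so a broken values file in one region does not go unnoticed while later regions deploy.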

Trade-offs


✅ Complete data residency        ❌ No unified cross-region view
✅ Region failure is isolated     ❌ Separate admin per region
✅ Simplest to operate            ❌ Anomalies are not correlated cross-region

Pattern 2: Hub-and-Spoke Federation

Regional InfraSage instances handle ingestion and local anomaly detection. A central "hub" instance aggregates alerts and provides a unified view. Only alert metadata (no raw telemetry) is forwarded to the hub.

                     HUB (Central)
                   ┌───────────────┐
                   │  AIops Engine │
                   │  Admin UI     │
                   │  (alerts only)│
                   └───────┬───────┘
                           │ webhook (alerts only)
          ┌────────────────┴────────────────┐
          │                                 │
  ┌───────┴──────┐                 ┌────────┴─────┐
  │ Spoke: EU    │                 │ Spoke: India │
  │ Full stack   │                 │ Full stack   │
  └──────────────┘                 └──────────────┘

Hub Configuration

The hub receives anomaly webhooks from each spoke's Alertmanager integration:

# Alertmanager receiver that delivers spoke alerts to the hub's webhook endpoint
receivers:
  - name: infrasage-hub
    webhook_configs:
      - url: http://infrasage-aiops.hub.internal/api/v1/alerts/webhook
        send_resolved: true

Each spoke's AIops Engine is configured to forward alerts:

# Spoke environment variable
ALERT_FORWARD_URL=http://infrasage-hub.central.internal/api/v1/alerts/webhook
ALERT_FORWARD_INCLUDE_METADATA=true
ALERT_FORWARD_INCLUDE_RAW_TELEMETRY=false  # Raw telemetry stays in region
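Before relying on forwarding, it can help to confirm that a spoke can reach the hub webhook at all. The sketch below posts one synthetic alert that carries only labels and annotations, mirroring the "metadata only" rule. It assumes the hub accepts the Alertmanager webhook JSON shape, which this page implies but does not state; the function names and the SpokeConnectivityTest alert name are hypothetical.

```shell
#!/bin/sh
# Sketch: post one synthetic alert to the hub webhook to check spoke-to-hub
# connectivity. The payload follows the Alertmanager webhook JSON shape;
# whether the hub accepts exactly this shape is an assumption.
build_test_alert() {
  region="$1"
  cat <<EOF
{
  "version": "4",
  "status": "firing",
  "receiver": "infrasage-hub",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "SpokeConnectivityTest", "region": "$region"},
      "annotations": {"summary": "Synthetic alert from spoke $region"}
    }
  ]
}
EOF
}

send_test_alert() {
  region="$1"
  # Defaults to the forward URL shown above when ALERT_FORWARD_URL is unset.
  hub_url="${ALERT_FORWARD_URL:-http://infrasage-hub.central.internal/api/v1/alerts/webhook}"
  build_test_alert "$region" |
    curl --fail --silent --show-error \
      -X POST -H 'Content-Type: application/json' \
      --data-binary @- "$hub_url"
}
```

Running send_test_alert eu-west-1 from inside the spoke cluster exits non-zero on a connection error or an HTTP error status, which makes it usable as a quick post-deploy check.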

Trade-offs


✅ Unified alert view across regions      ❌ More complex to configure
✅ Cross-region incident correlation      ❌ Hub is a single point of failure for the unified view
✅ Raw telemetry stays in region          ❌ Alert metadata crosses region boundary

Pattern 3: Active-Active with ClickHouse Replication

For HA scenarios where InfraSage itself must survive a region failure, replicate ClickHouse across regions (replicated tables with one replica per region), using ZooKeeper or ClickHouse Keeper for coordination.

Warning: This pattern is operationally complex and requires ClickHouse expertise. Only use it if your SLA requires InfraSage itself to have cross-region HA. For most teams, Pattern 1 with a per-region stack is simpler and sufficient.

ClickHouse Replication Topology

  Region A                        Region B
┌────────────────┐             ┌────────────────┐
│ ClickHouse     │◄──replicate─►│ ClickHouse     │
│ (shard 1,      │             │ (shard 1,      │
│  replica 1)    │             │  replica 2)    │
└────────────────┘             └────────────────┘
         │                              │
         └──────────── reads ───────────┘
                     (either region)

Key ClickHouse config for cross-region replication:

<!-- config.xml -->
<remote_servers>
  <infrasage_cluster>
    <shard>
      <replica>
        <host>clickhouse-a.region-a.internal</host>
        <port>9000</port>
      </replica>
      <replica>
        <host>clickhouse-b.region-b.internal</host>
        <port>9000</port>
      </replica>
    </shard>
  </infrasage_cluster>
</remote_servers>

Set in InfraSage:

CLICKHOUSE_CLUSTER=infrasage_cluster
CLICKHOUSE_REPLICATION_ENABLED=true
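Once replication is enabled, it is worth checking that both replicas are healthy before depending on them for failover. A minimal sketch, assuming clickhouse-client is installed where the check runs and can reach both hosts from the config.xml above; the function name replication_status is hypothetical. It reads ClickHouse's built-in system.replicas table.

```shell
#!/bin/sh
# Sketch: print replication state for every replicated table on each replica.
# Hostnames come from the config.xml above; replication_status is a
# hypothetical helper name.
replication_status() {
  for host in clickhouse-a.region-a.internal clickhouse-b.region-b.internal; do
    echo "== $host =="
    # is_readonly = 1 means the replica cannot accept writes;
    # absolute_delay is the replica's lag in seconds.
    clickhouse-client --host "$host" --port 9000 --query "
      SELECT database, table, is_readonly, absolute_delay, queue_size
      FROM system.replicas
      FORMAT TSVWithNames" || return 1
  done
}
```

A persistently growing absolute_delay or queue_size on the remote replica usually points at cross-region bandwidth or coordination latency rather than at InfraSage itself.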

Kubernetes Multi-Cluster Considerations

Service Discovery

If your services span clusters, point each service to the Ingestion Gateway in its own cluster. Do not route telemetry cross-cluster for ingestion.

# Service config pointing to local gateway
INFRASAGE_ENDPOINT=http://infrasage-gateway.infrasage.svc.cluster.local:8080

Unified Kubeconfig for Operations

Use a merged kubeconfig for operating multiple clusters:

KUBECONFIG=~/.kube/eu-cluster:~/.kube/india-cluster kubectl config get-contexts
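With a merged kubeconfig in place, routine checks can iterate over every regional cluster. The sketch below lists the InfraSage pods per cluster; it assumes the context names match those used in the helm commands on this page, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: list InfraSage pods in every regional cluster.
# Context names are assumed to match the helm --kube-context values above;
# infrasage_pods_all_regions is a hypothetical helper name.
infrasage_pods_all_regions() {
  for ctx in eu-west-1-cluster ap-south-1-cluster; do
    echo "== $ctx =="
    kubectl --context "$ctx" -n infrasage get pods || return 1
  done
}
```

To avoid setting KUBECONFIG on every invocation, the merged view can be written to a single file with kubectl config view --flatten and used as the default kubeconfig.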

Cross-Cluster Runbooks

Runbooks can target pods in a specific cluster by setting KUBERNETES_KUBECONFIG for each AIops Engine instance:

# In the EU InfraSage deployment
KUBERNETES_KUBECONFIG=/etc/kubeconfig/eu-cluster.yaml
KUBERNETES_NAMESPACE_SCOPE=infra,payments,auth

Cost Considerations

Running multiple full InfraSage stacks multiplies infrastructure costs. To reduce cost:

Use the small scale profile for lower-volume regions
Disable ML Engine replicas in regions where predictive capacity is not needed
Share the Grafana instance across regions (point to multiple Prometheus remotes)

See Cost Optimization for detailed guidance.
