Kubernetes

InfraSage monitors your Kubernetes cluster by ingesting pod/node metrics, namespace events, and deployment lifecycle events.


What InfraSage Collects

Signal              Source                   Examples
Pod metrics         Kubernetes Metrics API   CPU request/limit utilization, memory usage
Node metrics        Kubernetes Metrics API   Node CPU, memory, disk pressure
Pod events          Kubernetes Events API    OOMKilled, CrashLoopBackOff, FailedScheduling
Deployment events   Kubernetes Events API    Scaled, RolledBack, RolloutComplete
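
As a rough illustration of the "CPU request/limit utilization" signal above, utilization can be derived by dividing observed usage by the configured limit. The sketch below is illustrative, not InfraSage's internal implementation, and handles only the common `m`/plain-number CPU quantity forms (Kubernetes also allows suffixes like `n` and `u`):

```python
def parse_cpu_millicores(quantity: str) -> float:
    """Parse a Kubernetes CPU quantity ('450m', '0.5', '2') into millicores."""
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000.0

def utilization(usage: str, limit: str) -> float:
    """Return usage as a fraction of the configured limit (e.g. 0.9 = 90%)."""
    return parse_cpu_millicores(usage) / parse_cpu_millicores(limit)

# A pod using 450m against a 1-CPU limit is at 45% utilization.
print(utilization("450m", "1"))  # → 0.45
```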

Configuration

RBAC for InfraSage

Create a ServiceAccount, ClusterRole, and ClusterRoleBinding granting read access to the resources InfraSage polls:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: infrasage-poller
  namespace: infrasage

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: infrasage-poller
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "events", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["metrics.k8s.io"]
    resources: ["pods", "nodes"]
    verbs: ["get", "list"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: infrasage-poller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: infrasage-poller
subjects:
  - kind: ServiceAccount
    name: infrasage-poller
    namespace: infrasage

Environment Variables

# Kubernetes API server (auto-detected when running in-cluster)
KUBERNETES_API_SERVER=https://kubernetes.default.svc
KUBERNETES_SERVICE_ACCOUNT_TOKEN_PATH=/var/run/secrets/kubernetes.io/serviceaccount/token

# Polling interval
KUBERNETES_POLL_INTERVAL_SECONDS=30

# Filter by namespace (comma-separated; empty = all namespaces)
KUBERNETES_NAMESPACES=production,staging
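
A minimal sketch of how a poller might read the variables above (this mirrors the documented names and defaults, but is not InfraSage's actual config loader):

```python
import os

def load_k8s_poller_config(env=os.environ) -> dict:
    """Read the Kubernetes polling settings documented above."""
    namespaces = env.get("KUBERNETES_NAMESPACES", "")
    return {
        "api_server": env.get("KUBERNETES_API_SERVER",
                              "https://kubernetes.default.svc"),
        "poll_interval": int(env.get("KUBERNETES_POLL_INTERVAL_SECONDS", "30")),
        # An empty value means "watch all namespaces".
        "namespaces": [ns.strip() for ns in namespaces.split(",") if ns.strip()],
    }

cfg = load_k8s_poller_config({"KUBERNETES_NAMESPACES": "production, staging"})
print(cfg["namespaces"])  # → ['production', 'staging']
```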

Kubernetes Events in RCA

When InfraSage runs RCA on a metric anomaly, it automatically queries for Kubernetes events that occurred within ±5 minutes of the anomaly:

  • OOMKilled → surfaced as contributing factor to memory spikes
  • CrashLoopBackOff → correlated with elevated error rates
  • FailedScheduling → correlated with latency spikes during deployments
  • RolledBack → correlated with sudden metric improvement

This enrichment helps Claude provide highly specific root cause explanations like:

"The error rate spike at 14:03 is causally linked to an OOMKill event on checkout-api pod checkout-api-7f9d-4b2x at 14:01. Memory pressure likely caused the process to be evicted, triggering cascading failures in downstream payment-service."


Namespace-Scoped Queries

The Admin UI and API support filtering by Kubernetes namespace:

curl http://localhost:8080/api/v1/anomalies \
  -H "Authorization: Bearer $YOUR_JWT" \
  -G --data-urlencode "k8s_namespace=production"

Runbook Actions for Kubernetes

InfraSage runbooks can execute Kubernetes actions directly:

{
  "type": "kubernetes",
  "action": "scale",
  "namespace": "production",
  "deployment": "checkout-api",
  "replicas": 5
}
{
  "type": "kubernetes",
  "action": "rollout-undo",
  "namespace": "production",
  "deployment": "checkout-api"
}
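
Before dispatching an action like those above, it is worth validating its shape. The sketch below mirrors the JSON fields shown, but the validation rules themselves are illustrative assumptions, not InfraSage's documented behavior:

```python
# Fields common to the kubernetes runbook actions shown above.
REQUIRED_FIELDS = {"type", "action", "namespace", "deployment"}

def validate_action(action: dict) -> None:
    """Raise ValueError if a kubernetes runbook action is malformed."""
    missing = REQUIRED_FIELDS - action.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if action["type"] != "kubernetes":
        raise ValueError("not a kubernetes action")
    # 'scale' additionally needs an integer replica count.
    if action["action"] == "scale" and not isinstance(action.get("replicas"), int):
        raise ValueError("'scale' requires an integer 'replicas'")

validate_action({"type": "kubernetes", "action": "scale",
                 "namespace": "production", "deployment": "checkout-api",
                 "replicas": 5})  # passes silently
```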

See Runbooks & Automation for full action reference.


Verification

# Check that Kubernetes events are being ingested
docker exec infrasage-clickhouse clickhouse-client \
  --user infrasage --password infrasage-dev \
  --query "SELECT service_id, body, timestamp
           FROM infrasage.infrasage_raw_firehose
           WHERE type = 'event'
             AND attributes LIKE '%k8s%'
           ORDER BY timestamp DESC LIMIT 10"