Kubernetes

InfraSage monitors your Kubernetes cluster by ingesting pod/node metrics, namespace events, and deployment lifecycle events.


What InfraSage Collects

Signal              Source                   Examples
Pod metrics         Kubernetes Metrics API   CPU request/limit utilization, memory usage
Node metrics        Kubernetes Metrics API   Node CPU, memory, disk pressure
Pod events          Kubernetes Events API    OOMKilled, CrashLoopBackOff, FailedScheduling
Deployment events   Kubernetes Events API    Scaled, RolledBack, RolloutComplete
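
As a rough illustration of the "CPU request/limit utilization" signal above, utilization can be derived by dividing observed usage by the configured limit. The sketch below is illustrative, not InfraSage's internal implementation, and handles only the common `m`/plain-number CPU quantity forms (Kubernetes also allows suffixes like `n` and `u`):

```python
def parse_cpu_millicores(quantity: str) -> float:
    """Parse a Kubernetes CPU quantity ('450m', '0.5', '2') into millicores."""
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000.0

def utilization(usage: str, limit: str) -> float:
    """Return usage as a fraction of the configured limit (e.g. 0.9 = 90%)."""
    return parse_cpu_millicores(usage) / parse_cpu_millicores(limit)

# A pod using 450m against a 1-CPU limit is at 45% utilization.
print(utilization("450m", "1"))  # → 0.45
```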

Configuration

RBAC for InfraSage

Create a ServiceAccount, ClusterRole, and ClusterRoleBinding granting read access to the resources InfraSage polls:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: infrasage-poller
  namespace: infrasage

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: infrasage-poller
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "events", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["metrics.k8s.io"]
    resources: ["pods", "nodes"]
    verbs: ["get", "list"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: infrasage-poller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: infrasage-poller
subjects:
  - kind: ServiceAccount
    name: infrasage-poller
    namespace: infrasage

Environment Variables

# Kubernetes API server (auto-detected when running in-cluster)
KUBERNETES_API_SERVER=https://kubernetes.default.svc
KUBERNETES_SERVICE_ACCOUNT_TOKEN_PATH=/var/run/secrets/kubernetes.io/serviceaccount/token

# Polling interval
KUBERNETES_POLL_INTERVAL_SECONDS=30

# Filter by namespace (comma-separated; empty = all namespaces)
KUBERNETES_NAMESPACES=production,staging
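
A minimal sketch of how a poller might read the variables above (this mirrors the documented names and defaults, but is not InfraSage's actual config loader):

```python
import os

def load_k8s_poller_config(env=os.environ) -> dict:
    """Read the Kubernetes polling settings documented above."""
    namespaces = env.get("KUBERNETES_NAMESPACES", "")
    return {
        "api_server": env.get("KUBERNETES_API_SERVER",
                              "https://kubernetes.default.svc"),
        "poll_interval": int(env.get("KUBERNETES_POLL_INTERVAL_SECONDS", "30")),
        # An empty value means "watch all namespaces".
        "namespaces": [ns.strip() for ns in namespaces.split(",") if ns.strip()],
    }

cfg = load_k8s_poller_config({"KUBERNETES_NAMESPACES": "production, staging"})
print(cfg["namespaces"])  # → ['production', 'staging']
```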

Kubernetes Events in RCA

When InfraSage runs RCA on a metric anomaly, it automatically queries for Kubernetes events that occurred within ±5 minutes of the anomaly:

  • OOMKilled → surfaced as contributing factor to memory spikes
  • CrashLoopBackOff → correlated with elevated error rates
  • FailedScheduling → correlated with latency spikes during deployments
  • RolledBack → correlated with sudden metric improvement

This enrichment helps Claude provide highly specific root cause explanations like:

"The error rate spike at 14:03 is causally linked to an OOMKill event on checkout-api pod checkout-api-7f9d-4b2x at 14:01. Memory pressure likely caused the process to be evicted, triggering cascading failures in downstream payment-service."


Namespace-Scoped Queries

The Admin UI and API support filtering by Kubernetes namespace:

curl http://localhost:8080/api/v1/anomalies \
  -H "Authorization: Bearer $YOUR_JWT" \
  -G --data-urlencode "k8s_namespace=production"

Runbook Actions for Kubernetes

InfraSage runbooks can execute Kubernetes actions directly:

{
  "type": "kubernetes",
  "action": "scale",
  "namespace": "production",
  "deployment": "checkout-api",
  "replicas": 5
}
{
  "type": "kubernetes",
  "action": "rollout-undo",
  "namespace": "production",
  "deployment": "checkout-api"
}
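
Before dispatching an action like those above, it is worth validating its shape. The sketch below mirrors the JSON fields shown, but the validation rules themselves are illustrative assumptions, not InfraSage's documented behavior:

```python
# Fields common to the kubernetes runbook actions shown above.
REQUIRED_FIELDS = {"type", "action", "namespace", "deployment"}

def validate_action(action: dict) -> None:
    """Raise ValueError if a kubernetes runbook action is malformed."""
    missing = REQUIRED_FIELDS - action.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if action["type"] != "kubernetes":
        raise ValueError("not a kubernetes action")
    # 'scale' additionally needs an integer replica count.
    if action["action"] == "scale" and not isinstance(action.get("replicas"), int):
        raise ValueError("'scale' requires an integer 'replicas'")

validate_action({"type": "kubernetes", "action": "scale",
                 "namespace": "production", "deployment": "checkout-api",
                 "replicas": 5})  # passes silently
```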

See Runbooks & Automation for full action reference.


Verification

# Check that Kubernetes events are being ingested
docker exec infrasage-clickhouse clickhouse-client \
  --user infrasage --password infrasage-dev \
  --query "SELECT service_id, body, timestamp
           FROM infrasage.infrasage_raw_firehose
           WHERE type = 'event'
             AND attributes LIKE '%k8s%'
           ORDER BY timestamp DESC LIMIT 10"