# Kubernetes
InfraSage monitors your Kubernetes cluster by ingesting pod/node metrics, namespace events, and deployment lifecycle events.
## What InfraSage Collects
| Signal | Source | Examples |
|---|---|---|
| Pod metrics | Kubernetes Metrics API | CPU request/limit utilization, memory usage |
| Node metrics | Kubernetes Metrics API | Node CPU, memory, disk pressure |
| Pod events | Kubernetes Events API | OOMKilled, CrashLoopBackOff, FailedScheduling |
| Deployment events | Kubernetes Events API | Scaled, RolledBack, RolloutComplete |
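Pod CPU quantities from the Metrics API are expressed in millicores (`250m`) or whole cores (`1`), and request/limit utilization is the ratio of current usage to the configured value. A minimal sketch of that calculation (the function names are illustrative, not part of InfraSage):

```python
def parse_millicores(cpu: str) -> float:
    """Convert a Kubernetes CPU quantity ("250m" or "1") to millicores."""
    return float(cpu[:-1]) if cpu.endswith("m") else float(cpu) * 1000

def request_utilization(usage: str, request: str) -> float:
    """Percentage of the CPU request currently in use."""
    return 100.0 * parse_millicores(usage) / parse_millicores(request)

# A pod using 150m against a 500m request is at 30% of its request.
print(request_utilization("150m", "500m"))  # 30.0
```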
## Configuration

### RBAC for InfraSage
Create a ServiceAccount for InfraSage and grant it read-only access to the resources it polls via a ClusterRole and ClusterRoleBinding:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: infrasage-poller
  namespace: infrasage
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: infrasage-poller
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "events", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["metrics.k8s.io"]
    resources: ["pods", "nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: infrasage-poller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: infrasage-poller
subjects:
  - kind: ServiceAccount
    name: infrasage-poller
    namespace: infrasage
```

Apply the manifest with `kubectl apply -f`.
### Environment Variables

```bash
# Kubernetes API server (auto-detected when running in-cluster)
KUBERNETES_API_SERVER=https://kubernetes.default.svc
KUBERNETES_SERVICE_ACCOUNT_TOKEN_PATH=/var/run/secrets/kubernetes.io/serviceaccount/token

# Polling interval
KUBERNETES_POLL_INTERVAL_SECONDS=30

# Filter by namespace (comma-separated; empty = all namespaces)
KUBERNETES_NAMESPACES=production,staging
```
## Kubernetes Events in RCA
When InfraSage runs RCA on a metric anomaly, it automatically queries for Kubernetes events that occurred within ±5 minutes of the anomaly:
- OOMKilled → surfaced as contributing factor to memory spikes
- CrashLoopBackOff → correlated with elevated error rates
- FailedScheduling → correlated with latency spikes during deployments
- RolledBack → correlated with sudden metric improvement
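The ±5 minute correlation window described above amounts to a timestamp filter over recent events. A hedged sketch (the event shape and field names are assumptions, not InfraSage's internal schema):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def correlated_events(anomaly_time: datetime, events: list[dict]) -> list[dict]:
    """Return Kubernetes events that occurred within ±5 minutes of the anomaly."""
    return [e for e in events if abs(e["timestamp"] - anomaly_time) <= WINDOW]

anomaly = datetime(2024, 5, 1, 14, 3)
events = [
    {"reason": "OOMKilled", "timestamp": datetime(2024, 5, 1, 14, 1)},   # 2 min before
    {"reason": "Scaled",    "timestamp": datetime(2024, 5, 1, 13, 40)},  # 23 min before
]
print([e["reason"] for e in correlated_events(anomaly, events)])  # ['OOMKilled']
```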
This enrichment lets Claude produce specific root cause explanations such as:

> "The error rate spike at 14:03 is causally linked to an OOMKill event on checkout-api pod checkout-api-7f9d-4b2x at 14:01. Memory pressure likely caused the process to be evicted, triggering cascading failures in downstream payment-service."
## Namespace-Scoped Queries

The Admin UI and API support filtering by Kubernetes namespace:

```bash
curl http://localhost:8080/api/v1/anomalies \
  -H "Authorization: Bearer $YOUR_JWT" \
  -G --data-urlencode "k8s_namespace=production"
```
## Runbook Actions for Kubernetes

InfraSage runbooks can execute Kubernetes actions directly. Scale a deployment:

```json
{
  "type": "kubernetes",
  "action": "scale",
  "namespace": "production",
  "deployment": "checkout-api",
  "replicas": 5
}
```

Roll back the most recent rollout:

```json
{
  "type": "kubernetes",
  "action": "rollout-undo",
  "namespace": "production",
  "deployment": "checkout-api"
}
```

See Runbooks & Automation for the full action reference.
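For intuition, the scale and rollout-undo actions correspond to familiar kubectl operations. A hypothetical translator (not InfraSage's implementation) makes the mapping explicit:

```python
def to_kubectl(action: dict) -> str:
    """Render a runbook action dict as its equivalent kubectl command."""
    ns, dep = action["namespace"], action["deployment"]
    if action["action"] == "scale":
        return f"kubectl -n {ns} scale deployment/{dep} --replicas={action['replicas']}"
    if action["action"] == "rollout-undo":
        return f"kubectl -n {ns} rollout undo deployment/{dep}"
    raise ValueError(f"unsupported action: {action['action']}")

print(to_kubectl({"type": "kubernetes", "action": "scale",
                  "namespace": "production", "deployment": "checkout-api",
                  "replicas": 5}))
# kubectl -n production scale deployment/checkout-api --replicas=5
```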
## Verification

```bash
# Check that Kubernetes events are being ingested
docker exec infrasage-clickhouse clickhouse-client \
  --user infrasage --password infrasage-dev \
  --query "SELECT service_id, body, timestamp
           FROM infrasage.infrasage_raw_firehose
           WHERE type = 'event'
             AND attributes LIKE '%k8s%'
           ORDER BY timestamp DESC LIMIT 10"
```