
Self-Observability

infrasagent exposes its own operational metrics at the admin API's /metrics endpoint in Prometheus text format. This lets you monitor the agent's health, throughput, and error rate using your existing observability stack.


Metrics

Throughput

| Metric | Labels | Description |
| --- | --- | --- |
| infrasage_agent_records_received_total | source, signal | Total records accepted by each source |
| infrasage_agent_records_exported_total | sink, signal | Total records successfully exported by each sink |

Errors

| Metric | Labels | Description |
| --- | --- | --- |
| infrasage_agent_export_errors_total | sink | Total failed export attempts |

Latency

| Metric | Labels | Description |
| --- | --- | --- |
| infrasage_agent_export_duration_seconds | sink | Histogram of time spent in each export call |

Queue

| Metric | Labels | Description |
| --- | --- | --- |
| infrasage_agent_queue_depth | component | Current number of records waiting in each channel |
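Because the metrics above are served as Prometheus text format, they can also be read without a Prometheus server. The following is a minimal sketch of a parser that sums one metric by one label. The sample payload, its values, and the helper names parse_metrics and sum_by are illustrative assumptions; the real endpoint output will carry different values and may include extra labels.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; real values and label sets will differ.
SAMPLE = """\
# TYPE infrasage_agent_records_exported_total counter
infrasage_agent_records_exported_total{sink="infrasage",signal="logs"} 1200
infrasage_agent_records_exported_total{sink="infrasage",signal="metrics"} 800
infrasage_agent_export_errors_total{sink="infrasage"} 20
infrasage_agent_queue_depth{component="batch_main"} 42
"""

# metric name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def sum_by(samples, metric, label):
    """Sum the values of one metric, grouped by a single label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


if __name__ == "__main__":
    parsed = parse_metrics(SAMPLE)
    print(sum_by(parsed, "infrasage_agent_records_exported_total", "signal"))
```

For production use, a maintained client library is a safer choice than a hand-written regular expression, since this sketch ignores timestamps and escaped label values beyond simple cases.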

Querying the Metrics

# Exported-record counters (raw totals per sink and signal)
curl -s http://localhost:8080/metrics | grep records_exported

# Error count per sink
curl -s http://localhost:8080/metrics | grep export_errors

# Export latency percentiles (PromQL, run in Prometheus; requires Prometheus scraping the agent)
histogram_quantile(0.99,
  rate(infrasage_agent_export_duration_seconds_bucket[5m])
)
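To make the percentile query less opaque, here is a simplified sketch of how a quantile can be estimated from cumulative histogram buckets using linear interpolation inside the matching bucket, which is the general approach histogram_quantile takes. The bucket boundaries and counts are made up for illustration; the agent's actual bucket layout is not documented here.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must include the +Inf bucket, whose count is the total.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies in the overflow bucket, so the best
                # available answer is the highest finite upper bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for one sink: (le, count)
EXAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.95, EXAMPLE_BUCKETS))
```

The estimate is only as fine-grained as the bucket boundaries: a p99 that falls into the +Inf bucket is reported as the largest finite bound, so a flat result at that value suggests the real latency may be higher.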

Admin API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /health | GET | Returns ok when the agent is running |
| /ready | GET | Returns ok when the pipeline is started and all sinks are connected |
| /metrics | GET | Prometheus metrics for the agent itself |
| /topology | GET | JSON representation of the active pipeline DAG |
| /reload | POST | Hot-reload the config file (requires API key if api_keys is set) |
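The difference between /health and /ready can be used to tell a stopped agent from one that is still starting. The sketch below classifies the agent from those two endpoints. It assumes that a healthy response is HTTP 200 with a body of ok; the exact status codes and body format are assumptions, and the names agent_status and http_fetch are hypothetical helpers.

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple

# A fetch function takes a path and returns (status_code, body).
Fetch = Callable[[str], Tuple[int, str]]


def _is_ok(fetch: Fetch, path: str) -> bool:
    """True when the endpoint answers 200 with body 'ok' (assumed format)."""
    try:
        code, body = fetch(path)
    except OSError:
        return False  # connection refused, timeout, DNS failure, ...
    return code == 200 and body.strip() == "ok"


def agent_status(fetch: Fetch) -> str:
    """Classify the agent as 'down', 'starting', or 'ready'."""
    if not _is_ok(fetch, "/health"):
        return "down"
    if not _is_ok(fetch, "/ready"):
        return "starting"
    return "ready"


def http_fetch(base_url: str, timeout: float = 2.0) -> Fetch:
    """Build a fetch function backed by urllib for a given base URL."""
    def fetch(path: str) -> Tuple[int, str]:
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                return resp.status, resp.read().decode()
        except urllib.error.HTTPError as err:
            # Non-2xx responses still carry a status code and body.
            return err.code, err.read().decode()
    return fetch


if __name__ == "__main__":
    print(agent_status(http_fetch("http://localhost:8080")))
```

Passing the fetch function in as a parameter keeps the classification logic testable without a running agent.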

/topology Response

{
  "sources": ["otlp_in", "host_metrics"],
  "processors": ["k8s_enrich", "batch_main"],
  "sinks": ["infrasage"]
}
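A response of this shape is straightforward to consume from a script, for example to confirm after a reload that every stage still has at least one component. The sketch below assumes the response always contains the three keys shown above; the function name summarize_topology is hypothetical.

```python
import json

# The example response shown above, as a raw JSON string.
SAMPLE_TOPOLOGY = """
{
  "sources": ["otlp_in", "host_metrics"],
  "processors": ["k8s_enrich", "batch_main"],
  "sinks": ["infrasage"]
}
"""

STAGES = ("sources", "processors", "sinks")


def summarize_topology(raw):
    """Return (summary_line, empty_stages) for a /topology response body."""
    topology = json.loads(raw)
    # Treat a missing key the same as an empty list.
    empty = [stage for stage in STAGES if not topology.get(stage)]
    parts = [", ".join(topology.get(stage, [])) or "(none)" for stage in STAGES]
    return " -> ".join(parts), empty


if __name__ == "__main__":
    summary, empty = summarize_topology(SAMPLE_TOPOLOGY)
    print(summary)
    if empty:
        print("warning: no components in", ", ".join(empty))
```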

Securing the Admin API

api:
  listen: "0.0.0.0:8080"
  api_keys: ["${ADMIN_API_KEY}"]  # required for /reload; /health and /metrics are public
  read_only: true                 # disallow /reload entirely
  tls:
    cert_file: /etc/ssl/infrasagent/cert.pem
    key_file: /etc/ssl/infrasagent/key.pem

Send the key in either the X-API-Key header or an Authorization: Bearer header:

curl -H "X-API-Key: ${ADMIN_API_KEY}" \
  -X POST http://localhost:8080/reload
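The same request can be built from a script. The sketch below constructs an authenticated POST to /reload with the standard library, using either of the two header styles. The function name build_reload_request and the ADMIN_API_KEY environment variable lookup are illustrative assumptions, and the request is only sent when the file is run directly.

```python
import os
import urllib.request


def build_reload_request(base_url, api_key, use_bearer=False):
    """Build (but do not send) an authenticated POST request to /reload."""
    if use_bearer:
        headers = {"Authorization": f"Bearer {api_key}"}
    else:
        headers = {"X-API-Key": api_key}
    return urllib.request.Request(
        base_url.rstrip("/") + "/reload",
        method="POST",
        headers=headers,
    )


if __name__ == "__main__":
    request = build_reload_request(
        "http://localhost:8080",
        os.environ["ADMIN_API_KEY"],  # assumed to be exported in the shell
    )
    # urlopen raises urllib.error.HTTPError on a non-2xx response,
    # for example when the key is rejected.
    with urllib.request.urlopen(request, timeout=5) as response:
        print(response.status, response.read().decode())
```

When tls is configured for the admin API, the base URL would use https instead of http.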

Scraping with Prometheus

Add a scrape job to your Prometheus config:

scrape_configs:
  - job_name: infrasagent
    static_configs:
      - targets: ["localhost:8080"]
    metrics_path: /metrics
    scrape_interval: 15s

Or use the Prometheus Operator PodMonitor:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: infrasagent
  namespace: infrasage
spec:
  selector:
    matchLabels:
      app: infrasagent
  podMetricsEndpoints:
    - port: admin
      path: /metrics
      interval: 15s

Example Alert Rules

The following rules belong under the rules: key of a Prometheus rule group:

# Alert when a sink is consistently failing
- alert: InfraSageAgentExportErrors
  expr: rate(infrasage_agent_export_errors_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "infrasagent export errors on sink {{ $labels.sink }}"

# Alert when export latency is high
- alert: InfraSageAgentSlowExport
  expr: |
    histogram_quantile(0.95,
      rate(infrasage_agent_export_duration_seconds_bucket[5m])
    ) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "infrasagent p95 export latency > 5s on sink {{ $labels.sink }}"

# Alert when queue depth stays high
- alert: InfraSageAgentQueueBackpressure
  expr: infrasage_agent_queue_depth > 5000
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "infrasagent queue depth high on {{ $labels.component }}"
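To get a feel for the export-error threshold, it helps to see what a per-second rate of 0.1 means for a counter. The sketch below computes a simplified average rate from raw counter samples, treating any decrease as a counter reset. It is an approximation for illustration only: Prometheus's rate() also extrapolates to the window boundaries, and the sample values here are made up.

```python
def simple_rate(samples):
    """Average per-second increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset (for example an agent
    restart), so the new value counts as the increase since the reset.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    samples = sorted(samples)
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr if curr < prev else curr - prev
    return increase / elapsed


if __name__ == "__main__":
    # Hypothetical error-counter samples taken over five minutes.
    window = [(0, 100), (150, 118), (300, 136)]
    rate = simple_rate(window)
    print(f"{rate:.3f} errors/s, alert condition met: {rate > 0.1}")
```

Over a five-minute window, a rate above 0.1 corresponds to more than 30 failed export attempts.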