
Telemetry Quality Scoring (TQS)

Telemetry Quality Scoring gives every service a continuous, objective grade for the quality of its incoming telemetry. A low score means InfraSage's anomaly detection and RCA capabilities are operating with degraded signal — which is often why incidents get missed. TQS surfaces the specific gaps and tells you exactly what to fix.


Why Telemetry Quality Matters

Anomaly detection is only as good as the data feeding it. Common failure modes:

  • A service goes silent for 30 minutes, so the watchdog sees nothing to flag
  • Logs and traces arrive but no latency metrics are emitted: error-rate detection still works, but RCA has no causal chain
  • High-cardinality telemetry is sampled to near-zero, so baselines become noisy and Z-scores lose meaning

TQS makes these blind spots visible before they cause missed detections.


Scoring Model

Each service receives an Overall Score from 0–100, computed as the sum of five independent dimension scores, each worth up to 20 points:

| Dimension | Max | What it measures |
|---|---|---|
| Volume | 20 | Raw telemetry event rate over the last hour vs. expected baseline |
| Freshness | 20 | Recency of last received event; penalizes staleness and gaps |
| Variety | 20 | Coverage of signal types: metrics, logs, traces, errors |
| Consistency | 20 | Temporal stability of the arrival rate; penalizes bursty or irregular delivery |
| Error Signal | 20 | Presence of error-rate and latency metrics needed for incident detection |
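
As a minimal sketch of the arithmetic, the Overall Score is simply the five dimension scores added together. The `DimensionScores` class and its field names are hypothetical and only illustrate the sum; how InfraSage derives each individual dimension score is not covered here.

```python
from dataclasses import dataclass

DIMENSION_MAX = 20  # each of the five dimensions is worth up to 20 points


@dataclass(frozen=True)
class DimensionScores:
    """Hypothetical container for the five per-dimension scores."""
    volume: int
    freshness: int
    variety: int
    consistency: int
    error_signal: int

    def overall(self) -> int:
        """Sum the five dimension scores into the 0-100 Overall Score."""
        parts = (self.volume, self.freshness, self.variety,
                 self.consistency, self.error_signal)
        for value in parts:
            if not 0 <= value <= DIMENSION_MAX:
                raise ValueError(f"dimension score out of range: {value}")
        return sum(parts)


scores = DimensionScores(volume=18, freshness=20, variety=14,
                         consistency=12, error_signal=10)
print(scores.overall())  # 74
```

Because the dimensions are summed without weighting, a service that loses one dimension entirely can score at most 80.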

Score Bands

| Band | Score | Meaning |
|---|---|---|
| Excellent | 80–100 | Full detection capability: all signals healthy |
| Good | 60–79 | Minor gaps; detection largely intact |
| Partial | 40–59 | Meaningful gaps; some incident types will be missed |
| Blind | 0–39 | Severely degraded signal; do not rely on automated detection |
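
The band boundaries above can be expressed as a small lookup. This is a sketch, not InfraSage's own code: the function name `score_band` is hypothetical, and the lowercase band strings are an assumption based on the lowercase values used in the API and SQL examples in this document.

```python
def score_band(overall_score: int) -> str:
    """Map a 0-100 Overall Score to its band name (lowercase, assumed)."""
    if not 0 <= overall_score <= 100:
        raise ValueError(f"overall score out of range: {overall_score}")
    if overall_score >= 80:
        return "excellent"
    if overall_score >= 60:
        return "good"
    if overall_score >= 40:
        return "partial"
    return "blind"


for value in (100, 74, 59, 39):
    print(value, score_band(value))
```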

Causal Readiness Score

In addition to the 0–100 Overall Score, TQS computes a separate Causal Readiness Score from 0–10. This indicates how well the service's telemetry supports Causal Anomaly Detection (CIAD).

Causal Readiness is informational and does not contribute to the Overall Score:

| Value | Meaning |
|---|---|
| 8–10 | CIAD is fully operational for this service |
| 4–7 | CIAD active but operating on a subset of causal invariants |
| 0–3 | Insufficient embedding history for causal detection |

Improvement Actions

When a dimension scores below its threshold, TQS generates actionable recommendations. Examples:

  • "Increase Prometheus scrape frequency — current gap of 3 min exceeds the 1 min freshness target"
  • "Error-rate metric http_server_errors_total is missing — ensure your HTTP middleware emits it"
  • "Log volume dropped 94% in the last hour — check your log shipper and sampling configuration"

Improvement Actions appear in the UI on the Telemetry Quality page and on each Service Detail page.


Reporting a Missed Detection

If InfraSage failed to detect an incident, the Missed Detection feedback loop lets you report it:

POST /api/v1/services/{service_id}/missed-detection
Content-Type: application/json

{
  "incident_start": "2026-05-09T02:00:00Z",
  "incident_end": "2026-05-09T02:45:00Z",
  "description": "Payment latency spiked to 8 s, no alert was fired"
}

InfraSage will:

  1. Re-examine telemetry quality during the incident window
  2. Identify which dimensions were degraded at that time
  3. Return a diagnosis_code and ranked remediation steps

This feedback is used to continuously improve detection thresholds and baseline calibration.
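
The report can also be built programmatically. The sketch below uses only the Python standard library and constructs the request without sending it. The helper name `build_missed_detection_request`, the example host, and the use of the same `Authorization: Bearer` header as the read endpoints are assumptions; only the path and the three body fields come from the request shown above.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone


def _utc_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime like 2026-05-09T02:00:00Z."""
    if moment.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_missed_detection_request(base_url, token, service_id,
                                   start, end, description):
    """Build (but do not send) a missed-detection report (hypothetical helper)."""
    if end <= start:
        raise ValueError("incident_end must be after incident_start")
    payload = {
        "incident_start": _utc_timestamp(start),
        "incident_end": _utc_timestamp(end),
        "description": description,
    }
    service = urllib.parse.quote(service_id, safe="")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/services/{service}/missed-detection",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the POST endpoint accepts the same bearer token
            # as the GET endpoints documented on this page.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_missed_detection_request(
    "https://infrasage.example.com",  # placeholder host
    "example-token",                  # placeholder token
    "payment-api",
    datetime(2026, 5, 9, 2, 0, tzinfo=timezone.utc),
    datetime(2026, 5, 9, 2, 45, tzinfo=timezone.utc),
    "Payment latency spiked to 8 s, no alert was fired",
)
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen(request)` would send it. The response is described above as containing a diagnosis_code and ranked remediation steps; the exact field layout of that response is not specified here, so the sketch does not parse it.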


Viewing Telemetry Quality

Via the UI

Navigate to Telemetry Quality in the sidebar. The fleet view shows all services sorted worst-first, with band badges and mini dimension bars. Click any row to expand the full dimension breakdown and improvement actions.

On a Service Detail page, the Telemetry Quality section appears directly below the weirdness score timeline.

Via API

# Fleet-wide latest scores
curl "$INFRASAGE_URL/api/v1/telemetry-quality" \
  -H "Authorization: Bearer $TOKEN"

# On-demand compute for a single service
curl "$INFRASAGE_URL/api/v1/services/payment-api/telemetry-quality" \
  -H "Authorization: Bearer $TOKEN"

Example response:

{
  "service_id": "payment-api",
  "computed_at": "2026-05-09T04:30:00Z",
  "overall_score": 74,
  "band": "good",
  "volume_score": 18,
  "freshness_score": 20,
  "variety_score": 14,
  "consistency_score": 12,
  "error_signal_score": 10,
  "causal_readiness_score": 9,
  "improvement_actions": [
    "Increase variety: traces are missing — enable OpenTelemetry trace export",
    "Error signal: http_server_errors_total not found in last 6h — check instrumentation"
  ]
}
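
A client can sanity-check such a response and decide where to focus first. The following sketch, using a trimmed copy of the example response, verifies that the five dimension fields add up to overall_score and orders the dimensions weakest-first. The function name `summarize_quality` is hypothetical; the field names are taken from the example response.

```python
import json

DIMENSION_FIELDS = (
    "volume_score",
    "freshness_score",
    "variety_score",
    "consistency_score",
    "error_signal_score",
)


def summarize_quality(response: dict) -> dict:
    """Check the score arithmetic and rank dimensions from weakest to strongest."""
    dimensions = {name: response[name] for name in DIMENSION_FIELDS}
    return {
        "sum_matches_overall": sum(dimensions.values()) == response["overall_score"],
        "weakest_first": sorted(dimensions, key=dimensions.get),
    }


example = json.loads("""
{
  "service_id": "payment-api",
  "overall_score": 74,
  "band": "good",
  "volume_score": 18,
  "freshness_score": 20,
  "variety_score": 14,
  "consistency_score": 12,
  "error_signal_score": 10,
  "causal_readiness_score": 9
}
""")

summary = summarize_quality(example)
print(summary["sum_matches_overall"])  # True: 18 + 20 + 14 + 12 + 10 == 74
print(summary["weakest_first"][0])     # error_signal_score
```

Note that causal_readiness_score is deliberately left out of the sum, since it does not contribute to the Overall Score.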

Via ClickHouse SQL

-- Latest score per service
SELECT
    service_id,
    overall_score,
    band,
    volume_score,
    freshness_score,
    variety_score,
    consistency_score,
    error_signal_score,
    causal_readiness_score,
    computed_at
FROM infrasage_telemetry_quality_scores
ORDER BY service_id, computed_at DESC
LIMIT 1 BY service_id;

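-- Services whose latest score is in the "blind" or "partial" band.
-- The latest row per service is selected first, then filtered, so a
-- service that has since recovered is not reported from an older row.
SELECT service_id, overall_score, band
FROM (
    SELECT service_id, overall_score, band
    FROM infrasage_telemetry_quality_scores
    ORDER BY service_id, computed_at DESC
    LIMIT 1 BY service_id
)
WHERE band IN ('blind', 'partial')
ORDER BY overall_score ASC;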
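
When score history is already loaded into memory, the same "latest row per service, then filter by band" logic can be sketched in Python. The function names and the sample rows are made up for illustration and are not real data; the column names follow the queries above, and computed_at values are assumed to be UTC ISO-8601 strings in one consistent format so that string comparison matches time order.

```python
def latest_per_service(rows):
    """Keep only the most recent row for each service_id."""
    latest = {}
    for row in rows:
        current = latest.get(row["service_id"])
        if current is None or row["computed_at"] > current["computed_at"]:
            latest[row["service_id"]] = row
    return list(latest.values())


def services_needing_attention(rows):
    """Services whose latest score is 'blind' or 'partial', worst first."""
    flagged = [row for row in latest_per_service(rows)
               if row["band"] in ("blind", "partial")]
    return sorted(flagged, key=lambda row: row["overall_score"])


# Illustrative rows only (hypothetical history).
history = [
    {"service_id": "payment-api", "computed_at": "2026-05-09T03:30:00Z",
     "overall_score": 55, "band": "partial"},
    {"service_id": "payment-api", "computed_at": "2026-05-09T04:30:00Z",
     "overall_score": 74, "band": "good"},
    {"service_id": "search-api", "computed_at": "2026-05-09T04:30:00Z",
     "overall_score": 45, "band": "partial"},
    {"service_id": "checkout-api", "computed_at": "2026-05-09T04:30:00Z",
     "overall_score": 30, "band": "blind"},
]

for row in services_needing_attention(history):
    print(row["service_id"], row["overall_score"], row["band"])
```

In this sample, payment-api is not reported: its older row was in the partial band, but its latest row is in the good band.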

Scoring Frequency

TQS scores are computed on demand for each request to the per-service API endpoint, and are also persisted periodically by the AIops Engine background worker. The persisted scores power the fleet overview table and historical trend queries.