Telemetry Quality Scoring (TQS)
Telemetry Quality Scoring gives every service a continuous, objective grade for the quality of its incoming telemetry. A low score means InfraSage's anomaly detection and RCA capabilities are operating with degraded signal — which is often why incidents get missed. TQS surfaces the specific gaps and tells you exactly what to fix.
Why Telemetry Quality Matters
Anomaly detection is only as good as the data feeding it. Common failure modes:
- A service goes silent for 30 minutes — the watchdog sees nothing to flag
- Logs and traces arrive but no latency metrics are emitted — error-rate detection still works, but RCA has no causal chain
- High-cardinality telemetry is sampled to near-zero, so baselines become noisy and Z-scores lose meaning
TQS makes these blind spots visible before they cause missed detections.
Scoring Model
Each service receives an Overall Score from 0–100, computed as the sum of five independent dimensions:
| Dimension | Max | What it measures |
|---|---|---|
| Volume | 20 | Raw telemetry event rate over the last hour vs. expected baseline |
| Freshness | 20 | Recency of last received event — penalizes staleness/gaps |
| Variety | 20 | Coverage of signal types: metrics, logs, traces, errors |
| Consistency | 20 | Temporal stability of the arrival rate — penalizes bursty/irregular delivery |
| Error Signal | 20 | Presence of error-rate and latency metrics needed for incident detection |
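As a quick illustration of the additive model, the sketch below sums five dimension scores and checks that each stays within its 0–20 cap. The function name `overall_score` and the constants are illustrative assumptions. The dictionary keys borrow the field names from the example API response later on this page. How InfraSage derives each dimension internally is not covered here.

```python
# Hypothetical helper: names are illustrative, not part of the InfraSage API.
DIMENSION_MAX = 20
DIMENSIONS = (
    "volume_score",
    "freshness_score",
    "variety_score",
    "consistency_score",
    "error_signal_score",
)


def overall_score(scores: dict) -> int:
    """Sum the five dimension scores after checking each is within 0..20."""
    total = 0
    for name in DIMENSIONS:
        value = scores[name]
        if not 0 <= value <= DIMENSION_MAX:
            raise ValueError(f"{name}={value} is outside 0..{DIMENSION_MAX}")
        total += value
    return total


# Dimension values taken from the example API response on this page.
example = {
    "volume_score": 18,
    "freshness_score": 20,
    "variety_score": 14,
    "consistency_score": 12,
    "error_signal_score": 10,
}
print(overall_score(example))  # 18 + 20 + 14 + 12 + 10 = 74
```

Because every dimension is capped at 20, a single missing signal type can cost at most one fifth of the total.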
Score Bands
| Band | Score | Meaning |
|---|---|---|
| Excellent | 80–100 | Full detection capability — all signals healthy |
| Good | 60–79 | Minor gaps — detection largely intact |
| Partial | 40–59 | Meaningful gaps — some incident types will be missed |
| Blind | 0–39 | Severely degraded signal — do not rely on automated detection |
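The band boundaries in the table above can be expressed as a small lookup. This is a minimal sketch. The function name `score_band` is hypothetical, and the lowercase band strings follow the `"good"`, `"partial"` and `"blind"` values seen in the API and SQL examples on this page, with `"excellent"` assumed by analogy.

```python
def score_band(overall: int) -> str:
    """Map a 0..100 Overall Score to its band name, per the Score Bands table."""
    if not 0 <= overall <= 100:
        raise ValueError(f"overall score {overall} is outside 0..100")
    if overall >= 80:
        return "excellent"
    if overall >= 60:
        return "good"
    if overall >= 40:
        return "partial"
    return "blind"


print(score_band(74))  # good
```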
Causal Readiness Score
In addition to the 0–100 Overall Score, TQS computes a separate Causal Readiness Score from 0–10. This indicates how well the service's telemetry supports Causal Anomaly Detection (CIAD).
Causal Readiness is informational and does not contribute to the Overall Score. Its values are interpreted as follows:
| Value | Meaning |
|---|---|
| 8–10 | CIAD is fully operational for this service |
| 4–7 | CIAD active but operating on a subset of causal invariants |
| 0–3 | Insufficient embedding history for causal detection |
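For dashboards or scripts that consume the score, the table above could be encoded as follows. This is only a sketch. The function name and the short label strings are assumptions made for illustration, not values returned by InfraSage.

```python
def causal_readiness_label(score: int) -> str:
    """Translate a 0..10 Causal Readiness Score into a short status label."""
    if not 0 <= score <= 10:
        raise ValueError(f"causal readiness score {score} is outside 0..10")
    if score >= 8:
        return "fully operational"
    if score >= 4:
        return "partial: subset of causal invariants"
    return "insufficient embedding history"


print(causal_readiness_label(9))  # fully operational
```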
Improvement Actions
When a dimension scores below its threshold, TQS generates actionable recommendations. Examples:
- "Increase Prometheus scrape frequency — current gap of 3 min exceeds the 1 min freshness target"
- "Error-rate metric http_server_errors_total is missing — ensure your HTTP middleware emits it"
- "Log volume dropped 94% in the last hour — check your log shipper and sampling configuration"
Improvement Actions appear in the UI on the Telemetry Quality page and on each Service Detail page.
Reporting a Missed Detection
If InfraSage failed to detect an incident, the Missed Detection feedback loop lets you report it:
POST /api/v1/services/{service_id}/missed-detection
Content-Type: application/json
{
"incident_start": "2026-05-09T02:00:00Z",
"incident_end": "2026-05-09T02:45:00Z",
"description": "Payment latency spiked to 8 s, no alert was fired"
}
InfraSage will:
- Re-examine telemetry quality during the incident window
- Identify which dimensions were degraded at that time
- Return a diagnosis_code and ranked remediation steps
This feedback is used to continuously improve detection thresholds and baseline calibration.
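A report like the one above could be assembled from a script along these lines. This is a sketch using only the Python standard library. The helper name `build_missed_detection_request`, the example base URL and the token value are hypothetical. The `Authorization: Bearer` header is assumed to be required here because the other API examples on this page use it. The function builds the request without sending it.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_missed_detection_request(base_url, service_id, start, end,
                                   description, token):
    """Build (but do not send) the POST request for a missed-detection report."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware datetimes")
    if end <= start:
        raise ValueError("incident_end must be after incident_start")

    def to_utc(dt):
        # Render as UTC with a trailing "Z", matching the example payload.
        return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    payload = {
        "incident_start": to_utc(start),
        "incident_end": to_utc(end),
        "description": description,
    }
    url = f"{base_url.rstrip('/')}/api/v1/services/{service_id}/missed-detection"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed, see lead-in
        },
        method="POST",
    )


req = build_missed_detection_request(
    "https://infrasage.example.com",  # placeholder base URL
    "payment-api",
    datetime(2026, 5, 9, 2, 0, tzinfo=timezone.utc),
    datetime(2026, 5, 9, 2, 45, tzinfo=timezone.utc),
    "Payment latency spiked to 8 s, no alert was fired",
    token="example-token",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send the report. The response fields other than `diagnosis_code` are not documented on this page, so parsing is left out.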
Viewing Telemetry Quality
Via the UI
Navigate to Telemetry Quality in the sidebar. The fleet view shows all services sorted worst-first, with band badges and mini dimension bars. Click any row to expand the full dimension breakdown and improvement actions.
On a Service Detail page, the Telemetry Quality section appears directly below the weirdness score timeline.
Via API
# Fleet-wide latest scores
curl $INFRASAGE_URL/api/v1/telemetry-quality \
-H "Authorization: Bearer $TOKEN"
# On-demand compute for a single service
curl $INFRASAGE_URL/api/v1/services/payment-api/telemetry-quality \
-H "Authorization: Bearer $TOKEN"
Example response:
{
"service_id": "payment-api",
"computed_at": "2026-05-09T04:30:00Z",
"overall_score": 74,
"band": "good",
"volume_score": 18,
"freshness_score": 20,
"variety_score": 14,
"consistency_score": 12,
"error_signal_score": 10,
"causal_readiness_score": 9,
"improvement_actions": [
"Increase variety: traces are missing — enable OpenTelemetry trace export",
"Error signal: http_server_errors_total not found in last 6h — check instrumentation"
]
}
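A consumer of this response might want to confirm that the dimensions add up to the overall score and see which dimensions are costing the most points. The sketch below does both for the example payload above. The helper names are hypothetical, and the field names are taken from that example.

```python
import json

# The example response from this page, with improvement_actions omitted.
RESPONSE = """
{
  "service_id": "payment-api",
  "overall_score": 74,
  "band": "good",
  "volume_score": 18,
  "freshness_score": 20,
  "variety_score": 14,
  "consistency_score": 12,
  "error_signal_score": 10,
  "causal_readiness_score": 9
}
"""

DIMENSION_FIELDS = (
    "volume_score",
    "freshness_score",
    "variety_score",
    "consistency_score",
    "error_signal_score",
)


def dimensions_match_overall(doc: dict) -> bool:
    """True if the five dimension scores sum to overall_score."""
    return sum(doc[name] for name in DIMENSION_FIELDS) == doc["overall_score"]


def weakest_dimensions(doc: dict, limit: int = 2) -> list:
    """Return the `limit` lowest-scoring dimensions, lowest first."""
    ranked = sorted(DIMENSION_FIELDS, key=lambda name: doc[name])
    return [(name, doc[name]) for name in ranked[:limit]]


doc = json.loads(RESPONSE)
print(dimensions_match_overall(doc))  # True
print(weakest_dimensions(doc))
# [('error_signal_score', 10), ('consistency_score', 12)]
```

Note that `causal_readiness_score` is deliberately excluded from the sum, since it does not contribute to the Overall Score.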
Via ClickHouse SQL
-- Latest score per service
SELECT
service_id,
overall_score,
band,
volume_score,
freshness_score,
variety_score,
consistency_score,
error_signal_score,
causal_readiness_score,
computed_at
FROM infrasage_telemetry_quality_scores
ORDER BY service_id, computed_at DESC
LIMIT 1 BY service_id;
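-- Services whose latest score is in the "blind" or "partial" band
SELECT service_id, overall_score, band
FROM (
    -- Take the most recent row per service first, then filter by band
    SELECT service_id, overall_score, band
    FROM infrasage_telemetry_quality_scores
    ORDER BY service_id, computed_at DESC
    LIMIT 1 BY service_id
)
WHERE band IN ('blind', 'partial')
ORDER BY overall_score ASC;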
Scoring Frequency
TQS scores are computed on demand (per-request for the API endpoint) and also persisted periodically by the AIops Engine background worker. Persisted scores are used to power the fleet overview table and historical trend queries.
Related
- Causal Anomaly Detection — the CIAD pre-fault detection system that Causal Readiness Score reflects
- Anomaly Detection — the statistical and ML watchdog that TQS feeds into