Telemetry Quality Scoring (TQS)

Telemetry Quality Scoring gives every service a continuous, objective grade for the quality of its incoming telemetry. A low score means InfraSage's anomaly detection and RCA capabilities are operating with degraded signal — which is often why incidents get missed. TQS surfaces the specific gaps and tells you exactly what to fix.

Why Telemetry Quality Matters

Anomaly detection is only as good as the data feeding it. Common failure modes:

A service goes silent for 30 minutes — the watchdog sees nothing to flag
Logs and traces arrive but no latency metrics are emitted — error-rate detection still works, but RCA has no causal chain
High-cardinality telemetry is sampled to near-zero, so baselines become noisy and Z-scores lose meaning

TQS makes these blind spots visible before they cause missed detections.

Scoring Model

Each service receives an Overall Score from 0 to 100, computed as the sum of five independent dimension scores, each worth up to 20 points:

Dimension	Max	What it measures
Volume	20	Raw telemetry event rate over the last hour vs. expected baseline
Freshness	20	Recency of last received event — penalizes staleness/gaps
Variety	20	Coverage of signal types: metrics, logs, traces, errors
Consistency	20	Temporal stability of the arrival rate — penalizes bursty/irregular delivery
Error Signal	20	Presence of error-rate and latency metrics needed for incident detection
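
The arithmetic behind the table can be sketched as follows. This is a minimal illustration that assumes the five dimension scores have already been computed; how TQS derives each one is not specified here. The function name `overall_score` and the dictionary keys are hypothetical and not part of the InfraSage API.

```python
# Illustrative sketch only: names are hypothetical, not InfraSage identifiers.
DIMENSIONS = ("volume", "freshness", "variety", "consistency", "error_signal")
MAX_PER_DIMENSION = 20  # each dimension contributes at most 20 points


def overall_score(scores: dict) -> int:
    """Sum the five dimension scores into a 0-100 Overall Score."""
    total = 0
    for name in DIMENSIONS:
        value = scores[name]
        if not 0 <= value <= MAX_PER_DIMENSION:
            raise ValueError(
                f"{name} must be between 0 and {MAX_PER_DIMENSION}, got {value}"
            )
        total += value
    return total


example = {
    "volume": 18,
    "freshness": 20,
    "variety": 14,
    "consistency": 12,
    "error_signal": 10,
}
print(overall_score(example))  # 74
```

Because the dimensions are summed with equal weight, a service that scores zero on one dimension cannot exceed 80 overall.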

Score Bands

Band	Score	Meaning
Excellent	80–100	Full detection capability — all signals healthy
Good	60–79	Minor gaps — detection largely intact
Partial	40–59	Meaningful gaps — some incident types will be missed
Blind	0–39	Severely degraded signal — do not rely on automated detection
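
The band boundaries above translate into a simple lookup. The sketch below assumes integer scores and lowercase band names; `good`, `partial` and `blind` appear in that form in the API and SQL examples, while `excellent` is assumed by analogy. The function name `band_for` is hypothetical.

```python
# Illustrative sketch only: band_for is a hypothetical helper, not an InfraSage API.
def band_for(score: int) -> str:
    """Map a 0-100 Overall Score to its band name."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 80:
        return "excellent"
    if score >= 60:
        return "good"
    if score >= 40:
        return "partial"
    return "blind"


print(band_for(74))  # good
```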

Causal Readiness Score

In addition to the 0–100 Overall Score, TQS computes a separate Causal Readiness Score from 0–10. This indicates how well the service's telemetry supports Causal Anomaly Detection (CIAD).

Causal Readiness is informational and does not contribute to the Overall Score. Its values are interpreted as follows:

Value	Meaning
8–10	CIAD is fully operational for this service
4–7	CIAD active but operating on a subset of causal invariants
0–3	Insufficient embedding history for causal detection

Improvement Actions

When a dimension scores below its threshold, TQS generates actionable recommendations. Examples:

"Increase Prometheus scrape frequency — current gap of 3 min exceeds the 1 min freshness target"
"Error-rate metric http_server_errors_total is missing — ensure your HTTP middleware emits it"
"Log volume dropped 94% in the last hour — check your log shipper and sampling configuration"

Improvement Actions appear in the UI on the Telemetry Quality page and on each Service Detail page.

Reporting a Missed Detection

If InfraSage failed to detect an incident, the Missed Detection feedback loop lets you report it:

POST /api/v1/services/{service_id}/missed-detection
Content-Type: application/json

{
  "incident_start": "2026-05-09T02:00:00Z",
  "incident_end":   "2026-05-09T02:45:00Z",
  "description":    "Payment latency spiked to 8 s, no alert was fired"
}

InfraSage will:

Re-examine telemetry quality during the incident window
Identify which dimensions were degraded at that time
Return a diagnosis_code and ranked remediation steps

This feedback is used to continuously improve detection thresholds and baseline calibration.
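
As a rough sketch, the request above could be built from Python with the standard library. The base URL, the token value and the helper name `build_missed_detection_request` are placeholders. The `Authorization` header is an assumption carried over from the curl examples elsewhere on this page, since the request above does not show one. The response is not parsed here because its exact field layout is not documented in this section.

```python
import json
import urllib.parse
import urllib.request


def build_missed_detection_request(base_url, token, service_id,
                                   incident_start, incident_end, description):
    """Build (but do not send) a POST to the missed-detection endpoint."""
    body = json.dumps({
        "incident_start": incident_start,
        "incident_end": incident_end,
        "description": description,
    }).encode("utf-8")
    # Percent-encode the service ID so it is safe to place in the URL path.
    path = f"/api/v1/services/{urllib.parse.quote(service_id, safe='')}/missed-detection"
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed, see note above
        },
    )


request = build_missed_detection_request(
    base_url="https://infrasage.example.com",  # placeholder host
    token="REPLACE_ME",                        # placeholder token
    service_id="payment-api",
    incident_start="2026-05-09T02:00:00Z",
    incident_end="2026-05-09T02:45:00Z",
    description="Payment latency spiked to 8 s, no alert was fired",
)
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen(request)` and decode the JSON body of the response.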

Viewing Telemetry Quality

Via the UI

Navigate to Telemetry Quality in the sidebar. The fleet view shows all services sorted worst-first, with band badges and mini dimension bars. Click any row to expand the full dimension breakdown and improvement actions.

On a Service Detail page, the Telemetry Quality section appears directly below the weirdness score timeline.

Via API

# Fleet-wide latest scores
curl "$INFRASAGE_URL/api/v1/telemetry-quality" \
  -H "Authorization: Bearer $TOKEN"

# On-demand compute for a single service
curl "$INFRASAGE_URL/api/v1/services/payment-api/telemetry-quality" \
  -H "Authorization: Bearer $TOKEN"

Example response:

{
  "service_id": "payment-api",
  "computed_at": "2026-05-09T04:30:00Z",
  "overall_score": 74,
  "band": "good",
  "volume_score": 18,
  "freshness_score": 20,
  "variety_score": 14,
  "consistency_score": 12,
  "error_signal_score": 10,
  "causal_readiness_score": 9,
  "improvement_actions": [
    "Increase variety: traces are missing — enable OpenTelemetry trace export",
    "Error signal: http_server_errors_total not found in last 6h — check instrumentation"
  ]
}
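
A client might use a response like this to find the dimensions that most need attention. The sketch below works on the example payload above, with the improvement actions omitted for brevity. The helper names `weakest_dimensions` and `is_consistent` are hypothetical. The consistency check relies only on the rule that the Overall Score is the sum of the five dimension scores.

```python
import json

# Field names taken from the example response on this page.
DIMENSION_FIELDS = (
    "volume_score",
    "freshness_score",
    "variety_score",
    "consistency_score",
    "error_signal_score",
)


def weakest_dimensions(payload: dict) -> list:
    """Return (field, score) pairs sorted from lowest to highest score."""
    pairs = [(field, payload[field]) for field in DIMENSION_FIELDS]
    return sorted(pairs, key=lambda pair: pair[1])


def is_consistent(payload: dict) -> bool:
    """Check that the dimension scores add up to the reported overall score."""
    return sum(payload[field] for field in DIMENSION_FIELDS) == payload["overall_score"]


response = json.loads("""
{
  "service_id": "payment-api",
  "overall_score": 74,
  "band": "good",
  "volume_score": 18,
  "freshness_score": 20,
  "variety_score": 14,
  "consistency_score": 12,
  "error_signal_score": 10,
  "causal_readiness_score": 9
}
""")

print(is_consistent(response))          # True
print(weakest_dimensions(response)[0])  # ('error_signal_score', 10)
```

Here the weakest dimension is Error Signal at 10 of 20, which matches the second improvement action in the example response.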

Via ClickHouse SQL

-- Latest score per service
SELECT
  service_id,
  overall_score,
  band,
  volume_score,
  freshness_score,
  variety_score,
  consistency_score,
  error_signal_score,
  causal_readiness_score,
  computed_at
FROM infrasage_telemetry_quality_scores
ORDER BY service_id, computed_at DESC
LIMIT 1 BY service_id;

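-- Services whose latest score is in the "blind" or "partial" band
SELECT service_id, overall_score, band
FROM (
  -- Take the most recent row per service first, then filter by band
  SELECT service_id, overall_score, band
  FROM infrasage_telemetry_quality_scores
  ORDER BY service_id, computed_at DESC
  LIMIT 1 BY service_id
)
WHERE band IN ('blind', 'partial')
ORDER BY overall_score ASC;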

Scoring Frequency

TQS scores are computed on demand for each request to the single-service API endpoint, and are also persisted periodically by the AIops Engine background worker. The persisted scores power the fleet overview table and historical trend queries.

Related

Causal Anomaly Detection: the CIAD pre-fault detection system that the Causal Readiness Score reflects
Anomaly Detection: the statistical and ML watchdog that TQS feeds into
