LLM observability in 2026: why OpenTelemetry and evals must run together

A short technical guide for SREs and AI engineers: why LLM observability is different, what to cover with tracing and metrics, and when to add online evals without making them another silo.

LLM Observability, OpenTelemetry, Evals, SRE
Wasyra Engineering
Modernization, architecture, and reliable delivery
Published
March 30, 2026
8 min read
Category
Engineering
OTel: de-facto standard for LLM observability

Chapter 01

Why monitoring an LLM is not monitoring a microservice

A microservice can largely be judged on three numbers: latency, error rate, and throughput. An LLM adds three axes that matter: cost (billed per token, not per request), subjective quality (whether the output is actually useful), and drift (the same question may produce a different answer tomorrow).

  • Token tracking per request: input, output, and total — and how it maps to USD.
  • Decomposed latency: prompt processing vs decode vs tool-use overhead.
  • Quality signals: human acceptance, prompt rewrites, session abandonment.
  • Drift: output variation on the same input over time.
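
The token-to-USD mapping in the first bullet can be sketched as below. This is a minimal illustration, not a billing implementation: the model name `example-model` and the prices in `PRICES_PER_MILLION` are made-up assumptions, and real providers may price cached or reasoning tokens separately.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; replace with your provider's rates.
PRICES_PER_MILLION = {
    "example-model": {"input": 3.00, "output": 15.00},
}


@dataclass(frozen=True)
class Usage:
    """Token counts reported for a single LLM request."""
    input_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


def cost_usd(model: str, usage: Usage) -> float:
    """Convert token counts to USD using the per-million price table."""
    price = PRICES_PER_MILLION[model]
    return (
        usage.input_tokens * price["input"]
        + usage.output_tokens * price["output"]
    ) / 1_000_000


usage = Usage(input_tokens=1_200, output_tokens=300)
request_cost = cost_usd("example-model", usage)
print(usage.total_tokens, round(request_cost, 6))
```

Input and output tokens are priced separately here because output tokens usually cost more, so a single "total tokens" number would hide where the spend comes from.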

Chapter 02

OpenTelemetry + GenAI semantic conventions: the standard baseline

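OTel's GenAI Semantic Conventions define common span attributes for LLM calls, such as the operation, the requested model, input and output token usage, and tool invocations. Values like cost or RAG hit rate are not, as far as the current conventions go, standard attributes; derive them from token counts or record them as custom attributes. Because the attribute names are shared, your instrumentation is portable: today to Datadog, tomorrow to New Relic, without rewriting the instrumentation.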

  • OpenLLMetry (Traceloop) extends OTel for LLMs with non-intrusive instrumentation.
  • Compatible with Datadog, New Relic, Sentry, Honeycomb, and Grafana Cloud.
  • LangSmith and Confident AI add a quality layer on top of tracing.
  • Avoid single-vendor proprietary instrumentation — bet on OTel.
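
As a rough sketch of what this looks like in practice, the helper below builds a span name and attribute set using `gen_ai.*` keys. The helper `genai_span` is hypothetical, and the attribute names reflect the GenAI conventions as I understand them; the conventions are still evolving, so check them against the version you pin. The block deliberately avoids importing the OpenTelemetry SDK so that it runs on its own.

```python
from typing import Dict, Tuple, Union

AttributeValue = Union[str, int]


def genai_span(
    operation: str, model: str, input_tokens: int, output_tokens: int
) -> Tuple[str, Dict[str, AttributeValue]]:
    """Return (span_name, attributes) for one LLM call.

    The span name follows the "{operation} {model}" pattern used by the
    GenAI conventions; the keys are assumed to match the pinned version.
    """
    attributes: Dict[str, AttributeValue] = {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    return f"{operation} {model}", attributes


name, attrs = genai_span("chat", "example-model", 1200, 300)
print(name)
for key, value in attrs.items():
    print(f"  {key} = {value}")

# With the OpenTelemetry API installed, the same values would be attached via
# tracer.start_as_current_span(name) and span.set_attributes(attrs).
```

Keeping the attribute construction in one place makes it easier to follow convention changes without touching every call site.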

Chapter 03

When to add online evals (and how not to make them a silo)

Online evaluation means running automatic evaluators over production traces to flag problematic responses and route the flagged ones into a test set. It doesn't replace human review, but it lets you spot issues at a scale humans can't cover.

  • Start with tracing and metrics. Only after a stable quarter, add online evals.
  • Hook evaluators into the same OTel backend — don't build a separate dashboard.
  • Use LLM-as-judge only for cases where you trust its consistency.
  • Auto-curate the test set: flagged samples become future regression cases.

Observability and evals are complementary, but you don't need both on day one. Start with the first; add the second when production shows the first isn't enough.
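
The loop described in this chapter (evaluate traces, flag problems, curate a regression set) can be sketched as follows. Everything here is illustrative: `TraceRecord`, the two heuristic evaluators, and the 2000-character threshold are assumptions, and a real setup would read traces from your OTel backend and write the flags back as attributes or events on the same traces rather than hold them in memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TraceRecord:
    """Simplified stand-in for an exported LLM trace."""
    trace_id: str
    prompt: str
    response: str
    eval_flags: List[str] = field(default_factory=list)


# An evaluator returns True when the response looks problematic.
Evaluator = Callable[[TraceRecord], bool]


def empty_response(trace: TraceRecord) -> bool:
    return not trace.response.strip()


def too_long(trace: TraceRecord) -> bool:
    return len(trace.response) > 2000  # arbitrary illustrative threshold


EVALUATORS: Dict[str, Evaluator] = {
    "empty_response": empty_response,
    "too_long": too_long,
}


def run_online_evals(
    traces: List[TraceRecord], evaluators: Dict[str, Evaluator]
) -> List[TraceRecord]:
    """Attach flags to each trace and return the flagged ones."""
    regression_set = []
    for trace in traces:
        # Flags live on the trace itself, so they land in the same backend.
        trace.eval_flags = [
            name for name, check in evaluators.items() if check(trace)
        ]
        if trace.eval_flags:
            regression_set.append(trace)
    return regression_set


traces = [
    TraceRecord("t1", "Summarize the incident", "Disk filled up on node 3."),
    TraceRecord("t2", "Summarize the incident", "   "),
]
flagged = run_online_evals(traces, EVALUATORS)
print([(t.trace_id, t.eval_flags) for t in flagged])
```

An LLM-as-judge evaluator would slot into `EVALUATORS` as one more callable, which keeps cheap heuristics and model-based checks behind the same interface.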

Chapter 04

Minimum checklist for production LLMs

If your LLM-backed system is going to take real traffic, make sure these six points are covered before you expose it. Any one you skip tends to surface later as an incident, which is the expensive way to learn it.

  • Traces with OTel + GenAI semantic conventions, exported to your backend.
  • Cost metrics per feature, user, and customer, with alerts, spend caps, and rate limits.
  • PII redaction in logs — the “debug log” can turn into a breach.
  • Replay capability: ability to reconstruct any call with its prompt and context.
  • Per-feature kill-switch, not only for the whole service.
  • Runbook with typical scenarios: drift, cost spike, quality regression.
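
Two of the checklist items, spend caps and the per-feature kill-switch, can share one small guard that is consulted before each LLM call. The sketch below is a minimal in-memory illustration under stated assumptions: `FeatureGuard`, the feature names, and the cap values are hypothetical, there is no daily reset, and a production version would keep its state in shared storage.

```python
from typing import Dict, Set


class FeatureGuard:
    """Per-feature kill-switch plus a simple spend cap (in-memory sketch)."""

    def __init__(self, cap_usd: Dict[str, float]) -> None:
        self._caps = dict(cap_usd)
        self._spent = {feature: 0.0 for feature in self._caps}
        self._disabled: Set[str] = set()

    def disable(self, feature: str) -> None:
        """Manual kill-switch for one feature."""
        self._disabled.add(feature)

    def enable(self, feature: str) -> None:
        self._disabled.discard(feature)

    def record_cost(self, feature: str, usd: float) -> None:
        """Add the cost of a completed call to the feature's running total."""
        self._spent[feature] = self._spent.get(feature, 0.0) + usd

    def allow(self, feature: str) -> bool:
        """Return True only if the feature is known, enabled, and under cap."""
        if feature not in self._caps or feature in self._disabled:
            return False
        return self._spent[feature] < self._caps[feature]


guard = FeatureGuard({"summarize": 50.0, "chat": 200.0})
guard.record_cost("summarize", 50.0)   # cap reached for this feature only
summarize_allowed = guard.allow("summarize")
chat_allowed = guard.allow("chat")
guard.disable("chat")                  # operator flips the kill-switch
chat_after_kill = guard.allow("chat")
print(summarize_allowed, chat_allowed, chat_after_kill)
```

Unknown features are denied by default, so a new LLM-backed feature cannot take traffic until someone has assigned it a cap.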

Wasyra Engineering documents patterns for moving legacy systems without freezing delivery or breaking ownership.

Series

AI systems that actually reach production

A series on agents, copilots, and guardrails for bringing AI into real work without breaking trust or operations.
