LLM observability in 2026: why OpenTelemetry and evals must run together
A short technical guide for SREs and AI engineers: why LLM observability is different, what to cover with tracing and metrics, and when to add online evals without making them another silo.
- Published: March 30, 2026
- 8 min read
- Category: Engineering
Chapter 01
Why monitoring an LLM is not monitoring a microservice
A microservice can be judged on three numbers: latency, error rate, and throughput. An LLM adds three axes that matter: cost (billed per token, not per request), subjective quality (whether the output is useful), and drift (the same question may produce a different answer tomorrow).
- Token tracking per request: input, output, and total — and how it maps to USD.
- Decomposed latency: prompt processing vs decode vs tool-use overhead.
- Quality: proxy signals such as human acceptance, prompt rewrites, and session abandonment.
- Drift: output variation on the same input over time.
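As a minimal sketch of the token-to-USD mapping in the first bullet, the snippet below multiplies each token count by a per-token price. The model name `example-model` and the prices are hypothetical placeholders, not real provider rates; real pricing often varies by model, tier, and caching.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; replace with your provider's rates.
PRICES_PER_MILLION = {
    "example-model": {"input": 3.00, "output": 15.00},
}


@dataclass(frozen=True)
class TokenUsage:
    input_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


def request_cost_usd(model: str, usage: TokenUsage) -> float:
    """Cost of one request: each token count times its per-token price."""
    price = PRICES_PER_MILLION[model]
    return (
        usage.input_tokens * price["input"]
        + usage.output_tokens * price["output"]
    ) / 1_000_000


usage = TokenUsage(input_tokens=1_200, output_tokens=300)
cost = request_cost_usd("example-model", usage)
print(f"{usage.total_tokens} tokens cost ${cost:.4f}")
```

Input and output tokens are priced separately here because output tokens are commonly billed at a higher rate, which is why a per-request count alone hides most of the cost variance.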
Chapter 02
OpenTelemetry + GenAI semantic conventions: the standard baseline
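OTel's GenAI Semantic Conventions define common span attributes for LLM calls, such as the operation, the requested model, token usage, and tool invocations. Values like cost or RAG hit rate are typically derived from those attributes or recorded as custom attributes alongside them. Because the attribute names are shared, your instrumentation is portable: you can export to Datadog today and to New Relic tomorrow without rewriting it.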
- OpenLLMetry (Traceloop) extends OTel for LLMs with non-intrusive instrumentation.
- Compatible with Datadog, New Relic, Sentry, Honeycomb, and Grafana Cloud.
- LangSmith and Confident AI add a quality layer on top of tracing.
- Avoid single-vendor proprietary instrumentation — bet on OTel.
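To make the attribute vocabulary concrete, here is a small standard-library sketch that builds a span name and attribute dictionary for one LLM call. The `gen_ai.*` keys follow the GenAI semantic conventions as I understand them; check them against the convention version your SDK ships, since the conventions are still evolving. The helper `genai_span` and the `app.llm.cost_usd` key are hypothetical, chosen for illustration.

```python
from typing import Any


def genai_span(
    operation: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reason: str,
    cost_usd: float,
) -> tuple[str, dict[str, Any]]:
    """Build a span name and attributes for one LLM call."""
    # Span name follows the "{operation} {model}" pattern.
    name = f"{operation} {model}"
    attributes = {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": [finish_reason],
        # Assumption: cost lives under an application-specific key.
        "app.llm.cost_usd": cost_usd,
    }
    return name, attributes


name, attrs = genai_span("chat", "example-model", 1200, 300, "stop", 0.0081)
print(name)
for key in sorted(attrs):
    print(f"  {key} = {attrs[key]}")
```

With the OpenTelemetry Python API installed, a dictionary like this could be passed as `attributes=` to `tracer.start_as_current_span(name, ...)`. Keeping the keys in one helper means a convention rename touches a single function rather than every call site.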
Chapter 03
When to add online evals (and how not to make them a silo)
Online evaluation means running automatic evaluators over production traces to flag problematic responses and route them into a test set. It doesn't replace human review, but it lets you spot issues at scale.
- Start with tracing and metrics. Only after a stable quarter, add online evals.
- Hook evaluators into the same OTel backend — don't build a separate dashboard.
- Use LLM-as-judge only for cases where you trust its consistency.
- Auto-curate the test set: flagged samples become future regression cases.
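The loop described above can be sketched in a few lines: evaluate each trace, record the flags, and promote flagged prompts into a regression set. Everything here is an illustrative assumption: the `Trace` shape, the two rule-based checks, and the refusal markers stand in for whatever evaluators you actually trust. In a real setup the flags would be attached to the trace in your OTel backend rather than returned in a dictionary.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical markers; tune these to your own product's failure modes.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "as an ai")


@dataclass
class Trace:
    trace_id: str
    prompt: str
    response: str


def evaluate(trace: Trace) -> list[str]:
    """Return the names of the checks this trace fails (empty means pass)."""
    flags = []
    text = trace.response.strip().lower()
    if not text:
        flags.append("empty_response")
    if any(marker in text for marker in REFUSAL_MARKERS):
        flags.append("possible_refusal")
    return flags


@dataclass
class RegressionSet:
    cases: dict[str, dict] = field(default_factory=dict)

    def add(self, trace: Trace, flags: list[str]) -> bool:
        """Store a flagged prompt once; duplicates by prompt are skipped."""
        key = hashlib.sha256(trace.prompt.encode("utf-8")).hexdigest()
        if key in self.cases:
            return False
        self.cases[key] = {
            "prompt": trace.prompt,
            "flags": flags,
            "source_trace": trace.trace_id,
        }
        return True


def run_online_evals(traces: list[Trace], regression: RegressionSet) -> dict:
    results = {}
    for trace in traces:
        flags = evaluate(trace)
        results[trace.trace_id] = flags
        if flags:
            regression.add(trace, flags)
    return results


traces = [
    Trace("t1", "What is our refund policy?", "Refunds are available within 30 days."),
    Trace("t2", "Summarize this contract", ""),
    Trace("t3", "Summarize this contract", "I can't help with that."),
]
regression = RegressionSet()
results = run_online_evals(traces, regression)
print(results)
print(f"{len(regression.cases)} regression case(s) curated")
```

Deduplicating by prompt hash keeps the regression set from filling up with the same failing question; the trade-off is that only the first failure mode seen for a prompt is stored.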
Chapter 04
Minimum checklist for production LLMs
If your LLM-backed system is going to take real traffic, make sure these six points are covered before you expose it. If one is missing, you'll find out the expensive way.
- Traces with OTel + GenAI semantic conventions, exported to your backend.
- Cost metrics per feature, user, and customer, with alerts, spend caps, and rate limits.
- PII redaction in logs — the “debug log” can turn into a breach.
- Replay capability: ability to reconstruct any call with its prompt and context.
- Per-feature kill-switch, not only for the whole service.
- Runbook with typical scenarios: drift, cost spike, quality regression.
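Two of the checklist items, spend caps and the per-feature kill-switch, fit naturally into one small guard that is consulted before each LLM call. The sketch below is an in-memory illustration only: the class name `FeatureGuard`, the feature names, and the cap values are hypothetical, and a production version would need shared state, a reset window, and alerting.

```python
from collections import defaultdict


class FeatureGuard:
    """Per-feature spend cap plus manual kill-switch (in-memory sketch)."""

    def __init__(self, caps_usd: dict[str, float]):
        self._caps = dict(caps_usd)
        self._spent = defaultdict(float)
        self._disabled = set()

    def kill(self, feature: str) -> None:
        """Manually disable one feature without touching the others."""
        self._disabled.add(feature)

    def restore(self, feature: str) -> None:
        self._disabled.discard(feature)

    def record_cost(self, feature: str, cost_usd: float) -> None:
        self._spent[feature] += cost_usd

    def allows(self, feature: str) -> bool:
        """True if the feature is not killed and is still under its cap."""
        if feature in self._disabled:
            return False
        cap = self._caps.get(feature)
        return cap is None or self._spent[feature] < cap


guard = FeatureGuard({"summarize": 0.02})

decisions = []
for _ in range(3):
    guard.record_cost("summarize", 0.0081)
    decisions.append(guard.allows("summarize"))
print("summarize allowed after each call:", decisions)

guard.kill("chat")
print("chat allowed:", guard.allows("chat"))
print("search allowed:", guard.allows("search"))
```

Checking the guard per feature rather than per service means a runaway summarizer can be cut off while the rest of the product keeps serving traffic.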
Written by
Wasyra Engineering
Modernization, architecture, and reliable delivery
Wasyra Engineering documents patterns for moving legacy systems without freezing delivery or breaking ownership.
Series
AI systems that actually reach production
A series on agents, copilots, and guardrails for bringing AI into real work without breaking trust or operations.