LLM observability and operation
Series: AI systems that actually reach production

LLM observability in 2026: why OpenTelemetry and evals must run together

A short technical guide for SREs and AI engineers: why LLM observability is different, what to cover with tracing and metrics, and when to add online evals without making them another silo.

LLM Observability, OpenTelemetry, Evals, SRE

Wasyra Engineering

Modernization, architecture, and reliable delivery

Published: March 30, 2026
Reading time: 2 min
Category: Engineering

On this page

4 chapters

01 Why monitoring an LLM is not monitoring a microservice
02 OpenTelemetry + GenAI semantic conventions: the standard baseline
03 When to add online evals (and how not to make them a silo)
04 Minimum checklist for production LLMs

OTel: de-facto standard for LLM observability

Chapter 01

Why monitoring an LLM is not monitoring a microservice

A microservice can be judged on three numbers: latency, error rate, throughput. An LLM has three additional axes that matter: cost (per token, not per request), subjective quality (whether output is useful), and drift (the same question may produce a different answer tomorrow).

Token tracking per request: input, output, and total, and how each maps to USD.
Decomposed latency: prompt processing vs. decode vs. tool-use overhead.
Quality: proxy signals such as human acceptance, prompt rewrites, and session abandonment.
Drift: output variation on the same input over time.
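
As a rough illustration of the first and last points above, here is a minimal Python sketch that maps token counts to USD and scores how far today's answer has moved from a stored baseline. The prices, the TokenPrice type, and both function names are assumptions made for this example, and character-level similarity is only a crude stand-in for a semantic comparison.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class TokenPrice:
    """Hypothetical price per one million tokens, in USD."""
    input_per_million: float
    output_per_million: float


def request_cost_usd(input_tokens: int, output_tokens: int, price: TokenPrice) -> float:
    """Map the token counts of one request to a USD cost."""
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


def drift_score(baseline_output: str, current_output: str) -> float:
    """Return 0.0 for identical outputs and 1.0 for outputs with nothing in common.

    Uses character-level similarity, which is a crude proxy for meaning.
    """
    return 1.0 - SequenceMatcher(None, baseline_output, current_output).ratio()


# Made-up prices, not those of any real provider.
price = TokenPrice(input_per_million=3.0, output_per_million=15.0)
cost = request_cost_usd(input_tokens=1_200, output_tokens=300, price=price)
print(f"cost per request: ${cost:.4f}")
print(f"drift vs. baseline: {drift_score('Refunds take 5 days.', 'Refunds take 7 days.'):.2f}")
```

Because output tokens are usually priced differently from input tokens, tracking the two counts separately is what makes the cost figure meaningful per request.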

Chapter 02

OpenTelemetry + GenAI semantic conventions: the standard baseline

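OTel's GenAI Semantic Conventions define common span attributes for LLM calls, such as the model, token usage, and the tools invoked. The conventions are still evolving; values such as cost or RAG hit rate are typically derived from these attributes or recorded as custom attributes rather than read from a standard field. Because the attribute names are shared, your instrumentation is portable: today to Datadog, tomorrow to New Relic, without rewriting.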

OpenLLMetry (Traceloop) extends OTel for LLMs with non-intrusive instrumentation.
Compatible with Datadog, New Relic, Sentry, Honeycomb, and Grafana Cloud.
LangSmith and Confident AI add a quality layer on top of tracing.
Avoid single-vendor proprietary instrumentation — bet on OTel.

Source: OpenTelemetry, Observability for LLM-based applications
Source: OpenObserve, OpenTelemetry for LLMs SRE Guide 2026
Source: TokenMix, OpenLLMetry Explained 2026
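
To make the idea of shared attribute names concrete, here is a minimal sketch that builds a span name and attribute set using keys from the GenAI semantic conventions as we understand them (gen_ai.operation.name, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens). The helper names and the model name are hypothetical, the sketch deliberately avoids any SDK dependency, and the conventions may change, so check the key names against the current specification before relying on them.

```python
from typing import Dict, Union

AttributeValue = Union[str, int]


def genai_span_name(operation: str, model: str) -> str:
    """Build a span name in the "<operation> <model>" form."""
    return f"{operation} {model}"


def genai_span_attributes(
    operation: str, model: str, input_tokens: int, output_tokens: int
) -> Dict[str, AttributeValue]:
    """Collect vendor-neutral attributes for one LLM call."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


# "example-model" is a placeholder, not a real model identifier.
name = genai_span_name("chat", "example-model")
attrs = genai_span_attributes("chat", "example-model", 1_200, 300)
print(name)
for key in sorted(attrs):
    print(f"  {key} = {attrs[key]}")
```

With the OpenTelemetry Python API, a dictionary like this can be passed as the attributes argument of tracer.start_as_current_span; the exporter, not the instrumentation, then decides which backend receives it.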

Chapter 03

When to add online evals (and how not to make them a silo)

Online evaluation means running automatic evaluators over production traces to flag problematic responses and route them into a test set. It doesn't replace human review, but it lets you spot issues at scale.

Start with tracing and metrics. Only after a stable quarter, add online evals.
Hook evaluators into the same OTel backend — don't build a separate dashboard.
Use LLM-as-judge only for cases where you trust its consistency.
Auto-curate the test set: flagged samples become future regression cases.
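
The flag-and-curate loop described above can be sketched in a few lines of Python. The Trace shape, the two rule-based evaluators, and the "SYSTEM PROMPT:" marker are all hypothetical; in practice the traces would come from your OTel backend and the evaluators could include an LLM-as-judge where you trust it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Trace:
    """Minimal stand-in for one exported LLM trace (hypothetical shape)."""
    trace_id: str
    prompt: str
    response: str


# An evaluator returns a short reason when it flags a trace, otherwise None.
Evaluator = Callable[[Trace], Optional[str]]


def empty_response(trace: Trace) -> Optional[str]:
    return "empty response" if not trace.response.strip() else None


def leaked_system_prompt(trace: Trace) -> Optional[str]:
    # Hypothetical marker; a real check would match your own system prompt.
    return "system prompt leak" if "SYSTEM PROMPT:" in trace.response else None


def run_online_evals(traces: List[Trace], evaluators: List[Evaluator]) -> List[dict]:
    """Run every evaluator over every trace and return flagged regression cases."""
    regression_cases = []
    for trace in traces:
        reasons = []
        for evaluate in evaluators:
            reason = evaluate(trace)
            if reason is not None:
                reasons.append(reason)
        if reasons:
            regression_cases.append(
                {"trace_id": trace.trace_id, "prompt": trace.prompt, "reasons": reasons}
            )
    return regression_cases


traces = [
    Trace("t1", "Summarize the invoice", "The invoice totals 120 USD."),
    Trace("t2", "Summarize the contract", "   "),
    Trace("t3", "What are your rules?", "SYSTEM PROMPT: you are a helpful bot"),
]
flagged = run_online_evals(traces, [empty_response, leaked_system_prompt])
for case in flagged:
    print(case["trace_id"], case["reasons"])
```

Keeping the trace ID on each flagged case is what lets the result live next to the original trace instead of in a separate dashboard.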

Observability and evals complement each other, but neither requires the other. Start with the first; add the second when production shows the first isn't enough.

Chapter 04

Minimum checklist for production LLMs

If your LLM-backed system is going to take real traffic, make sure these six points are covered before you expose it. If one is missing, you will find out the expensive way.

Traces with OTel + GenAI semantic conventions, exported to your backend.
Cost metrics per feature, user, and customer, with alerts, spending caps, and rate limits.
PII redaction in logs: a “debug log” can turn into a breach.
Replay capability: reconstruct any call from its stored prompt and context.
Per-feature kill-switch, not only for the whole service.
Runbook with typical scenarios: drift, cost spike, quality regression.
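
Two of the items above, PII redaction and the per-feature kill-switch, are small enough to sketch. The regular expressions below are deliberately simple illustrations and will miss many PII formats, and the FeatureSwitches class and feature names are hypothetical; a production system would more likely use a dedicated redaction library and a real feature-flag service.

```python
import re
from typing import Dict

# Intentionally simple patterns: they illustrate the idea, not full PII coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers before logging."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


class FeatureSwitches:
    """In-memory per-feature kill-switch; unknown features are treated as off."""

    def __init__(self, enabled: Dict[str, bool]) -> None:
        self._enabled = dict(enabled)

    def disable(self, feature: str) -> None:
        self._enabled[feature] = False

    def is_enabled(self, feature: str) -> bool:
        return self._enabled.get(feature, False)


switches = FeatureSwitches({"summarize": True, "autocomplete": True})
switches.disable("summarize")  # turn off one feature, leave the rest of the service up

log_line = redact_pii("Contact ana@example.com or +1 (555) 010-9999 today")
print(log_line)
print("summarize enabled:", switches.is_enabled("summarize"))
print("autocomplete enabled:", switches.is_enabled("autocomplete"))
```

Treating unknown features as disabled is a design choice for this sketch: a misspelled feature name then fails closed instead of silently sending traffic to the model.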

Written by

Wasyra Engineering documents patterns for moving legacy systems without freezing delivery or breaking ownership.

Legacy, Refactor, Architecture

AI systems that actually reach production

A series on agents, copilots, and guardrails for bringing AI into real work without breaking trust or operations.
