Failure classes and playbooks.

Mesedi runs ten detectors over the telemetry your SDK ships. When a detector fires, related events cluster into a failure group, and the Tier 1 Playbook for that class renders alongside the group in the dashboard. The playbook explains what the signal means and what to try first. Each section below names the detector, what it looks at, and the rough shape of the playbook.

Crashes

An execution exited via an unhandled exception. The SDK records the exception type plus a stable hash of the top of the traceback (the crash signature), so retries of the same bug cluster into one failure group instead of paging on every attempt.

Signal: Same crash_signature appears more than once.

Playbook (Tier 1): Read the traceback in the dashboard, fix the bug, redeploy. If the same signature reappears after the deploy, the fix didn't land or didn't cover the path that fires.
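As a rough illustration of how a crash signature can stay stable across retries, the sketch below hashes the exception type together with the innermost traceback frames. The function name `crash_signature`, the choice of three frames, the reading of "top of the traceback" as the innermost frames, and the decision to skip line numbers are all assumptions for illustration; the SDK's actual hashing scheme is not documented here.

```python
import hashlib
import traceback


def crash_signature(exc: BaseException, top_frames: int = 3) -> str:
    """Hypothetical signature: exception type + innermost frames, hashed."""
    # Assumption: "top of the traceback" means the frames closest to the raise.
    frames = traceback.extract_tb(exc.__traceback__)[-top_frames:]
    parts = [type(exc).__name__]
    # File and function names only; line numbers are skipped so that
    # unrelated edits to the same file do not change the signature.
    parts += [f"{frame.filename}:{frame.name}" for frame in frames]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]
```

Two retries that hit the same bug produce the same string, which is what lets them land in one failure group.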

Time budget exceeded

An execution ran longer than the time budget configured on @wrap. The detector fires when duration_ms exceeds the budget, even if the function eventually returned successfully.

Signal: duration_ms > policy.max_wall_seconds * 1000.

Playbook (Tier 1): Either raise the budget intentionally (and document why) or cut latency. Common culprits: tool calls hitting slow upstreams, unintentional ReAct loops with no maxSteps, or a model that has been switched to a slower variant.
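The signal is a plain comparison, sketched below. `Policy` is a hypothetical stand-in for whatever object carries the budget configured on @wrap; only the field name `max_wall_seconds` and the `duration_ms` value come from the signal definition above.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    # Hypothetical container for the budget configured on @wrap.
    max_wall_seconds: float


def time_budget_exceeded(duration_ms: float, policy: Policy) -> bool:
    # Only the duration is checked; a successful return does not suppress the signal.
    return duration_ms > policy.max_wall_seconds * 1000
```

Note the strict inequality: an execution that lands exactly on the budget does not fire.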

Step count exceeded

The number of LLM calls in one execution exceeded the configured maxSteps. ReAct-style agents that can't decide when to stop look like this.

Signal: Count of llm_call events on an execution > max_steps.

Playbook (Tier 1): Tighten your stopping criterion, add a final-answer validator, or lower maxSteps. If the agent is genuinely stuck, the surrounding loop detectors will also fire.
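A minimal sketch of the count, assuming each event is a dict with a `type` field. That event shape is an assumption for illustration, not the documented wire format.

```python
def step_count_exceeded(events: list[dict], max_steps: int) -> bool:
    # Assumed event shape: {"type": "llm_call", ...}; other event types are ignored.
    llm_calls = sum(1 for event in events if event.get("type") == "llm_call")
    return llm_calls > max_steps
```

Tool calls and validator checks do not count toward the limit in this sketch; only LLM calls do, matching the signal above.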

Tool failures

A tool call inside the execution failed with an exception or returned a result the calling code marked failed. The detector clusters repeated failures of the same tool inside one execution and across executions.

Signal: tool_call event with status=failed; same (tool_name, exception_type) seen more than once.

Playbook (Tier 1): Add retry-with-backoff at the tool boundary, or have the agent re-plan when a tool fails consistently. If the upstream tool is genuinely down, the upstream needs the fix, not the agent.
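One way to add retry-with-backoff at the tool boundary is sketched below. `call_with_backoff` and its parameters are hypothetical names, not part of the SDK; the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    tool: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `tool`, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return tool()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the failure surface to the caller
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

Keeping the retry at the tool boundary means the agent sees one success or one failure per tool call, rather than spending LLM steps reacting to each transient error. In real code you would likely narrow the `except` clause to the exceptions that are worth retrying.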

Validator failures

An explicit output validator (schema check, content policy, downstream parser) rejected the final answer. Captured by emitting a validator_check event with status=failed.

Signal: validator_check event with status=failed.

Playbook (Tier 1): Strengthen the prompt or add a repair step. Repeated validator failures on the same schema usually mean the model is producing structurally-plausible-but-wrong output that only the validator catches; that's a signal to upgrade the validator and re-run.
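A repair step can be as small as one extra model call that includes the validator's complaint. The sketch below assumes a JSON answer with required keys; `answer_with_repair` and the `generate` callable are hypothetical, and emitting the validator_check event is left out because the SDK call for that is not shown in this section.

```python
import json
from typing import Callable


def answer_with_repair(
    generate: Callable[[str], str],
    prompt: str,
    required_keys: set[str],
    max_repairs: int = 1,
) -> dict:
    """Validate the model's JSON answer; on failure, ask once more with the reason."""
    problem = "no attempt made"
    current_prompt = prompt
    for _ in range(max_repairs + 1):
        raw = generate(current_prompt)
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            missing = required_keys - set(parsed)
            if not missing:
                return parsed
            problem = f"missing keys: {sorted(missing)}"
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            problem = f"unparseable output: {err}"
        # Feed the validator's reason back so the model can correct itself.
        current_prompt = (
            f"{prompt}\n\nYour previous answer was rejected ({problem}). "
            "Return corrected JSON only."
        )
    raise ValueError(f"validator still failing after {max_repairs} repair(s): {problem}")
```

Capping the number of repairs matters: an unbounded repair loop turns a validator failure into a step-count or cost problem.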

Prompt injection

User-supplied content appears to have overridden system instructions. The detector looks for known injection patterns (ignore previous instructions, system: ..., role-confusion markers) plus a heuristic that flags model outputs that explicitly reference the injected directive.

Signal: llm_call event where the user content matches an injection pattern, or the model output explicitly cites instructions absent from the system prompt.

Playbook (Tier 1): Strip or sanitize the injecting content, harden the system prompt with explicit precedence rules, and add a validator that rejects answers referencing instructions not in the system prompt.
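The pattern half of the signal can be approximated with a few regular expressions. The list below is illustrative only: the three patterns are assumptions modeled on the examples named above, not Mesedi's actual pattern set, and the output-side heuristic is not covered.

```python
import re

# Hypothetical patterns modeled on the examples in this section.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(the\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"^\s*system\s*:", re.IGNORECASE | re.MULTILINE),
    re.compile(r"you\s+are\s+now\s+(the\s+)?(system|developer)", re.IGNORECASE),
]


def matches_injection_pattern(user_content: str) -> bool:
    """Return True if any known injection pattern appears in the user content."""
    return any(pattern.search(user_content) for pattern in INJECTION_PATTERNS)
```

Pattern matching like this is easy to evade with paraphrase, which is why the playbook pairs it with system-prompt hardening and an output validator rather than relying on it alone.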

Cost velocity

Dollar spend per execution is rising over time, even when the workload looks the same. Typical causes are a model upgrade, a prompt that grew silently, or a retry storm.

Signal: Rolling mean cost per execution rose by more than the configured threshold relative to the prior window.

Playbook (Tier 1): Diff the recent prompt against the previous one, check the model id (auto-upgrades from one provider tier to another happen quietly), and verify retries aren't fanning out invisibly.
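A sketch of the window comparison, under the assumption that the threshold is a relative increase (0.25 meaning 25%) and that windows are counted in executions. Both choices, along with the function and parameter names, are illustrative and may not match how the detector is configured.

```python
from statistics import mean


def cost_velocity_exceeded(costs: list[float], window: int, max_increase: float) -> bool:
    """Compare mean cost of the latest `window` executions with the window before it."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(costs) < 2 * window:
        return False  # not enough history to form both windows
    prior = mean(costs[-2 * window:-window])
    current = mean(costs[-window:])
    if prior == 0:
        return current > 0  # any spend after a free window counts as an increase
    return (current - prior) / prior > max_increase
```

For example, five executions at $0.10 followed by five at $0.15 is a 50% rise, which trips a 25% threshold.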

Drift

The mix of models or the texture of prompts shifted over time. Mesedi tracks model-mix shifts (model A used to be 90%, now 60%) and lexical drift (character-3-gram cosine distance) on prompts and outputs.

Signal: Model-mix entropy or 3-gram cosine distance from the baseline window exceeds threshold.

Playbook (Tier 1): Was the change intentional? If yes, snapshot a new baseline. If no, find the upstream change (provider deprecation, prompt-template tweak, A/B test rollout) and decide whether to revert.
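Character-3-gram cosine distance can be computed with nothing beyond the standard library. The sketch below compares two strings; how Mesedi aggregates prompts into a baseline window is not specified here, so treating each side as a single string is an assumption.

```python
import math
from collections import Counter


def char_trigrams(text: str) -> Counter:
    """Count every overlapping 3-character substring."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; 0.0 means identical direction, 1.0 means no overlap."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm
```

Character 3-grams pick up changes in wording and formatting without needing a tokenizer, which makes the measure cheap to run on every prompt and output.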

Identical-call loop

The same LLM call (identical model, identical messages) repeated within one execution. A clean indicator of an agent stuck in a no-progress cycle.

Signal: Same (model, hash(messages)) tuple appears more than N times in one execution.

Playbook (Tier 1): Add a no-progress check or a stop-after-K-identical-calls heuristic. The hard-halt mechanism with a tight step count is the bluntest version of this.
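The signal above can be sketched as a counter over (model, hash(messages)) keys. Serializing messages as canonical JSON before hashing is an assumption about how "identical messages" is decided; the function names are hypothetical.

```python
import hashlib
import json
from collections import Counter


def call_key(model: str, messages: list[dict]) -> tuple[str, str]:
    # Canonical JSON (sorted keys, fixed separators) so key order does not change the hash.
    canonical = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return model, hashlib.sha256(canonical.encode()).hexdigest()


def identical_call_loop(calls: list[tuple[str, list[dict]]], n: int) -> bool:
    """True if any (model, messages) pair appears more than n times in one execution."""
    counts = Counter(call_key(model, messages) for model, messages in calls)
    return any(count > n for count in counts.values())
```

The same counter works as an in-agent guard: check it before each LLM call and stop once a key has been seen K times.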

Similar-call loop

Like identical-call, but the model or messages differ slightly (paraphrase, swapped tool name). Catches looping agents that flutter just enough to evade the identical-call detector.

Signal: Pairs of llm_call events with high lexical similarity (cosine > threshold) within one execution.

Playbook (Tier 1): Same as identical-call: add a no-progress check, log what the agent is trying to vary, and consider whether the prompt needs an explicit don't-repeat rule.
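A sketch of the pairwise check follows. It assumes character 3-gram vectors as the lexical features and a threshold of 0.8; the section does not say which features or threshold the detector uses, so both are placeholders.

```python
import math
from collections import Counter
from itertools import combinations


def _trigram_vector(text: str) -> Counter:
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def _cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return 0.0 if norm == 0 else dot / norm


def similar_call_pairs(prompts: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    """Return index pairs of prompts in one execution whose similarity exceeds threshold."""
    vectors = [_trigram_vector(p) for p in prompts]
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(vectors), 2)
        if _cosine_similarity(a, b) > threshold
    ]
```

Comparing every pair is quadratic in the number of calls, which is acceptable for a single execution bounded by a step limit but would need bucketing at larger scale.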

How playbooks render

When a failure group opens, the dashboard renders the matching playbook in the same view. Playbook content lives as Markdown under backend/internal/playbooks/content/<class>/ in the monorepo. Playbook lookup is (class, signature) with fallback to a per-class default, so very common signatures can eventually get their own targeted playbook without touching the default.
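The lookup-with-fallback rule can be illustrated as follows. This is a Python sketch of the logic only, not the backend's code; using `None` as the key for the per-class default and the example file names are assumptions.

```python
from typing import Optional


def resolve_playbook(
    playbooks: dict[tuple[str, Optional[str]], str],
    failure_class: str,
    signature: str,
) -> Optional[str]:
    """Prefer a playbook targeted at (class, signature); otherwise use the class default."""
    specific = playbooks.get((failure_class, signature))
    if specific is not None:
        return specific
    # Assumption: the per-class default is stored under a None signature.
    return playbooks.get((failure_class, None))


# Hypothetical registry: one targeted playbook plus the default for the class.
PLAYBOOKS = {
    ("crash", None): "crash/default.md",
    ("crash", "a1b2c3"): "crash/a1b2c3.md",
}
```

With this registry, a crash group with signature `a1b2c3` gets the targeted file, and any other crash signature falls back to the class default.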

Tier 1 is the recommendation surface that ships today. Tier 2 (suggested diff) and Tier 3 (auto-fix) are on the v2 roadmap.

What's next?

Self-hosting guide covers running the Go backend behind your own infrastructure if you'd rather not depend on the hosted Cloud version.

HTTP API reference covers the wire format detectors read from.