Glossary

The AI agent observability glossary

Definitions for the terms teams use when they build, debug, and keep improving production AI agents — from continuous learning to behavioral failure to self-improving agents. Written by the team building Moda.

A

Agent laziness

A behavioral failure where the agent declines, hedges, or gives a generic answer to a task it is capable of completing.

Agent laziness is a class of behavioral failure in AI agents where the model returns 'I cannot help with that', refuses an in-scope request, or substitutes a high-level non-answer when the user asked for specifics. It usually traces back to overly cautious system prompts, safety filters tuned too aggressively, or instruction-following regressions after a model upgrade. Moda detects laziness by comparing the user's stated goal to the agent's trajectory and flagging mismatches at the conversation level.

Related: Behavioral failure, Goal drift, Frustration root cause

AI agent observability

Also known as: Agent observability, Agentic observability

Monitoring AI agents in production by analyzing full conversations, not just individual API calls.

AI agent observability is the practice of understanding what production AI agents are doing, why they succeed or fail, and how users are experiencing them. Traditional APM and LLM tracing tools record what an agent did at the call or token level. Agent observability operates at the conversation level: it clusters user intents, detects behavioral failures (tool misuse, context loss, hallucinations), and surfaces frustration with root causes. Moda is purpose-built for this layer.

Related: LLM observability, Behavioral failure, Intent discovery

B

Behavioral failure

A failure mode that occurs at the conversation level, not at the API call level — invisible to traces alone.

A behavioral failure is when an AI agent does the wrong thing while every individual API call returns 200 OK. Examples include calling the right tool with subtly wrong arguments (tool misuse), forgetting earlier turns (context loss), refusing in-scope tasks (agent laziness), fabricating facts (hallucinations), retrying the same failed action (reasoning loops), and silently abandoning the user's objective (goal drift). Detecting behavioral failures requires comparing trajectories and outcomes, not just status codes.

Related: Tool misuse, Context loss, Agent laziness, Hallucination, Reasoning loops, Goal drift

C

Cluster hierarchy

Also known as: Cluster Hierarchy V2, Hierarchical clustering

Three-level segment clustering — Category → Subcategory → Cluster — built with HDBSCAN over Qwen3 embeddings.

Moda's cluster hierarchy groups conversation segments into a three-level taxonomy: broad categories, mid-level subcategories, and tight clusters. The hierarchy is built with HDBSCAN over 4096-D Qwen3-Embedding-8B embeddings, with UMAP projection for visualization. New segments are assigned online via kNN so the taxonomy stays current without rerunning the full clustering job. Cluster labels are generated with TF-IDF plus Claude Haiku 4.5.
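
The online kNN assignment step can be sketched roughly as follows. This is a minimal illustration, not Moda's implementation: it assumes segments are already embedded, ranks neighbors by cosine similarity, and takes a majority vote over the k nearest labeled neighbors. The function names (`cosine_similarity`, `assign_cluster`), the toy 3-D vectors, and the cluster labels are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_cluster(new_embedding, labeled_embeddings, k=3):
    """Assign a new segment to the majority cluster among its k nearest neighbors.

    labeled_embeddings is a list of (embedding, cluster_label) pairs from an
    earlier full clustering run (hypothetical data layout).
    """
    if not labeled_embeddings:
        raise ValueError("no labeled embeddings to compare against")
    ranked = sorted(
        labeled_embeddings,
        key=lambda pair: cosine_similarity(new_embedding, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 3-D vectors stand in for real 4096-D embeddings.
existing = [
    ([1.0, 0.0, 0.0], "billing/refunds"),
    ([0.9, 0.1, 0.0], "billing/refunds"),
    ([0.0, 1.0, 0.0], "account/password-reset"),
    ([0.1, 0.9, 0.0], "account/password-reset"),
]
assigned = assign_cluster([0.8, 0.2, 0.0], existing, k=3)
```

A brute-force scan like this is only reasonable for small examples; at production volume an approximate nearest-neighbor index would likely replace the sort.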

Related: Intent discovery, Segment, Conversation analytics

Context loss

A behavioral failure where the agent forgets earlier turns, contradicts itself, or asks the user to re-supply known information.

Context loss happens when an AI agent loses track of information from earlier in the conversation — forgetting user preferences, repeating questions already answered, or contradicting prior statements. It is common in long sessions, multi-step workflows, and conversations that exceed the model's effective attention window. Context loss is a behavioral failure: every individual call may succeed, but the conversation as a whole degrades.

Related: Behavioral failure, Frustration root cause

Continuous learning

Also known as: Continual learning, Learning loop

An AI system that improves over time by feeding production signal — user intent, failures, corrections — back into the next iteration of the agent.

Continuous learning (sometimes called continual learning) is the practice of closing the loop between what an agent does in production and how the agent is built. In an agent context it means: ingest every production conversation, mine it for what users actually want and where the agent fails, then route those signals into evals, prompts, fine-tuning, tool definitions, and routing rules. Moda is built as the discovery and debug half of that loop — production conversations in, continually improving agents out.

Related: Self-improving agent, Intent discovery, Behavioral failure, Moda

Conversation analytics

Also known as: LLM conversation analytics, Conversation intelligence

Analytics that treat full agent conversations — not individual calls — as the unit of analysis.

Conversation analytics for AI agents measures population-level behavior across every interaction: what users are trying to do, how often they succeed, where they get stuck, and which intents correlate with churn or escalation. Unlike per-call telemetry, conversation analytics requires segmenting, clustering, and labeling natural-language interactions. Moda automates this with hierarchical clustering of conversation segments.

Related: Intent discovery, AI agent observability, Cluster hierarchy

F

Frustration root cause

Also known as: Frustration detection

For every detected frustration event, the trigger turn, leading trajectory, affected user goal, and counterfactual for what the agent should have done.

Frustration root cause goes beyond sentiment scoring. When Moda detects a frustration event in a conversation, we attach four things: the trigger turn that flipped sentiment, the trajectory leading up to it, the user goal that was at risk, and a counterfactual — what the agent should have done instead. The counterfactual is generated by a rationalizing language model (RLM) pass over the full conversation. The result is a debuggable signal, not a number on a chart.
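
One way to picture the four attached fields is as a small record. The sketch below is illustrative only: the `FrustrationRootCause` class, the `find_trigger_turn` helper, the per-turn sentiment scores, and the zero threshold are assumptions made for the example, and the counterfactual is left as a placeholder because it comes from a separate model pass.

```python
from dataclasses import dataclass


@dataclass
class FrustrationRootCause:
    trigger_turn: int    # index of the turn where sentiment flipped
    trajectory: list     # turns leading up to the trigger
    user_goal: str       # the goal that was at risk
    counterfactual: str  # what the agent should have done instead


def find_trigger_turn(sentiments, threshold=0.0):
    """Return the first index where sentiment falls below the threshold
    right after a turn at or above it, or None if that never happens."""
    for i in range(1, len(sentiments)):
        if sentiments[i - 1] >= threshold and sentiments[i] < threshold:
            return i
    return None


turns = [
    "user: I need to change my flight",
    "agent: Here is our baggage policy",
    "user: That is not what I asked",
    "agent: Would you like to hear about seat upgrades?",
]
# Hypothetical running estimate of user sentiment after each turn.
sentiments = [0.4, 0.2, -0.6, -0.8]

trigger = find_trigger_turn(sentiments)
event = FrustrationRootCause(
    trigger_turn=trigger,
    trajectory=turns[:trigger],
    user_goal="change flight",
    counterfactual="<generated by a separate model pass>",
)
```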

Related: Behavioral failure, AI agent observability

G

Goal drift

A behavioral failure where the agent silently abandons the user's original objective and solves a different problem.

Goal drift is when an AI agent quietly changes the problem it is solving partway through a conversation. The user asks for X, the agent helps with related-but-different Y, and the original objective is never met. Goal drift is detected by comparing initial user intent to final outcome at the conversation level — something traces of individual calls cannot see.

Related: Behavioral failure, Intent discovery

H

Hallucination

When the agent fabricates facts, invents tool calls, or claims to have performed actions it did not perform.

In an AI agent context, a hallucination is any output that is presented as factual but is not grounded in real data, real tool calls, or the conversation history. Common forms include inventing API parameters, citing non-existent tools, fabricating customer records, and claiming completion of actions the agent never took. Moda treats hallucinations as a category of behavioral failure and surfaces them when the agent's claimed actions diverge from its actual tool calls.
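
The divergence check between claimed actions and actual tool calls can be sketched as a simple set comparison. This assumes the claimed actions have already been extracted from the agent's text, which is the hard part and is not shown; the `unbacked_claims` function and the action names are hypothetical.

```python
def unbacked_claims(claimed_actions, tool_calls):
    """Return claimed actions that have no matching tool call, in order."""
    executed = {call["name"] for call in tool_calls}
    return [action for action in claimed_actions if action not in executed]


# The agent said it issued a refund and sent an email,
# but the recorded tool calls only show a lookup and an email.
claimed = ["issue_refund", "send_email"]
tool_calls = [
    {"name": "lookup_order", "args": {"order_id": "A1"}},
    {"name": "send_email", "args": {"to": "user@example.com"}},
]
suspect = unbacked_claims(claimed, tool_calls)
```

Matching on tool name alone is deliberately coarse; a stricter version might also compare arguments.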

Related: Behavioral failure, Tool misuse

I

Intent discovery

Also known as: Intent clustering, Intent taxonomy

Hierarchical clustering of production conversations into a three-level taxonomy with no manual tagging.

Intent discovery is the process of figuring out what users are actually trying to do with your agent, without anyone hand-labeling conversations. Moda's intent discovery embeds every segment with Qwen3-Embedding-8B (4096-D), clusters them with HDBSCAN, projects with UMAP, and labels clusters with TF-IDF plus Claude Haiku 4.5. The output is a Category → Subcategory → Cluster taxonomy of real user goals.
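
The TF-IDF half of the labeling step can be illustrated with a small stdlib-only sketch. It treats each cluster as one document and picks the terms that are frequent inside the cluster but rare across clusters. This is an assumption about how such a step might look, not Moda's pipeline; the `tokenize` and `top_terms` helpers and the sample segments are hypothetical, and the language model pass that turns terms into a readable label is omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(clusters, n=2):
    """Map each cluster id to its top-n terms by TF-IDF.

    clusters maps cluster id -> list of segment texts; each cluster is
    treated as a single document.
    """
    term_counts = {
        cid: Counter(tok for seg in segs for tok in tokenize(seg))
        for cid, segs in clusters.items()
    }
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(clusters)

    result = {}
    for cid, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by score descending, then alphabetically to break ties.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[cid] = ranked[:n]
    return result


clusters = {
    0: ["refund my order", "refund for my damaged order"],
    1: ["reset my password", "password reset link expired"],
}
labels = top_terms(clusters)
```

Terms shared by every cluster (here, "my") score zero and drop out, which is the point of the inverse document frequency factor.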

Related: Cluster hierarchy, Segment, Conversation analytics

L

LLM observability

Monitoring large language model applications — prompts, completions, tokens, latency, and errors.

LLM observability is the call- and trace-level layer of monitoring for LLM applications: prompts, completions, token usage, latency, retries, and errors. Tools like LangSmith and Langfuse own this layer. Agent observability sits on top: it clusters conversations to surface what users want and detects behavioral failures that LLM observability alone cannot see. Most production teams run both.

Related: AI agent observability, LLM tracing

LLM tracing

Recording the step-by-step execution of an LLM application — every prompt, tool call, and response — for debugging.

LLM tracing captures the chronological steps of a single LLM-powered request: the prompts sent, the tool calls made, the intermediate reasoning, and the final response. Tracing is essential for debugging individual interactions, but it sees only one conversation at a time. To answer "what are users doing across thousands of conversations?" or "which behavioral failure is most common?", tracing data has to be aggregated and clustered, which is the job of agent analytics.

Related: LLM observability, AI agent observability

M

MCP server

Also known as: Model Context Protocol server

A server that exposes tools and resources to AI coding assistants via the Model Context Protocol.

An MCP (Model Context Protocol) server is a process that exposes a set of tools and resources to an MCP-aware client like Claude Code, Cursor, or Windsurf. Moda ships an MCP server for AI agent debugging: from inside your IDE you can pull failing conversations, inspect tool call failures, and ask Moda's data agent natural-language questions over your production conversation data.

Related: AI agent observability, Moda

Moda

Moda is a product analytics and observability platform for AI agents. It analyzes every production conversation to surface what users want, where agents fail, and why users get frustrated.

Moda is the continuous learning loop for AI agents. We ingest every production conversation through an OpenTelemetry-native pipeline, automatically cluster them into a hierarchy of user intents, and detect behavioral failures — tool misuse, context loss, agent laziness, hallucinations, reasoning loops, and goal drift — that traces alone do not catch. Every frustration event ships with root cause and an agent counterfactual. Moda is backed by Y Combinator and is provider-agnostic across Anthropic, OpenAI, Google, OpenRouter, and any OTEL-compatible LLM stack.

Related: Continuous learning, AI agent observability, Intent discovery, Behavioral failure

O

OpenLLMetry

An open-source SDK that emits OpenTelemetry-compatible traces for LLM and agent applications.

OpenLLMetry is an open-source instrumentation library that emits OpenTelemetry traces for LLM applications across major providers (OpenAI, Anthropic, Google, Cohere, Mistral, AWS Bedrock, Azure OpenAI) and agent frameworks (LangChain, LlamaIndex, Mastra). Moda ingests OpenLLMetry traces directly — typically three lines of SDK code and a base URL change get you streaming conversations into Moda.

Related: OpenTelemetry, Zero-config ingest

OpenTelemetry

Also known as: OTEL, OTLP

The industry-standard open-source framework for emitting and collecting telemetry — traces, metrics, and logs.

OpenTelemetry (OTEL) is a CNCF project that defines a vendor-neutral wire format (OTLP) and SDKs for traces, metrics, and logs. Moda ingests AI agent telemetry via OTLP/HTTP, which means any application already instrumented with OpenTelemetry — directly or through OpenLLMetry — can stream production conversations into Moda without bespoke integration work.

Related: OpenLLMetry, Zero-config ingest

R

Reasoning loops

A behavioral failure where the agent retries the same failed action, oscillates between answers, or fails to converge.

Reasoning loops occur when an AI agent gets stuck — repeating the same tool call after it has already failed, flipping between two competing answers, or running a chain-of-thought that never terminates. The user experience is delay, wasted tokens, and eventual abandonment. Moda detects reasoning loops by analyzing per-turn action sequences and surfacing cycles.
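
A cycle in an action sequence can be found with a short scan for a block of actions that repeats back to back. The sketch below is one possible approach under assumed parameters, not a description of Moda's detector; `find_cycle`, `min_repeats`, and `max_period` are hypothetical names and defaults.

```python
def find_cycle(actions, min_repeats=3, max_period=4):
    """Return (start, period) for the first block of `period` actions that
    repeats at least `min_repeats` times consecutively, or None."""
    n = len(actions)
    for start in range(n):
        for period in range(1, max_period + 1):
            block = actions[start:start + period]
            repeats = 1
            pos = start + period
            while pos + period <= n and actions[pos:pos + period] == block:
                repeats += 1
                pos += period
            if repeats >= min_repeats:
                return start, period
    return None


# Same failed tool call retried three times in a row.
retry_loop = [
    ("search", "q=a"),
    ("fetch", "id=1"),
    ("fetch", "id=1"),
    ("fetch", "id=1"),
    ("answer", ""),
]
# Agent flipping between two competing answers.
oscillation = ["a", "b", "a", "b", "a", "b"]

retry_result = find_cycle(retry_loop)
oscillation_result = find_cycle(oscillation)
```

A period of 1 corresponds to retrying the same action, and a period of 2 corresponds to oscillating between two actions.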

Related: Behavioral failure, Tool misuse

S

Segment

A topic-coherent slice of a conversation, identified by cosine drift between consecutive turn-pair embeddings.

A segment is the unit of clustering in Moda. We embed turn pairs with Qwen3-Embedding-8B and walk the conversation looking for cosine drift between consecutive embeddings; when the drift crosses a threshold, we cut a segment boundary. Clustering segments, rather than whole conversations, keeps clusters topically tight and lets a single multi-topic conversation contribute to multiple intent clusters.
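
The boundary walk can be sketched in a few lines. This is a minimal illustration that assumes drift is measured as cosine distance (one minus cosine similarity) between consecutive turn-pair embeddings; the 0.5 threshold, the function names, and the toy 2-D vectors are assumptions, since the actual threshold is not stated here.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def segment_starts(embeddings, threshold=0.5):
    """Return the indices at which a new segment begins."""
    if not embeddings:
        return []
    starts = [0]
    for i in range(1, len(embeddings)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            starts.append(i)
    return starts


# Toy 2-D vectors: two turn pairs on one topic, then two on another.
turn_pair_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
starts = segment_starts(turn_pair_embeddings)
```

Here the conversation splits into two segments, one starting at index 0 and one at index 2, so it could contribute to two different intent clusters.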

Related: Cluster hierarchy, Intent discovery

Self-improving agent

Also known as: Self-improving AI agent

An AI agent that uses production signal — user intent, failures, corrections, frustration — to make itself better over successive iterations.

A self-improving agent is one that does not stop changing after deployment. Each production interaction becomes data: which intents are underserved, which tools misfire, which prompts trigger frustration, which corrections the user actually accepts. That signal is fed back into evals, prompts, tool definitions, retrieval, fine-tuning, or routing rules so the next version of the agent is materially better than the last. Moda provides the discovery and debug half of that loop — the part that figures out, automatically, what to feed back.

Related: Continuous learning, Intent discovery, Behavioral failure, Moda

T

Tool call failure

An explicit failure of a tool call — timeout, error response, or invalid arguments — surfaced with error subtypes and trends.

A tool call failure is the surface-level cousin of tool misuse: the tool itself errored, timed out, or returned an invalid response. Moda automatically flags these across conversations, groups them by error subtype, plots hit-rate trends, and lets you drill into the affected conversations. Unlike tool misuse, tool call failures are visible to standard tracing — but they are not always visible across thousands of conversations at once.
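
Grouping failures by subtype and tracking a rate over time can be sketched with two counters. The record layout (`day`, `tool`, `error`), the subtype names, and the choice of failures divided by total calls per day as the trend metric are assumptions for illustration only.

```python
from collections import Counter


def failure_summary(tool_calls):
    """Return (failures per error subtype, failure rate per day).

    Each tool call is a dict with 'day', 'tool', and 'error' keys, where
    'error' is None for a successful call (hypothetical layout).
    """
    by_subtype = Counter(
        call["error"] for call in tool_calls if call["error"] is not None
    )
    totals = Counter()
    failures = Counter()
    for call in tool_calls:
        totals[call["day"]] += 1
        if call["error"] is not None:
            failures[call["day"]] += 1
    rate_by_day = {day: failures[day] / totals[day] for day in sorted(totals)}
    return by_subtype, rate_by_day


calls = [
    {"day": "day-1", "tool": "search", "error": None},
    {"day": "day-1", "tool": "search", "error": "timeout"},
    {"day": "day-2", "tool": "search", "error": "timeout"},
    {"day": "day-2", "tool": "refund", "error": "invalid_arguments"},
    {"day": "day-2", "tool": "refund", "error": None},
    {"day": "day-2", "tool": "search", "error": None},
]
by_subtype, rate_by_day = failure_summary(calls)
```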

Related: Tool misuse, Behavioral failure

Tool misuse

A behavioral failure where the agent calls the right tool with wrong arguments — or the wrong tool — while still returning a plausible response.

Tool misuse covers all the ways an agent can break a workflow without anything raising an error: passing subtly wrong arguments, choosing the wrong tool from a similar set, or producing a confident response that is not actually grounded in the tool's output. Detecting tool misuse requires understanding tool semantics — what the tool was supposed to do — not just whether it returned 200.

Related: Behavioral failure, Tool call failure, Hallucination

Z

Zero-config ingest

Three lines of SDK code plus OpenTelemetry-native intake — works with any LLM provider, no lock-in.

Zero-config ingest is Moda's onboarding promise: drop in OpenLLMetry, point at moda.dev, and ship. Conversations stream into your workspace within seconds. There is no SDK lock-in — you can also POST raw JSON to /v1/ingest or /v1/ingest/multi if you prefer not to use OpenTelemetry. Hierarchical clustering takes 10–60 minutes to build the first time, depending on volume.

Related: OpenTelemetry, OpenLLMetry
