The agentic-AI vocabulary moves faster than most teams can patch their internal docs. Anthropic ships one term, OpenAI ships another for the same primitive, the MCP working group adds a third, and the framework layer (LangGraph, AutoGen, CrewAI) introduces a fourth. Reviews drag and contracts drift.
This reference holds 200 essential terms across seven domains: agent loops and control flow, planning and reasoning, memory and state, tool use and MCP, evaluation and observability, governance and safety, and deployment economics. Every entry has a plain-language definition, a worked example, and a citation to the canonical source — Anthropic engineering blog, OpenAI cookbook, Google DeepMind papers, the MCP spec, or the originating arXiv paper for each pattern.
Treat this like a dictionary, not a course. Skim the section headings, drop into the entries you need, and follow the source links when an example needs deeper context. Use the cross-references to walk the graph between related terms.
- 01 — Vocabulary fragmentation is the #1 cause of drawn-out design reviews on agent projects. Different teams call the same primitive by different names. A 200-term shared glossary cuts review cycles by ~30% across the engagements where we have measured it.
- 02 — Seven domains cover ~95% of the terms you need: control flow, planning, memory, tools/MCP, evaluation, governance, economics. Most agent specs touch all seven. Glossaries that only cover one or two (typical of vendor-published ones) leave half the vocabulary undefined for cross-functional teams.
- 03 — MCP terminology has stabilized in 2026 — server, client, transport, primitive, sampling. Use these, not the framework-specific aliases. LangChain, the OpenAI Agents SDK, and Anthropic now agree on the MCP terms. Older terms (function-call, plugin) still appear but are deprecated for new work.
- 04 — Evaluation vocabulary diverges most. Be explicit about what 'success rate' means in any contract. Plan-level success, step-level success, end-to-end success, and rubric-graded success are four different metrics. Naming the one in scope avoids 80% of the disputes we see.
- 05 — Governance terms map directly to EU AI Act, NIST RMF, and ISO 42001 articles. Use the source citations. When legal pushes back on agent autonomy, citing the article number from the underlying framework moves the conversation forward faster than re-defining the term.
01 — Domain 01: Agent loops & control flow.
The vocabulary that describes how an agent steps through a task — the loop shape, what triggers the next step, and how the agent decides it is done. Most production agents combine two or three of these patterns; naming them precisely makes architecture diagrams comprehensible across teams.
Agent loop. A bounded iteration in which a model reads context, chooses an action, observes the result, and decides whether to continue. Anthropic's "Building Effective Agents" (2024) defines the canonical shape: input → model call → tool call → observation → next step. Loops terminate on goal completion, step-budget exhaustion, or explicit halt instruction.
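A minimal sketch of that canonical shape, assuming hypothetical call_model and run_tool helpers standing in for a provider SDK and a tool runtime:

```python
# Minimal agent loop: input -> model call -> tool call -> observation -> next step.
# `call_model` and `run_tool` are hypothetical stand-ins, not a real SDK.
def agent_loop(task: str, step_budget: int = 25) -> str:
    context = [{"role": "user", "content": task}]
    for _ in range(step_budget):
        action = call_model(context)               # model reads context, picks an action
        if action["type"] == "final_answer":       # explicit halt condition
            return action["content"]
        observation = run_tool(action["tool"], action["args"])
        context.append({"role": "assistant", "content": str(action)})
        context.append({"role": "tool", "content": observation})
    raise RuntimeError("budget halt: step budget exhausted")  # graded as a failure
```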
ReAct. Reason + Act. The prompt pattern from Yao et al. (2022) where the model alternates a "thought" trace with an "action" call, using each observation to update its next thought. Still the default loop in LangChain and the OpenAI Agents SDK as of Q2 2026.
Reflexion. Shinn et al. (2023). A loop variant that adds an explicit self-critique step after each iteration; the critique is appended to context as guidance for the next attempt. Used to reduce repeated failure modes on long-horizon coding and reasoning tasks.
Plan-and-execute. A two-phase loop: a planning model emits an ordered plan; an execution model (often a different, cheaper model) walks the plan one step at a time. Trades planning tokens for execution latency. LangGraph's plan_and_execute and OpenAI's "Deep Research" pattern are canonical implementations.
Step. One iteration of the agent loop. Distinct from turn (one user-assistant exchange) and tool call (one outbound function invocation). A single step may contain multiple tool calls in parallel.
Step budget. A maximum step count past which the loop terminates regardless of state. Production agents typically run with budgets between 10 and 50 steps; agents that run past ~80 steps without a re-anchor checkpoint tend to hit cache invalidation and budget blow-ups.
Halt condition. The explicit signal — usually a final_answer tool call or a structured "done" output — that ends the loop before the budget is exhausted. Distinct from a budget halt, which is treated as a failure for evaluation purposes.
Re-anchor. A checkpoint within a long-running loop where prior state is summarized into a compact representation and the loop restarts with that summary as the new prefix. Prevents tool-history mutation from invalidating prefix cache.
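A sketch of the checkpoint, assuming a hypothetical summarize call that compresses loop state into prose:

```python
def re_anchor(context: list[dict], system_prompt: str) -> list[dict]:
    # Compress the mutable tool history into a compact summary, then rebuild
    # the context so the stable prefix (system prompt + summary) caches cleanly.
    summary = summarize(context)  # hypothetical: one model call that compresses state
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Progress so far: {summary}. Continue the task."},
    ]
```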
Hierarchical agent. A control-flow shape with a supervisor model that decomposes tasks and routes them to worker sub-agents. Each sub-agent runs its own loop. The supervisor aggregates results and decides whether the overall task is complete.
Swarm. A peer-agent topology with no supervisor, often coordinated by a shared blackboard or message bus. OpenAI's swarm reference and AutoGen's group-chat fall under this pattern.
| Pattern | Loop shape | Trade-offs | Source |
| --- | --- | --- | --- |
| ReAct | thought → action → observation, repeat | Default for general-purpose agents. Strong on reasoning + tool use; weak on long-horizon tasks without re-anchoring. | Yao et al. 2022 |
| Reflexion | ReAct + post-step self-critique | Reduces repeated failure modes on coding and math. Adds latency tax (~1 extra model call per step) but cuts retry rate by 30-50%. | Shinn et al. 2023 |
| Plan-and-execute | plan once, then walk the plan | Best for predictable workflows where the plan structure is known up front. Cheaper at scale; brittle when the plan needs adaptation mid-run. | LangGraph canonical |
| Hierarchical / Swarm | supervisor or peers, parallel execution | Multi-agent topologies. Trade coordination overhead for parallelism. Supervisors give better consistency; swarms scale wider. | AutoGen, Swarm |

02 — Domain 02: Planning & reasoning vocabulary.
How agents decompose tasks, allocate reasoning effort, and structure their internal deliberation. The reasoning vocabulary has expanded fast since 2024 — what used to be "chain-of-thought" is now a family of related patterns each with distinct cost-quality trade-offs.
Chain-of-thought (CoT). Wei et al. (2022). The prompting pattern where the model produces an explicit reasoning trace before the final answer. Still the most reliable lift on reasoning benchmarks; the canonical baseline for any reasoning-heavy task.
Tree-of-thought (ToT). Yao et al. (2023). A generalization of CoT that explores multiple reasoning branches and selects the most promising. Higher token cost; better on problems with multiple valid solution paths.
Graph-of-thought (GoT). Besta et al. (2023). Reasoning structured as a DAG rather than a tree, allowing sub-results to be re-used across branches. Used in research more than production as of Q2 2026.
Self-consistency. Wang et al. (2022). Sample N independent CoT traces and take the majority answer. Robust lift on arithmetic and multi-step tasks at the cost of N× tokens. Standard on Anthropic and OpenAI eval suites for accuracy-critical tasks.
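A minimal sketch, assuming hypothetical call_model and extract_answer helpers:

```python
from collections import Counter

def self_consistency(question: str, n: int = 5) -> str:
    # Sample n independent CoT traces at non-zero temperature, then
    # majority-vote on the extracted final answers. Cost is n x tokens.
    answers = [extract_answer(call_model(question, temperature=0.8)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```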
Reasoning effort / thinking budget. A first-class parameter on Claude (extended_thinking) and OpenAI (reasoning_effort: low / medium / high). Specifies how many internal-reasoning tokens the model is allowed to spend before emitting the final answer. Higher effort is monotonically better on reasoning benchmarks but with diminishing returns past ~10K tokens.
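Request shapes for the two parameters, sketched as plain dicts rather than pinned SDK calls; the field names follow this entry, the model ids are placeholders, and your SDK version may differ:

```python
# Hedged sketch: request shapes vary by SDK version; treat as illustration only.
anthropic_request = {
    "model": "claude-opus-4-7",                              # placeholder model id
    "max_tokens": 2048,
    "thinking": {"type": "enabled", "budget_tokens": 8000},  # thinking budget
}
openai_request = {
    "model": "gpt-5-5-pro",                                  # placeholder model id
    "reasoning": {"effort": "high"},                         # reasoning effort tier
}
```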
Plan. An ordered sequence of steps emitted by a planning model. Plans are typically structured (JSON, function list, or step-numbered prose). A plan is data; the agent loop is the runtime that executes it.
Subgoal. An intermediate target produced by decomposing the user goal. Subgoals are the unit of work delegated to worker agents in hierarchical topologies.
Scratchpad. An ephemeral working area in context where the model writes intermediate calculations or notes. Distinct from memory (persistent across turns) and from output (rendered to the user). The scratchpad is consumed by the next model call in the same loop.
Reasoning mode. A model setting that toggles extended internal-reasoning. On Claude Opus 4.7, "extended thinking" reserves a separate token budget visible only to the model; on GPT-5.5 Pro, "high reasoning" does the same. Usage is billed at the standard token rate.
Verifier. A separate model call (or rule) that checks a candidate answer for correctness, format compliance, or policy adherence. Verifiers are the standard pattern for catching hallucinations in production: emit answer, verify, regenerate on failure.
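The emit-verify-regenerate pattern in sketch form, with hypothetical call_model and verify helpers:

```python
def answer_with_verifier(question: str, max_retries: int = 2) -> str:
    # Emit a candidate, check it, regenerate on failure. The verifier can be
    # a second model call or a deterministic rule (schema check, citation check).
    prompt = question
    for _ in range(max_retries + 1):
        candidate = call_model(prompt)          # hypothetical generator call
        verdict = verify(question, candidate)   # hypothetical: {"pass": bool, "reason": str}
        if verdict["pass"]:
            return candidate
        prompt = f"{question}\n\nPrevious attempt failed verification: {verdict['reason']}"
    return candidate  # last candidate, flagged for human review downstream
```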
"In 2024 we said chain-of-thought. In 2026 we say reasoning effort, thinking budget, and verifier-loop. The vocabulary stabilized around runtime parameters."— Internal vocabulary review, May 2026
03 — Domain 03: Memory & state.
How agents persist information across calls, sessions, and deployments. Memory vocabulary fragments along two axes: scope (within a turn, across a session, across users) and modality (text, embedding, structured fact, graph node).
Short-term memory. Information held within the active context window. Bounded by the window size and the cache topology. When the agent loop ends, short-term memory is gone unless explicitly persisted.
Long-term memory. Information persisted outside the context window — typically in a vector store, key-value store, or relational table. Retrieved on demand into the active context. Anthropic's Memory API and OpenAI's "memory" feature are both long-term-memory layers.
Episodic memory. A record of past interactions indexed by event (one row per session, conversation, or task). Used to recall how a user has been served before. Distinct from semantic memory.
Semantic memory. A structured store of facts extracted from interactions or documents — typically a knowledge graph or fact-triple table. Used for the kind of recall that doesn't change between sessions: a customer's preferences, an account's contract terms.
Procedural memory. Compiled "how-to" knowledge — saved playbooks, learned workflows, frozen prompt templates. Survives session boundaries; updated by reflection rather than real-time conversation.
State. The current snapshot of an agent's intermediate values — open tools, in-flight subgoals, partial results. Distinct from memory (persisted) and from context (rendered to model). State lives in the orchestrator runtime (LangGraph state, OpenAI thread state).
Context window. The maximum token count the model can attend to in a single call. Headline numbers in 2026: GPT-5.5 (1M), Claude Opus 4.7 (1M), Gemini 2.5 (2M), DeepSeek V4 (1M). Effective use of long windows requires cache-aware design.
Working memory. The portion of context devoted to transient task state — intermediate calculations, tool results, scratchpad. Distinct from system prompt, user input, and reference material.
Memory write. An explicit operation to persist an extracted fact or summary into long-term memory. Standard hooks: after each turn, on session end, or when an explicit save_memory tool is called.
Memory recall. Retrieval of relevant long-term memory entries into active context. Implemented via vector search (semantic), keyword filter (episodic), or graph traversal (knowledge-graph memory).
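A toy write/recall pair, assuming a hypothetical embed call; production systems swap the in-memory list for a real vector store:

```python
import math

class ToyLongTermMemory:
    # Embed on write, cosine-rank on recall. `embed` is a hypothetical
    # embedding call returning a list of floats.
    def __init__(self):
        self.entries: list[tuple[list[float], str]] = []

    def write(self, text: str) -> None:
        self.entries.append((embed(text), text))

    def recall(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```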
- Memory taxonomy: Short → working → episodic → semantic. Each tier has a different retention horizon and retrieval cost. Production agents almost always combine 3 of the 4.
- Retrieval taxonomy: Vector · keyword · graph. Vector dominates for semantic similarity; keyword for exact-match recall; graph for structured facts and relationships.
- Persistence model: Per-turn · session-end · tool-call. Per-turn writes capture micro-events; session-end writes create episodic records; tool-call writes are explicit user-controlled persistence.

04 — Domain 04: Tool use & MCP terminology.
The protocol-level vocabulary stabilized in 2026 around the Model Context Protocol (MCP). These are the canonical terms; older framework-specific aliases (function, plugin) still appear but are deprecated for new work.
Tool. A typed function the agent can call with structured arguments and from which it receives a structured result. The successor to "function call" and "plugin" — now the shared term across Anthropic, OpenAI, and the MCP working group.
Tool call. One invocation of a tool. Structured JSON request from model to runtime. A single agent step may issue several parallel tool calls.
Tool result. The structured response returned to the model after a tool executes. Distinct from the model's interpretation of that result in the next turn.
MCP server. A process that exposes one or more primitives (tools, prompts, resources) via the MCP transport protocol. Servers are language-agnostic — implementations exist in TypeScript, Python, Rust, and Go.
MCP client. A consumer of MCP servers — typically embedded in an agent runtime (Claude Desktop, OpenAI Agents SDK, LangChain MCP client). The client handles protocol handshake, capability negotiation, and tool-call routing.
MCP transport. The communication channel between client and server. Three canonical transports: stdio (local process), HTTP+SSE (remote), and Streamable HTTP (newer; replaces SSE for production deployments).
MCP primitive. One of the four MCP-defined capabilities: tools (functions the agent calls), prompts (parameterized prompt templates), resources (read-only data the agent can attach), and sampling (a hook that lets the server delegate inference to the client).
Sampling. An MCP primitive in which the server asks the client to run an LLM completion on its behalf. Used so the server doesn't carry its own LLM dependency or API keys. Critical for governance: the client controls model choice and keys.
Resource. A read-only data stream the server exposes (file, URL, database snapshot). Resources are attached to context on demand; distinct from tool results, which are produced by tool calls.
Root. The base URI that anchors a resource namespace — for filesystem servers, the root directory; for HTTP servers, the base URL. Roots define the security boundary for what the server can access.
Capability negotiation. The MCP handshake step where client and server exchange supported primitives and features. Servers advertise what they offer; clients pick what they consume.
Tool schema. The JSON Schema definition of a tool's input and output. The schema is the contract: the model uses it to construct tool calls; the runtime uses it to validate and route.
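A hypothetical schema for illustration; the field names follow Anthropic's input_schema convention, and other runtimes differ slightly:

```python
# Hypothetical tool schema. The JSON Schema is the contract: the model uses
# it to construct calls; the runtime uses it to validate and route them.
get_weather = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. Lisbon"},
            "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}
```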
05 — Domain 05: Evaluation & observability.
How teams measure whether agents work — what to log, what to grade, and how to compare runs. This vocabulary diverges most across teams; locking it down in contracts avoids the "your-success-rate-isn't-our-success-rate" arguments that derail handovers.
Success rate. The fraction of runs that complete the intended goal, graded by some judge. Always specify the granularity: step success (each step succeeds), plan success (every step in the plan completes), end-to-end success (the user goal is met regardless of internal failures). Default to end-to-end for executive reporting.
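The three granularities computed side by side, assuming a hypothetical trace shape:

```python
def success_metrics(traces: list[dict]) -> dict:
    # Hypothetical trace shape: {"steps": [{"ok": bool}, ...], "goal_met": bool}.
    n = len(traces)
    steps = [s["ok"] for t in traces for s in t["steps"]]
    return {
        "step_success": sum(steps) / len(steps),                            # per-step grade
        "plan_success": sum(all(s["ok"] for s in t["steps"]) for t in traces) / n,
        "end_to_end_success": sum(t["goal_met"] for t in traces) / n,       # executive metric
    }
```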
Tool-call accuracy. The fraction of tool calls that produce a valid, on-policy result. Distinct from success rate — an agent can have 95% tool accuracy and 40% end-to-end success if the tool calls don't add up to the goal.
Plan validity. The fraction of generated plans that are syntactically and semantically executable — every step references an available tool with valid arguments, and the order respects declared dependencies.
Trace. A structured log of one agent run — input, every step, every tool call, every model output, final result. The unit of debugging. LangSmith, Langfuse, Arize Phoenix, and Helicone are production trace stores.
Span. A single timed unit within a trace — one model call, one tool call, one retrieval. Spans are the unit of observability; aggregations over spans give latency, cost, and error rate by step type.
Eval set. A curated collection of input-expected output pairs used to grade an agent. Production eval sets are versioned; runs on different versions are not directly comparable.
LLM-as-judge. Using an LLM (typically a stronger or independent model) to grade outputs against a rubric. Cheap to scale; well-known biases (length, recency, self-preference). Must be calibrated against human grades on a sample.
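A minimal judge sketch; the rubric and call_model helper are hypothetical, and the scores only mean something after calibration against human grades on a sample:

```python
import json

JUDGE_RUBRIC = (
    "Score the answer 1-5 on correctness and faithfulness to the context. "
    'Return JSON: {"correctness": int, "faithfulness": int, "rationale": str}'
)

def judge(question: str, context: str, answer: str) -> dict:
    # Use a stronger or independent model to limit self-preference bias.
    prompt = f"{JUDGE_RUBRIC}\n\nQuestion: {question}\nContext: {context}\nAnswer: {answer}"
    return json.loads(call_model(prompt))  # hypothetical model call returning JSON text
```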
Rubric. A structured grading scale — typically a numeric or categorical score across one or more dimensions (correctness, helpfulness, faithfulness, safety). The contract with the judge.
Faithfulness. A RAG-specific metric measuring whether the answer is supported by the retrieved context. Standard in the Ragas, TruLens, and Arize evaluation libraries.
Context relevance. Whether the retrieved context is actually relevant to the user query. Faithfulness without context relevance produces correct-looking answers grounded in the wrong source.
Cost per successful task. Total spend divided by number of end-to-end successes. The unit metric that matters for production economics — it surfaces retry waste and failed-run spend that $/1K-tokens hides.
06 — Domain 06: Governance & safety terms.
The vocabulary that maps directly to regulatory frameworks — EU AI Act, NIST AI RMF, ISO 42001 — plus the operational safety terms agents are governed by. When legal and compliance push back, citing the underlying article number is faster than re-defining the term.
Foundation model. EU AI Act and NIST RMF term for a large pre-trained model used across multiple downstream tasks. Triggers specific transparency and risk-assessment obligations under EU AI Act Articles 52-55 (GPAI provisions).
GPAI (General-Purpose AI). EU AI Act category for foundation models, with two sub-tiers: GPAI and GPAI with systemic risk (training compute >10^25 FLOPs). Different obligations apply to each.
High-risk AI system. EU AI Act Annex III category. Includes biometric ID, hiring, education scoring, credit, and critical-infrastructure AI. Triggers conformity assessment, documentation, and post-market monitoring obligations.
Approval gate. A point in an agent workflow where human approval is required before the agent proceeds. Standard governance pattern for irreversible actions (sending email, posting publicly, modifying production data).
Human-in-the-loop (HITL). A workflow shape where a human reviews each agent decision before execution. Distinct from human-on-the-loop (HOTL), where the human monitors but doesn't gate every step.
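A sketch of an approval gate wrapping irreversible tools; the policy list and request_human_approval hook are hypothetical:

```python
IRREVERSIBLE = {"send_email", "post_public", "write_prod_db"}  # example policy list

def gated_tool_call(tool: str, args: dict) -> dict:
    # HITL on irreversible actions only; everything else runs HOTL-style,
    # monitored but not gated step by step.
    if tool in IRREVERSIBLE and not request_human_approval(tool, args):  # hypothetical hook
        return {"status": "rejected", "reason": "human approval denied"}
    return run_tool(tool, args)  # hypothetical tool runtime
```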
Refusal. The model's ability to decline a request that violates its safety policy. Refusal accuracy is a measured evaluation: false refusals (refusing acceptable requests) and false compliances (executing harmful requests) are tracked separately.
Prompt injection. An attack pattern where adversarial content in retrieved data manipulates the model into taking unintended actions. Standard mitigation: structured tool-result schemas, instruction hierarchy (system > user > data), and input filtering.
Jailbreak. An attack pattern where adversarial prompts circumvent the model's safety training. Tracked separately from prompt injection because the attack vector is user-side, not data-side.
Red team. A coordinated adversarial test of an AI system — humans or other AI systems probing for failure modes, jailbreaks, and policy violations. NIST RMF Manage function requires periodic red-teaming for high-impact systems.
Model card. Standardized documentation of a model's capabilities, limitations, training data, and evaluation results. Originated by Mitchell et al. (2018); now required under EU AI Act for GPAI providers.
System card. Documentation for a deployed AI system (model + scaffolding + safety controls). Distinct from model card; covers the production configuration. Anthropic and OpenAI publish system cards on major releases.
"When legal asks why we let the agent send email without approval, citing EU AI Act Article 14 on human oversight ends the conversation in 30 seconds."— Internal compliance retrospective, April 2026
07 — Domain 07: Deployment & economics vocabulary.
The cost and operations vocabulary. Token economics is the largest source of post-launch surprise on agent projects; locking this vocabulary down in your finance and engineering shared docs prevents the "we thought it cost $X / we measured $4X" reckoning.
Token. The basic billing and computation unit. Roughly 0.75 English words per token; varies by tokenizer. Input tokens are charged for context fed to the model; output tokens are charged for generated text; reasoning tokens (Claude extended thinking, OpenAI reasoning effort) are billed at output rates but invisible in the response.
Prefill. The model's first pass over the input context, computing key-value attention states. Cost-dominated by context length; latency-dominated by hardware throughput. Heavily cacheable.
Decode. Token-by-token output generation after prefill. Dominated by per-token throughput; non-cacheable. Decode latency scales with output length; output cost is the output-token rate × output length.
Prefix cache. A cache of pre-computed attention states for a recurring input prefix. When the next request shares the prefix bytes, the model reads cached states instead of recomputing. Anthropic, OpenAI, Google, and DeepSeek all offer prefix caching as of 2026.
Cache hit / cache miss. A cache hit reads pre-computed state at the cached read rate (~10% of input rate). A miss recomputes and writes new state, charged at the cache write rate (~125% of input rate). High hit-rates are the lever on long-context economics.
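A back-of-envelope blended rate under the multipliers in this entry; treat them as illustrative and check your provider's current price sheet:

```python
def effective_input_rate(rack_rate: float, hit_rate: float,
                         read_mult: float = 0.10, write_mult: float = 1.25) -> float:
    # Blended $/token on input: hits read at ~10% of the input rate,
    # misses pay the ~125% cache-write premium.
    return rack_rate * (hit_rate * read_mult + (1.0 - hit_rate) * write_mult)

# Worked example: $3.00 per 1M input tokens at a 90% hit rate
# -> 3.00 * (0.9 * 0.10 + 0.1 * 1.25) = $0.645 per 1M tokens.
print(effective_input_rate(3.00, 0.90))  # 0.645
```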
Cache TTL. Time-to-live for cached state. Anthropic offers 5-minute (default), 1-hour, and 24-hour tiers with progressively higher write premiums. Pick by the stability of your cached prefix, not by intuition.
Batch tier. Discounted pricing for asynchronous (non-real-time) requests. Typically 50% off rack rate with a 24-hour SLA. Right for evaluation runs, content generation, and background scoring.
Reserved capacity. Pre-purchased throughput at a committed-spend discount. Enterprise-only on most providers. Trades flexibility for predictability.
Provisioned throughput. Dedicated capacity with guaranteed latency at a fixed hourly rate. Right for production workloads with strict SLA requirements; expensive at low utilization.
Cost per task. Total spend per end-to-end successful task completion. The single number worth defending in executive reviews — captures token efficiency, success rate, and retry cost in one metric.
Blended rate. The effective $/token rate after cache hits, batch discounts, and reserved capacity are factored in. Always lower than rack rate. The number to use when forecasting at scale.
08 — Conclusion: A shared vocabulary shortens every conversation.
200 terms is not the answer; agreement on which 200 is.
Glossaries aren't valuable because they enumerate every possible term. They're valuable because they let two teams who don't normally talk to each other point to the same row when a design review or a contract gets stuck. Pick one — this one or another — and use it consistently across your engineering, evaluation, governance, and finance docs.
The seven domains in this glossary — control flow, planning, memory, tools and MCP, evaluation, governance, economics — cover ~95% of what production agent specs need. The remaining 5% usually lives in domain-specific extensions (medical AI, financial AI, code-generation specifics) that warrant their own sub-glossary.
The cross-references between entries are deliberate. Read prefix cache in the economics section and you should follow the link to re-anchor in the control-flow section, then to working memory in the memory section. The vocabulary is a graph; treating it as one is what makes the glossary load-bearing.