Skip to content
Search ESC

Context Engineering for Production AI Agents

2026-03-24 · Updated 2026-04-03 · 14 min read · Igor Bobriakov
TL;DR
  • Context engineering is the practice of deliberately constructing what an agent sees — memory, retrieved documents, tool outputs, state — not just how it's instructed. Prompt engineering is a subset.
  • In document-heavy agent workflows, dynamic context assembly often matters more than another prompt rewrite because it controls what evidence the model can actually see.
  • The four context layers — instruction, episodic memory, retrieved knowledge, and tool state — must be managed independently with explicit priority and eviction rules, or they corrupt each other under load.
  • Context window overflow is not a token budget problem. It's an architecture problem: agents without eviction policies silently drop the most structurally important context when windows fill.
  • LangGraph's checkpoint-based state enables per-node context scoping, which lets you pass full conversation history to a supervisor while giving specialist subagents only the slice they need.
  • Model routing by context density lets teams reserve higher-reasoning models for synthesis-heavy calls while sending short structured steps to cheaper models.

Prompt engineering was always a workaround. You couldn’t control what the model knew, so you controlled how you asked. But when you move from a single-turn chatbot to a production agent that accumulates tool outputs, retrieves documents, maintains conversation history, and hands state to downstream subagents, the words in your system prompt become the least important variable in the equation. What the model sees — the full assembled context at every inference call — is what determines output quality, latency, and cost.

Context engineering is the discipline of deliberately constructing that assembled context: what gets included, in what order, at what priority, and what gets evicted when the window fills. In document-heavy agent systems, restructuring the context assembly layer can change groundedness more than another prompt rewrite because the prompt is not the evidence. The architecture decides which evidence reaches the model at all.

Why Prompt Engineering Fails at Agent Scale

Prompt engineering assumes a relatively static input: a user query plus a fixed system instruction. In an agentic system, the input is dynamic, accumulating, and often adversarial to your token budget. By the third tool call in a multi-step research agent, the “context” reaching the model might include a system prompt, three prior conversation turns, four tool outputs of variable length, and two retrieved document chunks — assembled in whatever order your scaffolding appended them. If you haven’t made deliberate choices about that assembly, you’ve made accidental ones.

The failure modes that prompt engineering cannot fix:

  • Context poisoning — a low-relevance retrieved chunk that contradicts the high-relevance chunk, causing the model to hedge or hallucinate a synthesis
  • Silent truncation — when the context window fills, most frameworks truncate from the front, silently dropping system instructions
  • State bleed — tool outputs from prior agent steps leaking into the context of a specialist subagent that has no business seeing them
  • Positional degradation — the documented “lost in the middle” phenomenon where models under-attend to information placed in the middle of long contexts (Nelson et al., “Lost in the Middle”, 2023)

These are structural problems. A better-worded instruction cannot fix a structurally corrupted context window. For a deeper look at how stateful agent architectures create these pressures, our LangGraph state management guide covers the checkpoint layer that makes context scoping tractable.

Context window overflow is not a token budget problem — it’s an architecture problem. Agents without explicit eviction policies silently drop system instructions when windows fill, producing unpredictable outputs with no error signal.

The Four Context Layers Every Production Agent Needs

A production agent’s context is not a single string. It’s composed of four structurally distinct layers that must be managed independently, with explicit priority ordering that governs what gets evicted under token pressure. Treating them as a flat concatenation is the most common context engineering mistake we encounter in inherited codebases.

Four context layers assembled at each agent node: system instructions at highest priority, episodic memory, retrieved knowledge, and tool state at lowest priority, all flowing through an eviction policy before reaching LLM inference Diagram 1: The four context layers assembled at each agent node — instruction, episodic memory, retrieved knowledge, and tool state — with explicit priority ordering for eviction.

Our production RAG pipeline checklist covers the reranking and deduplication implementations in detail. The key integration point for context engineering: the reranker score becomes your relevance gate threshold, and that threshold is a tunable parameter per agent role — a validation agent needs higher confidence than an exploratory research agent.

from sentence_transformers import CrossEncoder
from dataclasses import dataclass
RERANKER = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
@dataclass
class RetrievedChunk:
content: str
source: str
similarity_score: float
rerank_score: float = 0.0
def gate_retrieval_context(
query: str,
raw_chunks: list[RetrievedChunk],
relevance_threshold: float = 0.3,
max_chunks: int = 5,
dedup_threshold: float = 0.92,
) -> list[RetrievedChunk]:
"""
Apply reranking, relevance gating, and deduplication to raw retrieval results.
relevance_threshold: minimum reranker score to pass the gate (0.3 = moderate confidence)
dedup_threshold: cosine similarity above which two chunks are considered duplicates
"""
if not raw_chunks:
return []
# Rerank
pairs = [(query, chunk.content) for chunk in raw_chunks]
scores = RERANKER.predict(pairs)
for chunk, score in zip(raw_chunks, scores):
chunk.rerank_score = float(score)
# Gate by relevance
gated = [c for c in raw_chunks if c.rerank_score >= relevance_threshold]
gated.sort(key=lambda c: c.rerank_score, reverse=True)
# Deduplicate: keep highest-scoring, skip near-duplicates
# Simplified: in production use embedding cosine similarity for dedup
seen_sources: set[str] = set()
deduplicated: list[RetrievedChunk] = []
for chunk in gated:
# Source-level dedup as minimum viable implementation
if chunk.source not in seen_sources:
deduplicated.append(chunk)
seen_sources.add(chunk.source)
if len(deduplicated) >= max_chunks:
break
return deduplicated

Retrieval context gating — reranking, relevance thresholding, and deduplication applied before context assembly — addresses the structural root cause of hallucination in RAG agents more directly than any instruction-level prompt modification.

Model Routing by Context Density

Once you treat context as a structured artifact, model selection becomes a context-aware routing decision rather than a global configuration. Not every agent node requires the same model, and the cost difference is not marginal. A fast lower-cost model is usually a better fit for short, structured retrievals and classification steps where context is dense but decision complexity is low. A higher-reasoning model belongs on multi-document synthesis steps where the assembled context is large, contradictory, or requires multi-hop reasoning.

Inference Cost: Flat premium reasoning model (all nodes):
Every node uses the same premium model, including short classification and routing steps = baseline cost index 1.0

Inference Cost: Routed (fast model for structured, premium model for synthesis):
Most structured tokens routed to the fast model, synthesis-heavy tokens routed to the premium model = material cost reduction, no measured accuracy regression on structured tasks

The routing key is context density score: the ratio of retrieved knowledge tokens to total context tokens. When a node’s context is dominated by retrieved documents requiring synthesis, route to the higher-reasoning model. When it is primarily structured state and a short query, route to the fast model. This heuristic requires no ML model — it is computable from token counts before the inference call.

Expert Insight: Set your context budget as a fraction of model maximum, not an absolute number When you swap models, the context budget should auto-scale as a fraction of the new model’s window, not remain as an absolute token count. Resolve the budget from model_max_tokens at routing time and reserve explicit headroom. This prevents the budget from silently becoming a hard ceiling when you route to a smaller model mid-deployment.

What Breaks at Scale

The patterns above are sound at moderate concurrency. At scale — hundreds of concurrent agent sessions, long-running tasks measured in hours, or multi-agent graphs with 8+ specialist nodes — several failure modes emerge that the architecture must anticipate.

Checkpoint store latency under concurrent load. LangGraph’s Redis-backed checkpoint store can add visible latency when many agent sessions hydrate state at once. The fix: pre-warm the checkpoint store during pod startup, and use read replicas for state hydration in supervisor nodes. Do not rely on the default single-node Redis configuration in production.

Eviction policy drift across agent versions. When you deploy a new version of an agent node with a different context schema, in-flight sessions in the checkpoint store may have messages that the new eviction policy misclassifies. We maintain a schema version field in AgentState and apply migration logic in the state loader when the version doesn’t match. Skipping this causes the eviction policy to silently remove the wrong layer during the rollout window.

Reranker latency at high chunk volume. Cross-encoder rerankers can become synchronous bottlenecks when every agent call waits on large chunk batches. In production, run the reranker as a dedicated async service with its own latency budget, separate from the agent’s main inference path. Agents that fail to get a reranker response within that budget should fall back to raw-similarity retrieval with a warning log — not a hard failure.

Context injection by adversarial tool outputs. If your agent calls external APIs and injects those outputs into Layer 4, a malicious or misconfigured external service can inject content designed to override Layer 1 instructions — the classic prompt injection via retrieved content. For agents operating in high-trust environments, Layer 4 tool outputs must be sanitized before context assembly. This is not a theoretical risk; for a deeper treatment of the threat model, our self-correcting agent architecture guide covers validation loops that catch this class of failure.

Multi-agent systems with 8+ specialist nodes require dedicated async reranker infrastructure, checkpoint store pre-warming, and schema-versioned state migration — or context engineering degrades silently as concurrency increases.

Frequently Asked Questions

What is the difference between context engineering and prompt engineering?

Prompt engineering focuses on the wording of instructions given to a model. Context engineering is the broader discipline of managing everything the model sees at inference time — instructions, retrieved documents, conversation history, tool outputs, and structured state. Prompt engineering is one input to context engineering, not a substitute for it. In production agents, the quality of dynamically assembled context consistently matters more than prompt phrasing.

How do you prevent context window overflow in a production AI agent?

You need an explicit eviction policy defined at the architecture level, not handled reactively by truncation. The safest pattern is priority-ranked eviction: tool outputs first, then distant episodic memory, then retrieved chunks, and never the system instruction layer. In LangGraph, implement this as a state reducer that trims before each node transition, not after. Waiting until the window is full means the model has already received corrupted context.

Can context engineering reduce AI agent hallucinations?

Yes, and it is often more effective than prompt-based anti-hallucination instructions. Hallucinations in retrieval-augmented agents are commonly caused by retrieved context that contradicts, duplicates, or is irrelevant to the query — not by poor instructions. Reranking, deduplication, and relevance gating on the retrieval layer address the root cause.

What is the right context window size for a production AI agent?

The right size is the smallest window that contains all the context actually needed for the decision — not the maximum the model supports. Larger windows increase inference latency and cost, and long-context models can exhibit measurable accuracy degradation on tasks requiring precise retrieval from very long contexts (the “lost in the middle” phenomenon documented by Nelson et al., 2023). Set a context budget below the model maximum and leave explicit headroom for tool output bursts.

The decision rule

Treat context as an architecture artifact when the agent’s failures come from missing, stale, contradictory, or overlong evidence. If the same prompt behaves differently as retrieved context changes, the next fix belongs in assembly, gating, routing, or eviction policy, not in another instruction rewrite.

Technical Review

Bring the system under review

Send the system context, constraints, and pressure. A Principal Engineer reviews it and recommends the next step.

[ SUBMIT SPECS ]

No SDRs. A Principal Engineer reviews every submission.

About the author

Igor Bobriakov

AI Architect. Author of Production-Ready AI Agents. 15 years deploying production AI platforms and agentic systems for enterprise clients and deep-tech startups.