Context Engineering for Production Agents: The Discipline Replacing Prompt Engineering

2026-03-18 · 18 min read · Igor Bobriakov
TL;DR
  • Context engineering -- not prompt engineering -- is the primary reliability lever for production agents when the failure comes from what the model sees, not how the instruction is phrased.
  • A four-tier memory hierarchy (in-context, working cache, episodic/vector, archival/SQL) keeps full conversation history out of every call and retrieves only the information needed for the current step.
  • Dynamic context injection using semantic retrieval at query time is usually safer than stuffing static RAG context into the system prompt for every agent turn.
  • Context budgeting -- explicitly allocating token quotas per layer -- prevents silent truncation failures that are a common cause of unexplained production agent errors.
  • LangGraph's StateGraph with a custom ContextManager node gives teams a place to enforce context assembly before each model call.
  • Tool schema compression -- stripping verbose descriptions and examples from inactive tools -- reduces tool-definition token consumption when large tool catalogs are registered.
  • Prompt caching reduces cost for static context segments like system instructions and repeated retrieved documents, but it does not replace context budgeting.

Most agent failures in production are not caused by bad prompts. They are caused by bad context. The model receives the wrong information, too much information, the right information in the wrong position, or the right information formatted in a way that consumes three times the tokens it should. These are not prompt engineering problems. They are architectural problems — and the discipline that solves them is called context engineering.

Prompt engineering was the right mental model when the primary use case was a single-turn completion: write an instruction, iterate on the phrasing, ship it. Multi-step agents with tool use, long conversation histories, real-time retrieval, and state persistence break that model entirely. At that point, the context window is not a prompt — it is a runtime data structure with finite capacity, positional semantics, and significant cost implications. Managing it well is engineering work, not creative writing.

This article is a production engineering guide to context engineering: the decisions that separate a demo agent that works once from a production system that can handle repeated sessions without silent context degradation. We cover token budget architecture, memory hierarchies, dynamic injection patterns, tool schema management, and prompt caching — with working Python code using LangChain and LangGraph.

Diagram 1: The context window as a layered token budget. Each layer has a hard allocation; overflow triggers deterministic compression before the LLM call.

The Context Window Is a Finite Runtime Resource

GPT-4o ships with a 128k token context window. Claude 3.5 Sonnet supports 200k. It is tempting to treat these large windows as “effectively unlimited” and move on. This is wrong in at least three independent ways.

Long-context models can exhibit recall degradation — often called the “lost in the middle” effect — when relevant information is buried inside a large context window. This is not a hypothetical concern. In agent systems where retrieved documents, tool outputs, and conversation history pile up over multiple turns, you will routinely push critical context into weak positions unless you actively manage placement.

The second problem is cost. Large context windows make it easy to turn every agent step into an expensive full-history call. Context engineering that reduces average context size is not only a performance concern; it is a financial control.
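A rough arithmetic sketch of why full-history calls get expensive. It assumes every turn adds a fixed number of tokens and every call resends the prior turns; the session length, per-turn size, and window size below are hypothetical values chosen only for illustration.

```python
from typing import Optional


def cumulative_input_tokens(
    turns: int, tokens_per_turn: int, window_turns: Optional[int] = None
) -> int:
    """Total history tokens sent across `turns` calls.

    Call i resends i turns of history, or at most `window_turns`
    of them when a rolling window is applied.
    """
    total = 0
    for i in range(1, turns + 1):
        history_turns = i if window_turns is None else min(i, window_turns)
        total += history_turns * tokens_per_turn
    return total


# Hypothetical session: 50 turns, about 500 tokens added per turn.
full_history = cumulative_input_tokens(turns=50, tokens_per_turn=500)
rolling_window = cumulative_input_tokens(turns=50, tokens_per_turn=500, window_turns=6)
print(full_history, rolling_window)
```

Under these assumptions the full-history total grows quadratically with the number of turns, while the windowed total grows linearly once the window is full.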

The third problem is latency. Time-to-first-token scales with input length. Very large prompts add pre-fill latency before the model can start responding, turning a responsive agent into one that feels broken to users.

The solution is to stop treating the context window as a document and start treating it as a structured runtime resource with explicit allocation per layer:

| Context Layer | Content | Budget Role | Compression Strategy on Overflow |
| --- | --- | --- | --- |
| System | Persona, constraints, output format | Protected | Not compressible; optimize statically |
| Tool Schemas | Active tool definitions | Bounded | Strip inactive tools, compress descriptions |
| Retrieved Context | RAG documents, KB snippets | Bounded | Top-k truncation, extractive summarization |
| Conversation History | Prior turns in session | Bounded | Rolling window + abstractive summarization |
| Agent Scratchpad | Intermediate tool outputs, reasoning steps | Bounded | Prune completed steps, keep final results |
| Current Turn | User message + injected data | Protected user message, bounded injected data | Truncate injected data, not user message |
| Reserve | Output space + safety margin | Protected | N/A (protected) |
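The layered allocation above could be expressed as a small configuration object that is validated at startup. This is a minimal sketch, not a prescribed implementation: `LayerBudget` and `build_budget` are hypothetical names, and all quotas are illustrative numbers (the history and retrieved-context values simply mirror the constants used in this article's code).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LayerBudget:
    name: str
    max_tokens: int
    protected: bool = False  # Protected layers are never compressed at runtime


def build_budget(context_window: int, layers: List[LayerBudget]) -> Dict[str, int]:
    """Check that the layer quotas fit in the window; return name -> quota."""
    allocated = sum(layer.max_tokens for layer in layers)
    if allocated > context_window:
        raise ValueError(
            f"Layers allocate {allocated} tokens but the window is {context_window}"
        )
    return {layer.name: layer.max_tokens for layer in layers}


# Illustrative quotas for a 128k-token window (assumed values).
LAYERS = [
    LayerBudget("system", 2_000, protected=True),
    LayerBudget("tools", 6_000),
    LayerBudget("retrieved", 30_000),
    LayerBudget("history", 20_000),
    LayerBudget("scratchpad", 15_000),
    LayerBudget("current_turn", 8_000),
    LayerBudget("reserve", 16_000, protected=True),
]

budget = build_budget(128_000, LAYERS)
unallocated = 128_000 - sum(budget.values())
print(budget["history"], unallocated)
```

Failing fast when the quotas do not fit the window keeps a misconfiguration from surfacing later as truncated context.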

The Four-Tier Memory Architecture

Naive agents store everything in the context window. After three conversation turns, the history alone consumes 40% of the budget. After ten turns, the agent starts silently truncating or hallucinating because it cannot see its own earlier tool outputs. The fix is a memory architecture with four distinct tiers, each with its own storage backend and retrieval mechanism.

Diagram 2: Four-tier agent memory architecture. Only Tier 1 is in the context window; Tiers 2-4 surface content via explicit retrieval into the context budget.

Tier 1 — In-Context Working Memory: The active context window itself. Contains only what is needed for the current reasoning step. Everything else lives outside and is fetched on demand.

Tier 2 — Session Memory (Redis): Full conversation history for the current session, stored as a Redis list. The agent reads only the last N turns by default; older turns are summarized and stored as a compressed summary blob. Session expiry should align with the application’s user-session TTL.

Tier 3 — Episodic Memory (Vector DB): Semantically indexed memories from past sessions — task completions, user preferences, resolved ambiguities. Retrieved by embedding similarity at the start of each new session. Pinecone, pgvector, and Weaviate are all production-viable options here; the choice depends on your existing stack more than performance differences at <1M vectors.

Tier 4 — Archival Memory (SQL/Document DB): Structured facts about entities the agent manages — user profiles, account state, configuration. Never retrieved semantically; always fetched by explicit ID lookup. Mixing this with vector retrieval is a common anti-pattern that pollutes episodic search results with structured records.

A four-tier memory hierarchy with explicit retrieval gates reduces context bloat versus naive full-history injection because the agent retrieves what it needs rather than receiving everything by default.
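To make the retrieval gates concrete, here is a toy sketch in which in-memory structures stand in for Redis, the vector database, and the SQL store. The `TieredMemory` class and its method names are hypothetical, and the word-overlap scoring is only a placeholder for embedding similarity.

```python
from typing import Dict, List


class TieredMemory:
    """In-memory stand-ins for Tiers 2-4; Tier 1 is whatever the caller assembles."""

    def __init__(self, recent_turns: int = 2):
        self.recent_turns = recent_turns
        self.session: Dict[str, List[str]] = {}         # Tier 2 (Redis list per session)
        self.episodic: List[str] = []                   # Tier 3 (vector DB)
        self.archival: Dict[str, Dict[str, str]] = {}   # Tier 4 (rows keyed by ID)

    def recall_session(self, session_id: str) -> List[str]:
        """Only the last N turns surface by default."""
        return self.session.get(session_id, [])[-self.recent_turns:]

    def recall_episodic(self, query: str, k: int = 1) -> List[str]:
        """Placeholder similarity: count of shared lowercase words."""
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(memory.lower().split())), memory)
            for memory in self.episodic
        ]
        matches = [item for item in scored if item[0] > 0]
        matches.sort(key=lambda item: item[0], reverse=True)
        return [memory for _, memory in matches[:k]]

    def recall_archival(self, entity_id: str) -> Dict[str, str]:
        """Explicit ID lookup, never semantic search."""
        return self.archival.get(entity_id, {})


memory = TieredMemory(recent_turns=2)
memory.session["s1"] = ["turn 1", "turn 2", "turn 3"]
memory.episodic = ["user prefers CSV exports", "billing dispute resolved in March"]
memory.archival["user-42"] = {"plan": "enterprise"}

print(memory.recall_session("s1"))
print(memory.recall_episodic("export the report as CSV"))
print(memory.recall_archival("user-42"))
```

Each tier exposes its own narrow read path, so nothing reaches the context window unless a specific call asks for it.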

Here is a production-grade Python implementation of a context-aware history manager using LangChain and Redis:

"""
context_manager.py -- Production session memory manager for LangChain agents.
Uses Redis for session storage with automatic summarization on budget overflow.
"""
import json
from typing import List, Optional

import tiktoken
import redis
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, SystemMessage
from langchain_openai import ChatOpenAI

REDIS_SESSION_PREFIX = "agent:session:"
MAX_HISTORY_TOKENS = 20_000      # Hard budget for conversation history layer
SUMMARY_TRIGGER_TOKENS = 16_000  # Summarize when approaching limit
SUMMARY_MODEL = "gpt-4o-mini"    # Use cheaper model for compression


class SessionContextManager:
    """
    Manages conversation history within a token budget.
    Stores full history in Redis; injects a token-bounded
    slice into each LLM call, summarizing overflow automatically.
    """

    def __init__(self, session_id: str, redis_client: redis.Redis):
        self.session_id = session_id
        self.redis = redis_client
        self.enc = tiktoken.encoding_for_model("gpt-4o")
        self._summarizer = ChatOpenAI(model=SUMMARY_MODEL, temperature=0)
        self._redis_key = f"{REDIS_SESSION_PREFIX}{session_id}"

    def _count_tokens(self, text: str) -> int:
        return len(self.enc.encode(text))

    def _messages_to_token_count(self, messages: List[dict]) -> int:
        total = 0
        for msg in messages:
            total += self._count_tokens(msg.get("content", ""))
            total += 4  # Per-message overhead (role, separators)
        return total

    def add_turn(self, human_content: str, ai_content: str) -> None:
        """Append a completed turn to persistent session storage."""
        turn = {
            "human": human_content,
            "ai": ai_content,
        }
        self.redis.rpush(self._redis_key, json.dumps(turn))
        self.redis.expire(self._redis_key, 3600)  # 1-hour session TTL

    def get_context_messages(self) -> List[BaseMessage]:
        """
        Return a token-bounded list of messages for injection into context.
        If history exceeds SUMMARY_TRIGGER_TOKENS, compresses oldest turns
        into a summary message before returning.
        """
        raw_turns = [
            json.loads(t) for t in self.redis.lrange(self._redis_key, 0, -1)
        ]
        if not raw_turns:
            return []

        # Build flat message dicts for token counting
        all_messages = []
        for turn in raw_turns:
            all_messages.append({"role": "user", "content": turn["human"]})
            all_messages.append({"role": "assistant", "content": turn["ai"]})

        total_tokens = self._messages_to_token_count(all_messages)
        if total_tokens <= SUMMARY_TRIGGER_TOKENS:
            # Within budget -- return all history
            return self._dicts_to_messages(all_messages)

        # Over budget -- summarize oldest half of turns, keep recent half verbatim.
        # Split on turn boundaries (pairs of user+assistant messages) to avoid
        # orphaning a user message from its response.
        num_turns = len(all_messages) // 2  # Each turn = 2 messages
        split_turn = num_turns // 2
        split_idx = split_turn * 2  # Always lands on a turn boundary
        to_summarize = all_messages[:split_idx]
        to_keep = all_messages[split_idx:]

        summary_text = self._summarize_messages(to_summarize)
        summary_message = {
            "role": "system",
            "content": f"[Earlier conversation summary]: {summary_text}"
        }
        final_messages = [summary_message] + to_keep
        final_tokens = self._messages_to_token_count(final_messages)

        if final_tokens > MAX_HISTORY_TOKENS:
            # Still over hard limit -- drop oldest turns until within budget.
            # Subtract removed turn tokens incrementally to avoid O(N^2)
            # re-counting from scratch on each iteration.
            while final_tokens > MAX_HISTORY_TOKENS and len(to_keep) > 2:
                removed = to_keep[:2]
                removed_tokens = self._messages_to_token_count(removed)
                to_keep = to_keep[2:]  # Remove oldest remaining turn
                final_messages = [summary_message] + to_keep
                final_tokens -= removed_tokens

        return self._dicts_to_messages(final_messages)

    def _summarize_messages(self, messages: List[dict]) -> str:
        """Compress a list of messages into a brief factual summary.

        NOTE: This calls the LLM synchronously for clarity. In production,
        use ainvoke() and make get_context_messages() async, or run
        summarization as a background task after each turn so that
        get_context_messages() never blocks on an LLM call.
        """
        formatted = "\n".join(
            f"{m['role'].upper()}: {m['content']}" for m in messages
        )
        prompt = (
            "Summarize the following conversation excerpt into 3-5 concise bullet "
            "points capturing decisions made, facts established, and unresolved tasks. "
            "Be specific -- include any entity names, IDs, or values mentioned.\n\n"
            f"{formatted}"
        )
        response = self._summarizer.invoke([HumanMessage(content=prompt)])
        return response.content

    @staticmethod
    def _dicts_to_messages(messages: List[dict]) -> List[BaseMessage]:
        role_map = {
            "user": HumanMessage,
            "assistant": AIMessage,
            "system": SystemMessage,
        }
        return [role_map[m["role"]](content=m["content"]) for m in messages]

Pro Tip: Always count tokens before the call, never guess. A very common cause of silent agent failures is assuming a context fits within the budget and being wrong. Every production context assembly pipeline must call a token counter (tiktoken for OpenAI models, Anthropic's token-counting API for Claude) before dispatching the request. Build a ContextBudgetError exception that fires when any layer exceeds its allocation, and log the overflow breakdown (which layer, by how many tokens) to your observability stack. This catches a class of bugs that is a leading cause of unexplained production agent failures.
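One possible shape for that exception and the check that raises it is sketched below. The `enforce_budget` helper is a hypothetical name, and the token counter is injected as a parameter; the whitespace-based counter in the example is only a stand-in for a real tokenizer.

```python
from typing import Callable, Dict


class ContextBudgetError(Exception):
    """Raised when one or more context layers exceed their token quota."""

    def __init__(self, overflows: Dict[str, int]):
        self.overflows = overflows  # layer name -> tokens over quota
        detail = ", ".join(
            f"{layer} over by {excess} tokens"
            for layer, excess in sorted(overflows.items())
        )
        super().__init__(f"Context budget exceeded: {detail}")


def enforce_budget(
    layers: Dict[str, str],
    quotas: Dict[str, int],
    count_tokens: Callable[[str], int],
) -> Dict[str, int]:
    """Count every layer, raise with a per-layer breakdown on overflow."""
    usage = {name: count_tokens(text) for name, text in layers.items()}
    overflows = {
        name: used - quotas[name]
        for name, used in usage.items()
        if used > quotas[name]
    }
    if overflows:
        raise ContextBudgetError(overflows)
    return usage


def word_count(text: str) -> int:
    """Stand-in counter (assumption): replace with a real tokenizer."""
    return len(text.split())


layers = {"system": "be concise", "history": "word " * 10}
quotas = {"system": 5, "history": 8}
try:
    enforce_budget(layers, quotas, word_count)
except ContextBudgetError as err:
    print(err)
```

Keeping the overflow map on the exception lets the caller forward the same breakdown to logs or metrics without parsing the message.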

Dynamic Context Injection: Retrieval-Augmented Agent Context

Static RAG — where you retrieve documents once at session start and inject them into the system prompt — works for simple Q&A assistants. It fails for agents because the information needed changes with every reasoning step. When an agent is on step 7 of a 15-step task, the documents relevant to step 7 are entirely different from those relevant to step 1. Injecting all potentially relevant documents upfront either blows the budget or forces you to under-retrieve and miss critical facts.

Dynamic injection solves this by treating retrieval as an agent action: either an explicit tool call or an automatic pre-turn hook that queries the vector store based on the current task state. Here is a LangGraph implementation of a context-injection node that runs before every agent reasoning step:

"""
context_injection_node.py -- LangGraph node for dynamic pre-turn context injection.
Retrieves semantically relevant documents based on current agent state and injects
them into the context budget before the LLM reasoning step.
"""
from typing import TypedDict, List, Optional, Annotated
import operator

from langchain_core.messages import BaseMessage, SystemMessage
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
import tiktoken


# ---- State Schema -------------------------------------------------------
class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], operator.add]
    current_task: str
    retrieved_context: Optional[str]  # Injected by this node
    context_token_count: int          # Tracked for budget enforcement


# ---- Configuration -------------------------------------------------------
RETRIEVED_CONTEXT_BUDGET = 30_000  # Token allocation for retrieved layer
TOP_K_DOCS = 8                     # Candidate documents before token truncation
PINECONE_INDEX = "prod-agent-kb"


# ---- Node Implementation -------------------------------------------------
class DynamicContextInjector:
    """
    LangGraph node: runs before each LLM step.
    Queries the vector store using the current task description + last user
    message, selects top-k documents, trims to token budget, and writes
    the result into AgentState.retrieved_context.
    """

    def __init__(self):
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = PineconeVectorStore(
            index_name=PINECONE_INDEX,
            embedding=self.embeddings,
        )
        self.enc = tiktoken.encoding_for_model("gpt-4o")

    def _build_query(self, state: AgentState) -> str:
        """
        Combine current task + last human message for retrieval query.
        Hybrid query is evaluated against task-only and message-only baselines.
        """
        last_human = next(
            (m.content for m in reversed(state["messages"])
             if m.type == "human"),
            ""
        )
        return f"{state['current_task']} {last_human}".strip()

    def _trim_to_budget(self, docs: List[str], budget: int) -> str:
        """
        Include documents in relevance order until the token budget is exhausted.
        Never splits a document mid-sentence -- drops the document if it does not fit.
        """
        result_parts = []
        used_tokens = 0
        for i, doc in enumerate(docs):
            doc_tokens = len(self.enc.encode(doc))
            if used_tokens + doc_tokens > budget:
                # Skip document if it does not fit -- do not truncate mid-document
                continue
            result_parts.append(f"[Source {i+1}]\n{doc}")
            used_tokens += doc_tokens
        return "\n\n".join(result_parts)

    def __call__(self, state: AgentState) -> dict:
        query = self._build_query(state)
        # Retrieve candidates from Pinecone
        docs = self.vectorstore.similarity_search(query, k=TOP_K_DOCS)
        doc_texts = [d.page_content for d in docs]
        # Trim to token budget
        context_text = self._trim_to_budget(doc_texts, RETRIEVED_CONTEXT_BUDGET)
        token_count = len(self.enc.encode(context_text))
        return {
            "retrieved_context": context_text,
            "context_token_count": token_count,
        }


def inject_context_into_messages(state: AgentState) -> List[BaseMessage]:
    """
    Utility: build the final message list for the LLM call,
    placing retrieved context as a system message immediately before
    the conversation history (positional priority: top of window).
    """
    messages = []
    if state.get("retrieved_context"):
        messages.append(SystemMessage(
            content=(
                "RETRIEVED CONTEXT (use this to answer the current task):\n\n"
                + state["retrieved_context"]
            )
        ))
    messages.extend(state["messages"])
    return messages

Tool Schema Management and Sparse Tool Activation

An agent with a large tool registry can carry thousands of tokens of tool definitions in every single LLM call by default. This is one of the most expensive and easily solved token waste patterns in production agents. Stripping inactive tool schemas from the context window — serving only the tools relevant to the current task phase — reduces tool-definition token consumption without asking the model to choose from irrelevant capabilities.

The architecture is a two-pass system: a fast, cheap LLM call (GPT-4o-mini) or a classifier model first identifies which tool category the current task requires, then only those tool definitions are injected into the full reasoning call. For agents with clear task phases (e.g., “research phase” vs. “execution phase” vs. “verification phase”), static phase-based tool sets are even simpler and faster.
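For the static phase-based variant, the selection logic can be as small as a dictionary lookup. The sketch below uses a hypothetical tool registry and phase map; the tool names, schema shape, and phase labels are all illustrative assumptions rather than a real framework's format.

```python
from typing import Dict, List

# Hypothetical registry: tool name -> simplified schema.
TOOL_REGISTRY: Dict[str, dict] = {
    "web_search": {"name": "web_search", "description": "Search the web."},
    "fetch_document": {"name": "fetch_document", "description": "Fetch a document by URL."},
    "run_sql": {"name": "run_sql", "description": "Run a read-only SQL query."},
    "send_email": {"name": "send_email", "description": "Send an email."},
}

# Hypothetical mapping from task phase to the tools exposed in that phase.
PHASE_TOOLS: Dict[str, List[str]] = {
    "research": ["web_search", "fetch_document"],
    "execution": ["run_sql", "send_email"],
    "verification": ["run_sql"],
}


def tools_for_phase(phase: str) -> List[dict]:
    """Return only the schemas that should be sent to the model in this phase."""
    if phase not in PHASE_TOOLS:
        raise ValueError(f"Unknown phase: {phase!r}")
    return [TOOL_REGISTRY[name] for name in PHASE_TOOLS[phase]]


active = tools_for_phase("research")
print([tool["name"] for tool in active])
```

Raising on an unknown phase, rather than falling back to the full registry, keeps a routing bug from silently restoring the baseline token cost.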

| Strategy | Token Savings | Latency Overhead | Routing Accuracy Impact | Best For |
| --- | --- | --- | --- | --- |
| Full tool schema (baseline) | None | None | Baseline | Small tool sets |
| Static phase-based tool sets | High | Minimal | Often positive because fewer distractors | Structured workflow agents |
| Semantic tool retrieval (embeddings) | High | Retrieval-dependent | Neutral to slight positive | Large, heterogeneous toolsets |
| LLM pre-pass classifier | High | Extra model call | Neutral when classifier coverage is good | Dynamic task types, unknown user intent |
| Description compression only | Moderate | None | Can be neutral or negative if descriptions lose discriminating detail | Low effort, incremental improvement |

Beyond schema volume, description quality matters for token efficiency. Tool descriptions bloated with examples and edge cases consume tokens that could go to retrieved context. The right description format for production: one sentence of purpose, parameter types, and one concrete example of the output format. Everything else is noise.
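A minimal sketch of description compression for inactive tools follows. It assumes a simplified schema shape (`name`, `description`, and a flat `parameters` map), which is not any provider's exact function-calling format, and `compress_tool_schema` is a hypothetical helper. It keeps the first sentence of the description and the parameter types, and drops everything else.

```python
import re
from typing import Dict


def compress_tool_schema(schema: Dict) -> Dict:
    """Keep the first sentence of the description and parameter types only.

    The sentence split is naive: abbreviations such as "e.g." followed by a
    space would end the sentence early.
    """
    description = schema["description"].strip()
    first_sentence = re.split(r"(?<=[.!?])\s+", description, maxsplit=1)[0]
    parameters = {
        name: {"type": spec["type"]}
        for name, spec in schema.get("parameters", {}).items()
    }
    return {
        "name": schema["name"],
        "description": first_sentence,
        "parameters": parameters,
    }


# Hypothetical verbose schema.
verbose = {
    "name": "lookup_order",
    "description": (
        "Search the order database by customer ID. Supports fuzzy matching "
        "on partial IDs. Example: lookup_order('C-1') returns recent orders."
    ),
    "parameters": {
        "customer_id": {"type": "string", "description": "Customer ID such as C-1"},
        "limit": {"type": "integer", "description": "Maximum rows to return"},
    },
}

compact = compress_tool_schema(verbose)
print(compact["description"])
print(compact["parameters"])
```

As the table above notes, this kind of compression can hurt routing if the dropped text carried the detail that distinguishes one tool from another, so it is worth checking against a routing evaluation before rollout.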

Diagram 3: End-to-end context assembly pipeline. Every agent turn passes through budget enforcement before reaching the LLM — no silent truncation.

Prompt Caching: The Cost Lever Many Teams Miss

Anthropic’s cache_control parameter and OpenAI’s implicit prefix caching are among the highest-impact optimizations available for production agents, yet adoption among teams we encounter is surprisingly low. The mechanism is straightforward: the provider caches the KV computation for a static prefix of the prompt, and subsequent requests that share that prefix receive discounted input-token handling.

For agents with a fixed system prompt and a static knowledge base injected at the top of every call, a large share of input tokens may be eligible for caching. At scale, this can turn prompt-prefix design into one of the highest-leverage cost controls in the system.

The rules for effective caching:

  • Static content first: System prompt and knowledge base content must come before dynamic content (conversation history, current user message). Cache breaks at the first token that differs between requests.
  • Stable ordering: Retrieved documents should be ordered deterministically (by document ID, not by score) when their content is stable across turns. Score-ordered results change order slightly between calls and break the cache prefix.
  • Minimum cache block size: Anthropic requires at least 1,024 tokens to cache a block on most models (some smaller models require more); OpenAI applies prefix caching only to prompts of 1,024 tokens or more. Attempting to cache smaller blocks wastes a cache_control marker with no benefit.
  • Session affinity: Route requests from the same session to the same backend endpoint when possible. Cache hits are per-endpoint; load balancing without sticky sessions destroys cache efficiency.
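The first two rules can be sketched as a prompt builder that separates a static prefix from a dynamic suffix and orders documents by ID. `build_prompt` and `prefix_fingerprint` are hypothetical helpers; the fingerprint is a local diagnostic for logging prefix stability, not a provider feature.

```python
import hashlib
from typing import Dict, List, Tuple


def build_prompt(
    system_prompt: str,
    docs: Dict[str, str],
    history: List[str],
    user_message: str,
) -> Tuple[str, str]:
    """Return (static_prefix, dynamic_suffix).

    Documents are ordered by ID, not retrieval score, so the prefix is
    byte-identical whenever the same documents are selected.
    """
    ordered_docs = [f"[{doc_id}]\n{docs[doc_id]}" for doc_id in sorted(docs)]
    static_prefix = "\n\n".join([system_prompt] + ordered_docs)
    dynamic_suffix = "\n".join(history + [user_message])
    return static_prefix, dynamic_suffix


def prefix_fingerprint(prefix: str) -> str:
    """Short hash of the prefix for spotting accidental changes in logs."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


# Same documents arriving in different score order on two turns.
turn_1_prefix, _ = build_prompt(
    "SYSTEM", {"doc-b": "Refund policy", "doc-a": "Shipping policy"}, [], "Hi"
)
turn_2_prefix, turn_2_suffix = build_prompt(
    "SYSTEM",
    {"doc-a": "Shipping policy", "doc-b": "Refund policy"},
    ["user: Hi", "assistant: Hello"],
    "Where is my order?",
)
print(prefix_fingerprint(turn_1_prefix) == prefix_fingerprint(turn_2_prefix))
```

Logging the fingerprint per request makes a broken cache prefix visible as a changing hash, before it shows up as a higher bill.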

Production Anti-Patterns and How to Detect Them

Context engineering failures are subtle. A request that exceeds the model's hard context limit is rejected by the API, but most overflow never gets that far: framework trimming and hand-rolled truncation silently drop content first, and the model produces outputs that look superficially correct but are missing key information. Here are the failure patterns we see repeatedly in production systems:

1. Unbounded scratchpad growth. ReAct-style agents that accumulate every intermediate tool output in the scratchpad across a long task can consume tens of thousands of tokens in tool outputs alone. Fix: prune the scratchpad at each step — keep only the final result of completed tool calls, not the full raw output. A very large API response should be summarized to the key facts before storing.

2. System prompt drift. Teams iterate on the system prompt and forget to measure its token cost. A system prompt that starts at 800 tokens grows to 4,000 tokens over six months of “just adding one more instruction.” Run a CI check that fails if count_tokens(system_prompt) > MAX_SYSTEM_TOKENS.

3. Tool output injection without truncation. A tool that queries a database returns 500 rows. The agent faithfully injects all 500 rows into the context. Fix: all tool outputs must pass through a tool output formatter that truncates to a configured token budget before returning to the agent loop.

4. Retrieval without position management. Retrieved documents are appended after the conversation history, placing them in the “lost in the middle” danger zone. Fix: inject retrieved context as a system-level message at the top of the conversation, before history.

5. Per-turn cost blindness. The team monitors total monthly inference cost but not per-turn context breakdown. A single misconfigured retrieval step doubling the average context size doubles the monthly bill — and no alert fires. Fix: emit per-turn metrics for each context layer (system_tokens, tool_tokens, retrieved_tokens, history_tokens) to your observability stack (LangSmith, Datadog, or custom OTel spans).
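Two of these fixes, the tool output formatter from item 3 and the per-layer metrics from item 5, are sketched below. The function names are hypothetical, and `approx_tokens` is a whitespace-based stand-in that should be replaced with a real tokenizer.

```python
import json
from typing import Callable, Dict, List


def approx_tokens(text: str) -> int:
    """Stand-in counter (assumption): whitespace-separated chunks."""
    return len(text.split())


def format_tool_output(
    rows: List[dict],
    budget: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> str:
    """Serialize rows in order until the budget is used, then note the rest.

    The trailing omission note is not counted against the budget.
    """
    kept: List[str] = []
    used = 0
    for row in rows:
        line = json.dumps(row, sort_keys=True)
        cost = count_tokens(line)
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    omitted = len(rows) - len(kept)
    if omitted:
        kept.append(f"[{omitted} more rows omitted]")
    return "\n".join(kept)


def layer_metrics(
    layers: Dict[str, str],
    count_tokens: Callable[[str], int] = approx_tokens,
) -> Dict[str, int]:
    """Per-turn token counts keyed as '<layer>_tokens' for a metrics backend."""
    return {f"{name}_tokens": count_tokens(text) for name, text in layers.items()}


rows = [{"id": i, "status": "ok"} for i in range(500)]
tool_output = format_tool_output(rows, budget=40)
print(tool_output.splitlines()[-1])
print(layer_metrics({"system": "be concise", "history": "hello there again"}))
```

The omission note tells the model that the result set was cut, which gives it a reason to request a narrower query instead of treating the visible rows as complete.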

Frequently Asked Questions

What is context engineering for AI agents?

Context engineering is the architectural discipline of designing, curating, and dynamically managing everything placed into an LLM’s context window at runtime. Unlike prompt engineering — which focuses on phrasing a single instruction — context engineering governs memory hierarchies, retrieval strategies, token budget allocation, and what information gets included or excluded per agent turn. It is the primary determinant of production agent reliability.

What is the difference between prompt engineering and context engineering?

Prompt engineering optimizes the wording of a single instruction or system message, typically in a static, hand-crafted way. Context engineering is a broader architectural discipline that dynamically controls the entire input the model sees at each step: conversation history, retrieved documents, tool schemas, agent scratchpad, and injected state. Prompt engineering is a subset of context engineering. For production agents handling multi-turn, multi-tool tasks, prompt engineering alone is insufficient.

How do you manage token budgets in production LLM agents?

Production token budget management requires explicit allocation of the available context window into named layers with hard limits: system instructions, tool definitions, retrieved context, conversation history, and agent scratchpad. Each layer has a maximum token quota enforced programmatically before the LLM call. When a layer exceeds its budget, deterministic compression strategies apply — summarization for history, top-k truncation for retrieved documents, schema stripping for inactive tools. Tiktoken (for OpenAI models) or the Anthropic token-counting API are standard tools for measuring consumption per layer.

What memory architecture should production AI agents use?

Production agents require a four-tier memory architecture: (1) in-context working memory for the current turn’s active data; (2) short-term session memory backed by Redis or a similar cache for conversation history within a session; (3) episodic long-term memory in a vector database (Pinecone, pgvector) for semantically retrieved past interactions and documents; (4) archival memory in a relational database for structured facts. Information flows upward into the context window on-demand via retrieval, never by default — this is what prevents token bloat at scale.

How does “lost in the middle” affect production AI agents?

The “lost in the middle” effect describes the empirically observed tendency of transformer models to underweight information placed in the middle of a long context window, relative to tokens at the beginning or end. For agents with 128k context windows stuffed with retrieved documents and tool outputs, critical facts buried in the middle may be effectively ignored. The production mitigation is position-aware context assembly: place high-priority information at the top (system context, task description) and the most recent turn’s data at the bottom, keeping the middle for lower-priority background.
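One way to sketch position-aware assembly is shown below. The edge-weighted ordering of background snippets (most relevant at both ends of the middle section, least relevant in the center) is an illustrative technique added here, not something the article prescribes, and both function names are hypothetical.

```python
from typing import List


def edge_weighted_order(snippets_by_relevance: List[str]) -> List[str]:
    """Reorder snippets (given most to least relevant) so the strongest
    ones sit at the start and end, and the weakest land in the center."""
    front: List[str] = []
    back: List[str] = []
    for index, snippet in enumerate(snippets_by_relevance):
        if index % 2 == 0:
            front.append(snippet)
        else:
            back.append(snippet)
    return front + back[::-1]


def assemble_context(
    system: str,
    task: str,
    snippets_by_relevance: List[str],
    current_turn: str,
) -> List[str]:
    """Top: system and task. Middle: background. Bottom: the current turn."""
    return [system, task] + edge_weighted_order(snippets_by_relevance) + [current_turn]


context = assemble_context("SYS", "TASK", ["A", "B", "C", "D", "E"], "USER")
print(context)
```

Here "A" is the most relevant snippet and "E" the least, so "E" ends up in the center of the background section while "A" and "B" sit next to the high-attention edges.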

The Decision Rule

If production agents behave inconsistently across sessions, hallucinate despite a reasonable RAG setup, or spend more as conversation history grows, inspect the context assembly path first: retrieval order, memory eviction, tool-output scope, and positional placement usually explain more than another prompt rewrite.

About the author

Igor Bobriakov

AI Architect. Author of Production-Ready AI Agents. 15 years deploying production AI platforms and agentic systems for enterprise clients and deep-tech startups.