LangChain Callback Architecture: Building Production Observability Without Third-Party Lock-In

2026-05-27 · 8 min read · Igor Bobriakov

Most LangChain observability setups lock you into a vendor’s callback handler. LangSmith’s LangChainTracer, Langfuse’s handler, Weights & Biases — each ships a pre-built handler that wires your chain’s events to their platform. The convenience is real, but the dependency is structural: your observability pipeline is now coupled to that vendor’s SDK version, their ingestion limits, their pricing tier, and their data retention policy.

The callback architecture itself does not require this. LangChain’s callback system is a general-purpose event hook interface. It knows nothing about LangSmith. It fires events — LLM start, LLM end, tool call, chain completion, error — and delivers the full payload to any handler you attach. You can implement a handler that writes to OpenTelemetry, to a Postgres table, to a local file, or to all three simultaneously. The vendor handlers are built on the same interface you have access to.

This post covers the callback architecture in enough detail to build a production-grade handler, how to integrate with OpenTelemetry without a vendor intermediary, what to trace and why, and the performance characteristics of callbacks in production.

How the Callback System Actually Works

LangChain’s callback system is implemented as a set of abstract base classes in langchain_core.callbacks. The core interface is BaseCallbackHandler, which defines methods for every event the framework can emit:

  • on_llm_start — fires before the LLM call, receives prompts and run metadata
  • on_llm_end — fires after the LLM call, receives LLMResult with generated text and any provider-reported token counts
  • on_llm_error — fires on LLM exception
  • on_chat_model_start — fires before a chat model call, receives message list
  • on_tool_start / on_tool_end / on_tool_error — tool-level hooks
  • on_chain_start / on_chain_end / on_chain_error — chain-level hooks
  • on_agent_action / on_agent_finish — agent-level hooks

Each method receives a run_id (a UUID unique to this invocation) and a parent_run_id (the UUID of the parent chain or agent, if any). This parent-child relationship is how you reconstruct execution trees: an agent run spawns a chain, the chain spawns an LLM call, each has a parent_run_id pointing up the tree.

Callbacks are attached in two ways. The first is globally via a CallbackManager — every chain invocation in the process picks up these handlers automatically. The second is per-invocation via the callbacks parameter on .invoke(), .stream(), or .astream() — useful for request-scoped handlers that carry per-request context like a user ID or request trace ID.

Principle: The parent_run_id chain is your distributed trace. Every event in a LangChain execution carries the UUID of its parent, which means you can reconstruct the full execution tree — agent → chain → tool → LLM — from a flat stream of events. This is the same model OpenTelemetry uses with span parent relationships. The callback system gives you the raw material; your handler decides how to store and query it.
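To make the reconstruction concrete, here is a minimal sketch that rebuilds an execution tree from a flat list of events. The event tuples and the build_tree and render helpers are hypothetical illustrations, not part of LangChain; real hooks deliver UUIDs rather than short strings.

```python
from collections import defaultdict

# Hypothetical flat event stream: (run_id, parent_run_id, name).
# Real handlers receive UUIDs; short strings keep the example readable.
events = [
    ("r1", None, "agent"),
    ("r2", "r1", "chain"),
    ("r3", "r2", "tool.search"),
    ("r4", "r2", "llm.call"),
]


def build_tree(events):
    """Group runs under their parent and return (roots, children, names)."""
    children = defaultdict(list)
    names = {}
    roots = []
    for run_id, parent_run_id, name in events:
        names[run_id] = name
        if parent_run_id is None:
            roots.append(run_id)
        else:
            children[parent_run_id].append(run_id)
    return roots, children, names


def render(run_id, children, names, depth=0):
    """Return the subtree under run_id as indented lines, depth-first."""
    lines = ["  " * depth + names[run_id]]
    for child in children.get(run_id, []):
        lines.extend(render(child, children, names, depth + 1))
    return lines


roots, children, names = build_tree(events)
tree_lines = [line for root in roots for line in render(root, children, names)]
print("\n".join(tree_lines))
```

The same grouping works as a query in a trace store: index events by parent_run_id and walk down from the runs that have no parent.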

Comparing Observability Approaches

Before building, the right architecture depends on your constraints. The table below covers the four realistic options:

| Approach | Setup Cost | Vendor Dependency | Data Ownership | Query Flexibility | Best For |
|---|---|---|---|---|---|
| LangSmith | Low: SDK + API key | High: SDK version, pricing tier, retention limits | None: data lives on LangSmith servers | LangSmith UI only | Prototype / early development |
| Custom OpenTelemetry | Medium: handler + collector config | None: OTel is a CNCF standard | Full: you own the backend | Any OTel-compatible backend (Jaeger, Tempo, Honeycomb, ClickHouse) | Production systems with existing infra |
| Custom Logging | Low: structured log handler | None | Full | Limited to log query tools unless you parse and re-index | Simple systems, early production |
| Hybrid (OTel + local log) | Medium-high | None for OTel; log backend of your choice | Full | High: spans in OTel backend, full payloads in log store | Systems with compliance requirements or PII in prompts |

For most production systems beyond early prototyping, the custom OpenTelemetry path gives you the best combination of flexibility, portability, and no structural vendor dependency.

Building the Custom Callback Handler

The handler below integrates with OpenTelemetry using the standard opentelemetry-sdk package. It uses Pydantic for trace configuration, which keeps the handler testable and separates concerns cleanly.

from __future__ import annotations

import time
from typing import Any, Optional, Union
from uuid import UUID

from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult
from opentelemetry import trace
from opentelemetry.context import Context
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.trace import Span, Status, StatusCode
from pydantic import BaseModel, Field


class TraceConfig(BaseModel):
    """Configuration for what to capture in each span."""

    capture_prompts: bool = Field(
        default=True,
        description="Include full prompt text in span attributes.",
    )
    capture_completions: bool = Field(
        default=True,
        description="Include full completion text in span attributes.",
    )
    capture_token_usage: bool = Field(
        default=True,
        description="Record prompt_tokens, completion_tokens, total_tokens.",
    )
    max_prompt_chars: int = Field(
        default=4000,
        description="Truncate prompt and completion attributes beyond this length.",
    )
    prompt_version: Optional[str] = Field(
        default=None,
        description="Prompt version identifier, e.g. 'v1.3.2'. Attached to every LLM span.",
    )
    service_name: str = Field(
        default="langchain-agent",
        description="OTel service.name attribute.",
    )
    otlp_endpoint: str = Field(
        default="http://localhost:4317",
        description="OTLP gRPC collector endpoint.",
    )


def build_tracer_provider(config: TraceConfig) -> TracerProvider:
    """Configure OTel SDK with a BatchSpanProcessor (non-blocking export)."""
    exporter = OTLPSpanExporter(endpoint=config.otlp_endpoint)
    # service.name is a resource attribute; backends use it to label the service.
    resource = Resource.create({"service.name": config.service_name})
    provider = TracerProvider(resource=resource)
    # BatchSpanProcessor exports in a background thread, which is critical for latency.
    provider.add_span_processor(BatchSpanProcessor(exporter))
    return provider


class OTelCallbackHandler(BaseCallbackHandler):
    """
    Custom LangChain callback handler that writes execution traces
    to an OpenTelemetry-compatible backend via OTLP.

    Usage:
        config = TraceConfig(prompt_version="v2.1.0", capture_prompts=False)
        handler = OTelCallbackHandler(config)
        chain.invoke({"query": "..."}, config={"callbacks": [handler]})
    """

    def __init__(
        self,
        config: TraceConfig,
        tracer_provider: Optional[TracerProvider] = None,
    ) -> None:
        self.config = config
        # Each provider owns an exporter and a background export thread, so pass a
        # shared provider when creating more than one handler in a process.
        self._provider = tracer_provider or build_tracer_provider(config)
        # Take the tracer from this provider directly: the global provider can only
        # be set once per process, so relying on it breaks for later handlers.
        self._tracer = self._provider.get_tracer(config.service_name)
        # Maps run_id → OTel span, so we can end the span in the matching hook.
        self._spans: dict[UUID, Span] = {}
        self._start_times: dict[UUID, float] = {}

    # -------------------------------------------------------------------------
    # LLM hooks
    # -------------------------------------------------------------------------
    def on_llm_start(
        self,
        serialized: dict[str, Any],
        prompts: list[str],
        *,
        run_id: UUID,
        parent_run_id: Optional[UUID] = None,
        **kwargs: Any,
    ) -> None:
        serialized = serialized or {}  # guard: some runnables pass None here
        span = self._tracer.start_span(
            name="llm.call",
            context=self._parent_context(parent_run_id),
            attributes=self._base_attrs(serialized, parent_run_id),
        )
        if self.config.capture_prompts and prompts:
            # Only the first prompt of a batch is recorded.
            prompt_text = prompts[0][: self.config.max_prompt_chars]
            span.set_attribute("llm.prompt", prompt_text)
        if self.config.prompt_version:
            span.set_attribute("llm.prompt_version", self.config.prompt_version)
        model_name = (
            serialized.get("kwargs", {}).get("model_name")
            or serialized.get("id", ["unknown"])[-1]
        )
        span.set_attribute("llm.model", model_name)
        self._spans[run_id] = span
        self._start_times[run_id] = time.perf_counter()

    def on_llm_end(
        self,
        response: LLMResult,
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span is None:
            return
        elapsed_ms = (time.perf_counter() - self._start_times.pop(run_id, 0)) * 1000
        span.set_attribute("llm.latency_ms", round(elapsed_ms, 2))
        if self.config.capture_token_usage and response.llm_output:
            usage = response.llm_output.get("token_usage", {})
            span.set_attribute("llm.tokens.prompt", usage.get("prompt_tokens", 0))
            span.set_attribute("llm.tokens.completion", usage.get("completion_tokens", 0))
            span.set_attribute("llm.tokens.total", usage.get("total_tokens", 0))
        if (
            self.config.capture_completions
            and response.generations
            and response.generations[0]
        ):
            text = response.generations[0][0].text[: self.config.max_prompt_chars]
            span.set_attribute("llm.completion", text)
        span.set_status(Status(StatusCode.OK))
        span.end()

    def on_llm_error(
        self,
        error: Union[Exception, KeyboardInterrupt],
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            span.set_status(Status(StatusCode.ERROR, str(error)))
            span.record_exception(error)
            span.end()
        self._start_times.pop(run_id, None)

    # -------------------------------------------------------------------------
    # Tool hooks
    # -------------------------------------------------------------------------
    def on_tool_start(
        self,
        serialized: dict[str, Any],
        input_str: str,
        *,
        run_id: UUID,
        parent_run_id: Optional[UUID] = None,
        **kwargs: Any,
    ) -> None:
        serialized = serialized or {}
        span = self._tracer.start_span(
            name=f"tool.{serialized.get('name', 'unknown')}",
            context=self._parent_context(parent_run_id),
            attributes={
                **self._base_attrs(serialized, parent_run_id),
                "tool.input": input_str[:500],
            },
        )
        self._spans[run_id] = span
        self._start_times[run_id] = time.perf_counter()

    def on_tool_end(
        self,
        output: str,
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            elapsed_ms = (time.perf_counter() - self._start_times.pop(run_id, 0)) * 1000
            span.set_attribute("tool.latency_ms", round(elapsed_ms, 2))
            span.set_attribute("tool.output", str(output)[:500])
            span.set_status(Status(StatusCode.OK))
            span.end()

    def on_tool_error(
        self,
        error: Union[Exception, KeyboardInterrupt],
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            span.set_status(Status(StatusCode.ERROR, str(error)))
            span.record_exception(error)
            span.end()
        self._start_times.pop(run_id, None)

    # -------------------------------------------------------------------------
    # Chain hooks
    # -------------------------------------------------------------------------
    def on_chain_start(
        self,
        serialized: dict[str, Any],
        inputs: dict[str, Any],
        *,
        run_id: UUID,
        parent_run_id: Optional[UUID] = None,
        **kwargs: Any,
    ) -> None:
        serialized = serialized or {}
        span = self._tracer.start_span(
            name=f"chain.{serialized.get('id', ['unknown'])[-1]}",
            context=self._parent_context(parent_run_id),
            attributes=self._base_attrs(serialized, parent_run_id),
        )
        self._spans[run_id] = span
        self._start_times[run_id] = time.perf_counter()

    def on_chain_end(
        self,
        outputs: dict[str, Any],
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            elapsed_ms = (time.perf_counter() - self._start_times.pop(run_id, 0)) * 1000
            span.set_attribute("chain.latency_ms", round(elapsed_ms, 2))
            span.set_status(Status(StatusCode.OK))
            span.end()

    def on_chain_error(
        self,
        error: Union[Exception, KeyboardInterrupt],
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            span.set_status(Status(StatusCode.ERROR, str(error)))
            span.record_exception(error)
            span.end()
        self._start_times.pop(run_id, None)

    # -------------------------------------------------------------------------
    # Helpers
    # -------------------------------------------------------------------------
    def _parent_context(self, parent_run_id: Optional[UUID]) -> Optional[Context]:
        """Return an OTel context rooted at the parent run's span, if it is still open.

        Without this, every span would start its own trace and the backend
        could not display the run tree.
        """
        parent_span = self._spans.get(parent_run_id) if parent_run_id else None
        if parent_span is None:
            return None
        return trace.set_span_in_context(parent_span)

    def _base_attrs(
        self, serialized: dict[str, Any], parent_run_id: Optional[UUID]
    ) -> dict[str, Any]:
        attrs: dict[str, Any] = {
            "service.name": self.config.service_name,
        }
        if parent_run_id:
            attrs["langchain.parent_run_id"] = str(parent_run_id)
        return attrs

A few decisions in this implementation deserve explanation. The BatchSpanProcessor is non-negotiable for production: it buffers spans and exports them in a background thread, so the callback method returns immediately without waiting on network I/O. The synchronous SimpleSpanProcessor would add collector round-trip latency to every LLM call — that is the primary performance footgun in callback handler implementations.

The _spans dict maps run_id to an open span. This is the bridge between the start hook (where the span is opened) and the end hook (where it is closed). The pattern is identical across LLM, tool, and chain pairs. Keep this dict and keep it clean: always pop on end or error, never leave spans open.
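One way to enforce the "never leave spans open" rule is to wrap the dict in a small registry that records when each run was opened and can evict entries that never received an end or error event. The sketch below is a hypothetical helper, not part of LangChain or OpenTelemetry; OpenRunRegistry, max_age_s, and the 300-second default are assumptions for illustration.

```python
import time
from typing import Any, Optional


class OpenRunRegistry:
    """Tracks open runs and evicts any left open longer than max_age_s."""

    def __init__(self, max_age_s: float = 300.0) -> None:
        self.max_age_s = max_age_s
        # run_id -> (span, time the run was opened)
        self._runs: dict[Any, tuple[Any, float]] = {}

    def open(self, run_id: Any, span: Any, now: Optional[float] = None) -> None:
        opened_at = time.monotonic() if now is None else now
        self._runs[run_id] = (span, opened_at)

    def close(self, run_id: Any) -> Any:
        """Pop and return the span for run_id, or None if it is unknown."""
        entry = self._runs.pop(run_id, None)
        return entry[0] if entry else None

    def evict_stale(self, now: Optional[float] = None) -> list:
        """Remove runs older than max_age_s and return their spans."""
        now = time.monotonic() if now is None else now
        stale = [rid for rid, (_, t) in self._runs.items() if now - t > self.max_age_s]
        return [self.close(rid) for rid in stale]

    def __len__(self) -> int:
        return len(self._runs)


# Explicit timestamps make the behaviour deterministic for the demo.
registry = OpenRunRegistry(max_age_s=300.0)
registry.open("run-a", "span-a", now=0.0)
registry.open("run-b", "span-b", now=100.0)
evicted = registry.evict_stale(now=350.0)
print(evicted, len(registry))
```

A handler using this would end each evicted span with an error status before discarding it, and could export len(registry) as a gauge to watch for leaks.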

What to Trace and Why

Not everything is worth capturing. The following signals have direct diagnostic or economic value in production:

Token usage per LLM call. This is the primary cost signal. Without it, you cannot attribute spend to specific chains, users, or prompt versions. When the model wrapper surfaces usage data, LLMResult.llm_output["token_usage"] is where you typically read prompt tokens, completion tokens, and total tokens. Aggregate by chain name, model, and prompt version. For the full cost attribution architecture beyond individual spans, see LLM cost observability: token-level attribution and spend patterns that signal trouble.
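As a rough illustration of that aggregation, the sketch below groups exported LLM span records by chain, model, and prompt version. The span dictionaries, the model name, and the prices in PRICE_PER_1K are hypothetical; substitute the attributes your backend returns and your provider's actual rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens; replace with your provider's rates.
PRICE_PER_1K = {"example-model": {"prompt": 0.5, "completion": 1.5}}

# Hypothetical LLM span records as they might come back from a trace query.
llm_spans = [
    {"chain": "qa", "model": "example-model", "prompt_version": "v2.1.0",
     "prompt_tokens": 1000, "completion_tokens": 200},
    {"chain": "qa", "model": "example-model", "prompt_version": "v2.1.0",
     "prompt_tokens": 500, "completion_tokens": 100},
    {"chain": "qa", "model": "example-model", "prompt_version": "v2.2.0",
     "prompt_tokens": 2000, "completion_tokens": 400},
]


def aggregate_usage(spans):
    """Sum tokens and estimated cost per (chain, model, prompt_version)."""
    totals = defaultdict(
        lambda: {"prompt_tokens": 0, "completion_tokens": 0, "cost_usd": 0.0}
    )
    for span in spans:
        key = (span["chain"], span["model"], span["prompt_version"])
        price = PRICE_PER_1K[span["model"]]
        row = totals[key]
        row["prompt_tokens"] += span["prompt_tokens"]
        row["completion_tokens"] += span["completion_tokens"]
        row["cost_usd"] += (
            span["prompt_tokens"] / 1000 * price["prompt"]
            + span["completion_tokens"] / 1000 * price["completion"]
        )
    return dict(totals)


usage = aggregate_usage(llm_spans)
for key, row in sorted(usage.items()):
    print(key, row["prompt_tokens"], row["completion_tokens"], round(row["cost_usd"], 4))
```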

Latency at each layer. The callback system gives you start and end events at three levels: LLM, tool, and chain. Measure latency at all three. A slow chain with fast individual LLM calls indicates tool call overhead or inter-step processing. A slow LLM call with normal completion token count indicates model API latency, not your code.
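The comparison between layers can be sketched as simple arithmetic: sum the child latencies per layer and subtract them from the chain's own latency. The numbers and the latency_breakdown helper below are hypothetical, and the sketch assumes the child steps ran sequentially.

```python
from collections import defaultdict

# Hypothetical span records for one chain run; latency_ms mirrors the
# llm.latency_ms / tool.latency_ms attributes a handler would set.
chain_latency_ms = 4200.0
child_spans = [
    {"name": "llm.call", "latency_ms": 900.0},
    {"name": "tool.search", "latency_ms": 2500.0},
    {"name": "llm.call", "latency_ms": 600.0},
]


def latency_breakdown(chain_ms, children):
    """Sum child latency by layer and report what the chain spent outside them.

    Assumes children ran sequentially; with parallel steps the overhead
    figure is clamped at zero rather than reported as negative.
    """
    by_layer = defaultdict(float)
    for span in children:
        layer = span["name"].split(".", 1)[0]  # "llm" or "tool"
        by_layer[layer] += span["latency_ms"]
    overhead = max(chain_ms - sum(by_layer.values()), 0.0)
    return {**by_layer, "overhead": overhead}


breakdown = latency_breakdown(chain_latency_ms, child_spans)
print(breakdown)
```

In this made-up run the tool layer dominates, which points at the tool rather than the model API.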

Prompt version. When you change a prompt template, you need to know whether the change improved or degraded output quality. Attaching a version string to every LLM span, either passed through the metadata field of the run config or hardcoded in the handler config, lets you filter traces by version in any OTel-compatible backend without a dedicated prompt registry. For the full treatment of prompt drift detection and the metrics that predict quality degradation before it reaches users, see prompt observability: versioning, drift detection, and the metrics that matter.
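A minimal sketch of carrying the version per invocation, assuming LangChain forwards the config's metadata dict to each hook as kwargs["metadata"]. PromptVersion, as_metadata, prompt_version_attrs, and the prompt name are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class PromptVersion:
    """Hypothetical value object identifying one revision of a prompt template."""

    name: str
    version: str

    def as_metadata(self) -> dict[str, str]:
        return {"prompt_name": self.name, "prompt_version": self.version}


def prompt_version_attrs(hook_kwargs: dict[str, Any]) -> dict[str, str]:
    """Build span attributes from the kwargs a hook such as on_llm_start receives."""
    metadata = hook_kwargs.get("metadata") or {}
    attrs: dict[str, str] = {}
    if "prompt_name" in metadata:
        attrs["llm.prompt_name"] = str(metadata["prompt_name"])
    if "prompt_version" in metadata:
        attrs["llm.prompt_version"] = str(metadata["prompt_version"])
    return attrs


SUMMARIZE_V2 = PromptVersion(name="summarize", version="v2.1.0")

# The caller would pass this dict as the config argument to chain.invoke().
run_config = {"metadata": SUMMARIZE_V2.as_metadata()}

# Inside the hook, kwargs carries the same metadata dict.
attrs = prompt_version_attrs({"metadata": run_config["metadata"]})
print(attrs)
```

Reading the version from metadata rather than from the handler config lets one handler serve chains that use different prompt versions.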

Chain execution path. For agent systems, the parent_run_id chain reconstructs which chains were executed, in what order, and how long each took. This is the trace you need when a user reports an unexpected response and you need to understand the execution path that produced it. The AI observability data model covers what queries this data supports.

Tool call payloads (with caution). Tool inputs and outputs are high-value debugging data, but they frequently contain PII or sensitive business data. The implementation above truncates at 500 characters. In regulated environments, consider hashing or redacting before attaching to spans — the callback gives you the full string, but you control what goes into the span attribute.
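A minimal sketch of redacting before attaching, assuming email addresses are the only pattern of concern. The regex, the attribute names, and the safe_tool_attrs helper are hypothetical, and a real deployment would need patterns matched to its own data.

```python
import hashlib
import re

# Hypothetical pattern: catches common email shapes only, not every form of PII.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def fingerprint(text: str) -> str:
    """Short SHA-256 prefix so identical payloads can be correlated without storing them."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def safe_tool_attrs(prefix: str, payload: str, max_chars: int = 500) -> dict[str, str]:
    """Build span attributes from a tool payload: redact first, then truncate."""
    return {
        f"{prefix}.fingerprint": fingerprint(payload),
        f"{prefix}.redacted": redact_emails(payload)[:max_chars],
    }


attrs = safe_tool_attrs("tool.input", "contact alice@example.com now")
print(attrs["tool.input.redacted"])
```

Redaction runs before truncation so a cut cannot leave half an address behind. An unsalted hash of a short or guessable payload can be reversed by brute force, so treat the fingerprint as a correlation aid rather than as anonymization.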

Warning: Do not capture full prompt text in spans that export to a third-party OTel collector without auditing the prompt content first. Prompts often contain user-supplied data, retrieved documents, or intermediate agent state that may include PII. Setting capture_prompts=False in TraceConfig disables prompt capture entirely, and capture_completions=False does the same for completions. For compliance environments, set both to False and log prompts separately to a store with appropriate access controls.

Attaching Per-Request Context

Global handlers capture all invocations, but production systems need per-request context: user ID, session ID, tenant ID, request source. Because each OTelCallbackHandler above builds its own exporter and background export thread, construct it once at startup, then pass it via the callbacks parameter together with request-scoped metadata:

from typing import Any

from langchain_core.runnables.config import RunnableConfig

# Build the handler once at startup. Constructing one per request would
# create a new exporter and export thread on every call.
trace_config = TraceConfig(
    prompt_version="v2.1.0",
    service_name="langchain-agent",
    otlp_endpoint="http://otel-collector:4317",
)
handler = OTelCallbackHandler(trace_config)


def handle_request(user_id: str, query: str, chain) -> Any:
    # Request-scoped context travels in metadata, which LangChain forwards
    # to every callback hook as kwargs["metadata"].
    return chain.invoke(
        {"query": query},
        config=RunnableConfig(callbacks=[handler], metadata={"user_id": user_id}),
    )

The metadata dict on RunnableConfig is passed through to callback hooks via kwargs["metadata"]. You can read it inside on_chain_start or on_llm_start and attach it to the span. This keeps user context in the trace without requiring a global context variable.
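A minimal sketch of that read, written as a pure function so it can be tested without LangChain. ALLOWED_KEYS, the request. attribute prefix, and request_attrs are assumptions for illustration; an allow-list keeps unexpected metadata keys out of the trace.

```python
from typing import Any

# Hypothetical allow-list: only these metadata keys are copied onto spans.
ALLOWED_KEYS = ("user_id", "session_id", "tenant_id", "request_source")


def request_attrs(hook_kwargs: dict[str, Any]) -> dict[str, str]:
    """Turn allow-listed request metadata into string span attributes."""
    metadata = hook_kwargs.get("metadata") or {}
    return {
        f"request.{key}": str(metadata[key])
        for key in ALLOWED_KEYS
        if metadata.get(key) is not None
    }


# Inside a hook such as on_chain_start, after the span is created:
#     span.set_attributes(request_attrs(kwargs))
example = request_attrs(
    {"metadata": {"user_id": 42, "email": "someone@example.com", "tenant_id": None}}
)
print(example)
```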

Avoiding Vendor Lock-In in Practice

The handler above has zero vendor-specific imports. It depends on opentelemetry-sdk and opentelemetry-exporter-otlp, both maintained by the OpenTelemetry project under the CNCF. The exporter endpoint is configuration: you point it at Jaeger, Grafana Tempo, Honeycomb, or a collector that writes to your own ClickHouse table by changing a URL. Moving from one compatible backend to another should be a configuration change rather than a handler rewrite.

Contrast this with LangChainTracer from the LangSmith SDK: it imports from langsmith, writes to LangSmith’s ingestion endpoint, uses LangSmith’s trace schema, and requires a LangSmith API key. If LangSmith changes their SDK interface, raises prices, or has an outage, your observability pipeline breaks. The LangSmith SDK version pins become a dependency constraint across your team.

This is not an argument against LangSmith during development. The UI is genuinely useful for interactive debugging. The argument is that LangSmith should be an optional consumer of your trace data, not the only store. The pattern: emit OTel spans from your custom handler to a collector, configure a forwarder or exporter that also sends to LangSmith if needed. You get both the vendor’s UI and full data ownership.

The surviving LangChain version upgrades post documents how vendor SDK dependencies compound during framework upgrades — the callback interface itself is stable across versions, but third-party handler implementations frequently break on minor LangChain releases.

Performance Impact of Callbacks

The callback invocation itself is an in-process Python method call. What matters is the handler body. The three patterns that add meaningful latency:

  1. Synchronous network calls in the handler. Calling a remote API, flushing to a database, or pushing to a queue synchronously in the handler body adds that call’s latency to the chain’s critical path. Use BatchSpanProcessor for OTel, or push to an in-process queue and drain it in a background thread.

  2. String operations on large payloads. If you are logging 32K-token prompts, the string operations inside the handler can be measurable. The max_prompt_chars truncation in TraceConfig bounds this by capping how much text is copied into span attributes. The context window economics post covers the cost side of large prompts at scale.

  3. Span dict operations on high-frequency chains. If a chain invokes hundreds of sub-chains in rapid succession (e.g., batch processing), the _spans dict grows and shrinks rapidly. The dict operations are O(1) but at sufficient volume the garbage collector pressure is measurable. In extreme cases, profile before assuming the handler is the bottleneck.
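The in-process queue pattern from the first item can be sketched with the standard library alone. BackgroundSink and its write callable are hypothetical; the sketch drops records when the queue is full instead of blocking the chain, which is one possible policy rather than the only one.

```python
import queue
import threading
from typing import Any, Callable


class BackgroundSink:
    """Accepts records without blocking and writes them from a worker thread."""

    def __init__(self, write: Callable[[Any], None], maxsize: int = 10_000) -> None:
        self._write = write
        self._queue: queue.Queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # records rejected because the queue was full
        self.failed = 0   # records whose write raised
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def submit(self, record: Any) -> None:
        """Called from the handler body: never blocks the chain."""
        try:
            self._queue.put_nowait(record)
        except queue.Full:
            self.dropped += 1

    def _drain(self) -> None:
        while True:
            record = self._queue.get()
            if record is None:  # sentinel sent by close()
                return
            try:
                self._write(record)
            except Exception:
                # Keep draining: one bad record must not stop the worker.
                self.failed += 1

    def close(self) -> None:
        """Flush remaining records and stop the worker."""
        self._queue.put(None)
        self._thread.join()


written: list = []
sink = BackgroundSink(written.append)
for i in range(3):
    sink.submit({"event": "llm_end", "seq": i})
sink.close()
print(written)
```

In a real handler, write would be the slow call (a database insert or HTTP post), and close would run at process shutdown so queued records are not lost.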

In practice, with BatchSpanProcessor and prompt truncation configured, the callback work is usually limited to local method execution and a queue append. For most agent architectures, remote model latency, tool latency, and payload size dominate the callback overhead.

For debugging failures in agent systems where the trace alone is insufficient, the debugging CrewAI agent failures post covers complementary techniques for reconstructing what happened when the trace is incomplete.

Checklist Before Putting This in Production

  • Verify BatchSpanProcessor is configured, and confirm no SimpleSpanProcessor is registered on the TracerProvider
  • Set capture_prompts=False, or confirm truncation via max_prompt_chars is configured, before enabling prompt logging in production
  • Confirm prompt version strings are attached to all LLM spans — verify by querying for spans without the llm.prompt_version attribute
  • Test error paths: introduce a deliberate LLM error and verify the span closes with ERROR status and the exception is recorded
  • Confirm the _spans dict does not grow unboundedly — add monitoring on its length in high-throughput systems
  • Validate parent_run_id linkage by running a multi-step chain and confirming the trace reconstructs correctly in your OTel backend
  • Review what data flows through tool spans for PII before enabling tool output logging in environments with user-supplied data

What This Gives You That Vendor Handlers Do Not

A custom handler gives you three things a vendor handler cannot:

First, full control over the data pipeline. You decide what gets captured, what gets redacted, where it goes, and how long it is retained. If your data governance policy requires that prompts containing user data never leave your VPC, you enforce that in the handler body.

Second, composability. You can run multiple handlers simultaneously — one that writes to OTel, one that writes latency metrics to Prometheus, one that publishes to an internal audit log. Vendor handlers are designed for one destination.

Third, a cleaner upgrade boundary across the LangChain release cycle. The BaseCallbackHandler interface is a narrower dependency than a full vendor handler SDK. You still need version-pin tests around callback signatures, but your instrumentation layer is easier to adapt when it only depends on the framework hooks you use directly.

The callback architecture is the right layer to instrument. It fires on the semantically meaningful events in LangChain’s execution model — not at the HTTP request level, not at the Python function call level, but at the points where an LLM was called, a tool was used, a chain completed. That is the data model you need to answer the questions that matter in production.


What is the LangChain callback system and how does it work?

LangChain's callback system is an event hook architecture that fires on every significant event in a chain or agent run: LLM start, LLM end, tool start, tool end, chain start, chain end, and on errors at each level. Callback handlers implement these hooks as Python methods, and LangChain invokes them automatically. You can attach handlers globally (via CallbackManager) or per-run (via the callbacks parameter). Multiple handlers can be active simultaneously, so you can ship to OpenTelemetry and a local log store at the same time.

Does building a custom callback handler meaningfully impact production latency?

It depends on what the handler does. The callback invocation itself is an in-process Python method call; the meaningful cost usually comes from the handler body. If you synchronously flush a span to a remote collector on every LLM end event, you add that network round-trip to the chain's critical path. The correct pattern is to use background threads or async exporters so the handler returns immediately. With a BatchSpanProcessor configured correctly, the overhead is usually limited to local method work and a queue append rather than remote I/O.

Can LangChain callbacks capture prompt content, or only metadata?

Callback hooks expose the payloads LangChain passes into each event. The on_llm_start hook receives the list of prompts sent to the model. The on_llm_end hook receives the LLMResult object, which can include generated text, provider-reported token usage, and model metadata. The on_chat_model_start hook receives the list of message objects. You can log, hash, truncate, or redact this content inside the handler, subject to what each model wrapper and provider response makes available.

How do you version prompts in a LangChain callback system without a third-party platform?

Attach prompt version metadata as a tag or attribute on the span. The standard pattern is to define a PromptVersion enum or dataclass in your codebase, pass the version identifier through the metadata dict of the run config when invoking the chain, and read it inside on_llm_start to attach it to the span. When you change a prompt template, increment the version. This lets you filter traces by prompt version in any OTel-compatible backend (Jaeger, Grafana Tempo, Honeycomb, your own ClickHouse table) without depending on a vendor's prompt registry.


The decision rule

If you are building LangChain-based agents for production, instrument custom tracing, token cost attribution, prompt versioning, and vendor-independent trace storage before the system becomes hard to inspect. The Enterprise Agentic Assessment Kit includes an observability instrumentation checklist alongside production-readiness criteria for callback architecture.

About the author

Igor Bobriakov

AI Architect. Author of Production-Ready AI Agents. 15 years deploying production AI platforms and agentic systems for enterprise clients and deep-tech startups.