Most LangChain observability setups lock you into a vendor’s callback handler. LangSmith’s LangChainTracer, Langfuse’s handler, Weights & Biases — each ships a pre-built handler that wires your chain’s events to their platform. The convenience is real, but the dependency is structural: your observability pipeline is now coupled to that vendor’s SDK version, their ingestion limits, their pricing tier, and their data retention policy.
The callback architecture itself does not require this. LangChain’s callback system is a general-purpose event hook interface. It knows nothing about LangSmith. It fires events — LLM start, LLM end, tool call, chain completion, error — and delivers the full payload to any handler you attach. You can implement a handler that writes to OpenTelemetry, to a Postgres table, to a local file, or to all three simultaneously. The vendor handlers are built on the same interface you have access to.
This post covers the callback architecture in enough detail to build a production-grade handler, how to integrate with OpenTelemetry without a vendor intermediary, what to trace and why, and the performance characteristics of callbacks in production.
How the Callback System Actually Works
LangChain’s callback system is implemented as a set of abstract base classes in langchain_core.callbacks. The core interface is BaseCallbackHandler, which defines methods for every event the framework can emit:
- on_llm_start: fires before the LLM call; receives the prompts and run metadata
- on_llm_end: fires after the LLM call; receives an LLMResult with the generated text and any provider-reported token counts
- on_llm_error: fires on an LLM exception
- on_chat_model_start: fires before a chat model call; receives the message list
- on_tool_start / on_tool_end / on_tool_error: tool-level hooks
- on_chain_start / on_chain_end / on_chain_error: chain-level hooks
- on_agent_action / on_agent_finish: agent-level hooks
Each method receives a run_id (a UUID unique to this invocation) and a parent_run_id (the UUID of the parent chain or agent, if any). This parent-child relationship is how you reconstruct execution trees: an agent run spawns a chain, the chain spawns an LLM call, each has a parent_run_id pointing up the tree.
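To make the tree reconstruction concrete, here is a minimal sketch that rebuilds an execution tree from logged (run_id, parent_run_id) pairs. It uses only the standard library; RunEvent, build_tree, and render are hypothetical names chosen for illustration, not part of LangChain.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class RunEvent:
    """Hypothetical record of one start event, as a handler might log it."""
    run_id: UUID
    parent_run_id: Optional[UUID]
    name: str


def build_tree(events: list[RunEvent]) -> dict[Optional[UUID], list[RunEvent]]:
    """Group events by parent_run_id; root runs are stored under the key None."""
    children: dict[Optional[UUID], list[RunEvent]] = defaultdict(list)
    for event in events:
        children[event.parent_run_id].append(event)
    return children


def render(children, parent: Optional[UUID] = None, depth: int = 0) -> list[str]:
    """Depth-first walk that returns one indented line per run."""
    lines: list[str] = []
    for event in children.get(parent, []):
        lines.append("  " * depth + event.name)
        lines.extend(render(children, event.run_id, depth + 1))
    return lines


agent_id, chain_id, llm_id = uuid4(), uuid4(), uuid4()
events = [
    RunEvent(agent_id, None, "agent"),
    RunEvent(chain_id, agent_id, "chain"),
    RunEvent(llm_id, chain_id, "llm.call"),
]
tree_lines = render(build_tree(events))
print("\n".join(tree_lines))
```

The same grouping works on rows pulled from a log store or span backend, provided each row carries both IDs.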
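Callbacks are attached in two ways. The first is at construction time, by passing handlers to the callbacks argument of a model, tool, or chain. This is the closest thing to a global attachment: every call made through that object picks up the handlers automatically, although they are scoped to that object and are not inherited by its children. The second is per-invocation, via the callbacks entry in the config passed to .invoke(), .stream(), or .astream(). Handlers attached this way propagate to child runs, which makes them useful for request-scoped handlers that carry per-request context like a user ID or request trace ID.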
Comparing Observability Approaches
Before building anything, pick the architecture that fits your constraints. The table below covers the four realistic options:
| Approach | Setup Cost | Vendor Dependency | Data Ownership | Query Flexibility | Best For |
|---|---|---|---|---|---|
| LangSmith | Low — SDK + API key | High — SDK version, pricing tier, retention limits | None — data lives on LangSmith servers | LangSmith UI only | Prototype / early development |
| Custom OpenTelemetry | Medium — handler + collector config | None — OTel is a CNCF standard | Full — you own the backend | Any OTel-compatible backend (Jaeger, Tempo, Honeycomb, ClickHouse) | Production systems with existing infra |
| Custom Logging | Low — structured log handler | None | Full | Limited to log query tools unless you parse and re-index | Simple systems, early production |
| Hybrid (OTel + local log) | Medium-high | None for OTel; log backend of your choice | Full | High — spans in OTel backend, full payloads in log store | Systems with compliance requirements or PII in prompts |
For most production systems beyond early prototyping, the custom OpenTelemetry path offers the best combination of flexibility and portability, with no structural vendor dependency.
Building the Custom Callback Handler
The handler below integrates with OpenTelemetry using the opentelemetry-sdk and opentelemetry-exporter-otlp packages. It uses Pydantic for trace configuration, which keeps the handler testable and separates concerns cleanly.
```python
from __future__ import annotations

import time
from typing import Any, Optional, Union
from uuid import UUID

from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.trace import Status, StatusCode
from pydantic import BaseModel, Field


class TraceConfig(BaseModel):
    """Configuration for what to capture in each span."""

    capture_prompts: bool = Field(
        default=True, description="Include full prompt text in span attributes."
    )
    capture_completions: bool = Field(
        default=True, description="Include full completion text in span attributes."
    )
    capture_token_usage: bool = Field(
        default=True, description="Record prompt_tokens, completion_tokens, total_tokens."
    )
    max_prompt_chars: int = Field(
        default=4000, description="Truncate prompt attributes beyond this length."
    )
    prompt_version: Optional[str] = Field(
        default=None,
        description="Prompt version identifier, e.g. 'v1.3.2'. Attached to every LLM span.",
    )
    service_name: str = Field(
        default="langchain-agent", description="OTel service.name attribute."
    )
    otlp_endpoint: str = Field(
        default="http://localhost:4317", description="OTLP gRPC collector endpoint."
    )


def build_tracer_provider(config: TraceConfig) -> TracerProvider:
    """Configure OTel SDK with a BatchSpanProcessor (non-blocking export)."""
    exporter = OTLPSpanExporter(endpoint=config.otlp_endpoint)
    provider = TracerProvider()
    # BatchSpanProcessor exports in a background thread; critical for latency.
    provider.add_span_processor(BatchSpanProcessor(exporter))
    return provider


class OTelCallbackHandler(BaseCallbackHandler):
    """
    Custom LangChain callback handler that writes execution traces
    to an OpenTelemetry-compatible backend via OTLP.

    Usage:
        config = TraceConfig(prompt_version="v2.1.0", capture_prompts=False)
        handler = OTelCallbackHandler(config)
        chain.invoke({"query": "..."}, config={"callbacks": [handler]})
    """

    def __init__(self, config: TraceConfig) -> None:
        self.config = config
        provider = build_tracer_provider(config)
        trace.set_tracer_provider(provider)
        self._tracer = trace.get_tracer(config.service_name)
        # Maps run_id to its open OTel span, so we can end the span in the matching hook.
        self._spans: dict[UUID, Any] = {}
        self._start_times: dict[UUID, float] = {}

    # -------------------------------------------------------------------------
    # LLM hooks
    # -------------------------------------------------------------------------

    def on_llm_start(
        self,
        serialized: dict[str, Any],
        prompts: list[str],
        *,
        run_id: UUID,
        parent_run_id: Optional[UUID] = None,
        **kwargs: Any,
    ) -> None:
        span = self._tracer.start_span(
            name="llm.call",
            attributes=self._base_attrs(serialized, parent_run_id),
        )
        if self.config.capture_prompts and prompts:
            prompt_text = prompts[0][: self.config.max_prompt_chars]
            span.set_attribute("llm.prompt", prompt_text)
        if self.config.prompt_version:
            span.set_attribute("llm.prompt_version", self.config.prompt_version)

        model_name = (
            serialized.get("kwargs", {}).get("model_name")
            or serialized.get("id", ["unknown"])[-1]
        )
        span.set_attribute("llm.model", model_name)

        self._spans[run_id] = span
        self._start_times[run_id] = time.perf_counter()

    def on_llm_end(
        self,
        response: LLMResult,
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span is None:
            return

        elapsed_ms = (time.perf_counter() - self._start_times.pop(run_id, 0)) * 1000
        span.set_attribute("llm.latency_ms", round(elapsed_ms, 2))

        if self.config.capture_token_usage and response.llm_output:
            usage = response.llm_output.get("token_usage", {})
            span.set_attribute("llm.tokens.prompt", usage.get("prompt_tokens", 0))
            span.set_attribute("llm.tokens.completion", usage.get("completion_tokens", 0))
            span.set_attribute("llm.tokens.total", usage.get("total_tokens", 0))

        if self.config.capture_completions and response.generations:
            text = response.generations[0][0].text[: self.config.max_prompt_chars]
            span.set_attribute("llm.completion", text)

        span.set_status(Status(StatusCode.OK))
        span.end()

    def on_llm_error(
        self,
        error: Union[Exception, KeyboardInterrupt],
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            span.set_status(Status(StatusCode.ERROR, str(error)))
            span.record_exception(error)
            span.end()
        self._start_times.pop(run_id, None)

    # -------------------------------------------------------------------------
    # Tool hooks
    # -------------------------------------------------------------------------

    def on_tool_start(
        self,
        serialized: dict[str, Any],
        input_str: str,
        *,
        run_id: UUID,
        parent_run_id: Optional[UUID] = None,
        **kwargs: Any,
    ) -> None:
        span = self._tracer.start_span(
            name=f"tool.{serialized.get('name', 'unknown')}",
            attributes={
                **self._base_attrs(serialized, parent_run_id),
                "tool.input": input_str[:500],
            },
        )
        self._spans[run_id] = span
        self._start_times[run_id] = time.perf_counter()

    def on_tool_end(
        self,
        output: str,
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            elapsed_ms = (time.perf_counter() - self._start_times.pop(run_id, 0)) * 1000
            span.set_attribute("tool.latency_ms", round(elapsed_ms, 2))
            span.set_attribute("tool.output", str(output)[:500])
            span.set_status(Status(StatusCode.OK))
            span.end()

    def on_tool_error(
        self,
        error: Union[Exception, KeyboardInterrupt],
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            span.set_status(Status(StatusCode.ERROR, str(error)))
            span.record_exception(error)
            span.end()
        self._start_times.pop(run_id, None)

    # -------------------------------------------------------------------------
    # Chain hooks
    # -------------------------------------------------------------------------

    def on_chain_start(
        self,
        serialized: dict[str, Any],
        inputs: dict[str, Any],
        *,
        run_id: UUID,
        parent_run_id: Optional[UUID] = None,
        **kwargs: Any,
    ) -> None:
        span = self._tracer.start_span(
            name=f"chain.{serialized.get('id', ['unknown'])[-1]}",
            attributes=self._base_attrs(serialized, parent_run_id),
        )
        self._spans[run_id] = span
        self._start_times[run_id] = time.perf_counter()

    def on_chain_end(
        self,
        outputs: dict[str, Any],
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            elapsed_ms = (time.perf_counter() - self._start_times.pop(run_id, 0)) * 1000
            span.set_attribute("chain.latency_ms", round(elapsed_ms, 2))
            span.set_status(Status(StatusCode.OK))
            span.end()

    def on_chain_error(
        self,
        error: Union[Exception, KeyboardInterrupt],
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            span.set_status(Status(StatusCode.ERROR, str(error)))
            span.record_exception(error)
            span.end()
        self._start_times.pop(run_id, None)

    # -------------------------------------------------------------------------
    # Helpers
    # -------------------------------------------------------------------------

    def _base_attrs(
        self, serialized: dict[str, Any], parent_run_id: Optional[UUID]
    ) -> dict[str, Any]:
        attrs: dict[str, Any] = {
            "service.name": self.config.service_name,
        }
        if parent_run_id:
            attrs["langchain.parent_run_id"] = str(parent_run_id)
        return attrs
```
A few decisions in this implementation deserve explanation. The BatchSpanProcessor is non-negotiable for production: it buffers spans and exports them in a background thread, so the callback method returns immediately without waiting on network I/O. The synchronous SimpleSpanProcessor would add collector round-trip latency to every LLM call. That is the primary performance footgun in callback handler implementations.
The _spans dict maps run_id to an open span. This is the bridge between the start hook (where the span is opened) and the end hook (where it is closed). The pattern is identical across LLM, tool, and chain pairs. Keep this dict and keep it clean: always pop on end or error, never leave spans open.
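One way to keep that mapping clean is to wrap it in a small registry that can also sweep spans whose end hook never fired (for example, after a cancelled request). The sketch below is an assumption layered on top of the handler, not part of it: SpanRegistry, sweep, max_age_s, and FakeSpan are hypothetical names, and the only thing it expects from a span is an end() method.

```python
import time
from typing import Any, Optional
from uuid import UUID, uuid4


class SpanRegistry:
    """Hypothetical wrapper around the run_id -> (span, start time) mapping."""

    def __init__(self) -> None:
        self._open: dict[UUID, tuple[Any, float]] = {}

    def open(self, run_id: UUID, span: Any) -> None:
        self._open[run_id] = (span, time.monotonic())

    def close(self, run_id: UUID) -> Optional[tuple[Any, float]]:
        """Pop the span and return it with its elapsed milliseconds, if present."""
        entry = self._open.pop(run_id, None)
        if entry is None:
            return None
        span, started = entry
        return span, (time.monotonic() - started) * 1000

    def sweep(self, max_age_s: float) -> int:
        """End and drop spans older than max_age_s; returns how many were swept."""
        now = time.monotonic()
        stale = [rid for rid, (_, t0) in self._open.items() if now - t0 > max_age_s]
        for rid in stale:
            span, _ = self._open.pop(rid)
            span.end()
        return len(stale)

    def __len__(self) -> int:
        return len(self._open)


class FakeSpan:
    """Stand-in for an OTel span so the sketch runs without a collector."""

    def __init__(self) -> None:
        self.ended = False

    def end(self) -> None:
        self.ended = True


registry = SpanRegistry()
leaked, finished = FakeSpan(), FakeSpan()
leaked_id, finished_id = uuid4(), uuid4()
registry.open(leaked_id, leaked)
registry.open(finished_id, finished)
registry.close(finished_id)               # normal path: the end hook fired
swept = registry.sweep(max_age_s=-1.0)    # negative age forces the sweep in this demo
```

In a real handler the sweep would run periodically with a generous age limit, and len(registry) is a natural gauge to export for monitoring.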
What to Trace and Why
Not everything is worth capturing. The following signals have direct diagnostic or economic value in production:
Token usage per LLM call. This is the primary cost signal. Without it, you cannot attribute spend to specific chains, users, or prompt versions. When the model wrapper surfaces usage data, LLMResult.llm_output["token_usage"] is where you typically read prompt tokens, completion tokens, and total tokens. Aggregate by chain name, model, and prompt version. For the full cost attribution architecture beyond individual spans, see LLM cost observability: token-level attribution and spend patterns that signal trouble.
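As an illustration of that aggregation, the following sketch sums spend per (chain, model, prompt version) from token counts. The prices, the model name "model-a", and the LLMCall record are all made up for the example; substitute whatever your spans and your provider's price sheet actually contain.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical USD prices per 1,000 tokens; replace with your provider's real rates.
PRICE_PER_1K = {
    "model-a": {"prompt": 0.50, "completion": 1.50},
}


@dataclass(frozen=True)
class LLMCall:
    """Hypothetical flattened view of one LLM span."""
    chain: str
    model: str
    prompt_version: str
    prompt_tokens: int
    completion_tokens: int


def call_cost(call: LLMCall) -> float:
    """Cost of a single call under the price table above."""
    price = PRICE_PER_1K[call.model]
    return (
        call.prompt_tokens / 1000 * price["prompt"]
        + call.completion_tokens / 1000 * price["completion"]
    )


def spend_by_key(calls: list[LLMCall]) -> dict[tuple[str, str, str], float]:
    """Total spend grouped by (chain, model, prompt_version)."""
    totals: dict[tuple[str, str, str], float] = defaultdict(float)
    for call in calls:
        totals[(call.chain, call.model, call.prompt_version)] += call_cost(call)
    return dict(totals)


calls = [
    LLMCall("qa", "model-a", "v1", 1000, 2000),
    LLMCall("qa", "model-a", "v1", 2000, 0),
    LLMCall("qa", "model-a", "v2", 4000, 2000),
]
totals = spend_by_key(calls)
```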
Latency at each layer. The callback system gives you start and end events at three levels: LLM, tool, and chain. Measure latency at all three. A slow chain with fast individual LLM calls indicates tool call overhead or inter-step processing. A slow LLM call with normal completion token count indicates model API latency, not your code.
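A rough way to separate those cases is to compute each span's self time: its own latency minus the latency of its direct children. The sketch below assumes children run sequentially (parallel children would make the subtraction overcount) and uses a hypothetical SpanRecord shape.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanRecord:
    """Hypothetical flattened span: IDs, layer, and measured latency."""
    span_id: str
    parent_id: Optional[str]
    kind: str  # "chain", "llm", or "tool"
    latency_ms: float


def self_time_ms(spans: list[SpanRecord]) -> dict[str, float]:
    """Latency of each span minus the summed latency of its direct children."""
    child_total: dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.latency_ms
            )
    return {s.span_id: s.latency_ms - child_total.get(s.span_id, 0.0) for s in spans}


spans = [
    SpanRecord("chain-1", None, "chain", 2400.0),
    SpanRecord("llm-1", "chain-1", "llm", 900.0),
    SpanRecord("tool-1", "chain-1", "tool", 1200.0),
]
overhead = self_time_ms(spans)
```

Here the chain's self time is what remains after the LLM and tool calls are accounted for, which is the inter-step processing the paragraph above refers to.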
Prompt version. When you change a prompt template, you need to know whether the change improved or degraded output quality. Attaching a version string to every LLM span (passed through the metadata dict on the invocation config, or set once in the handler config) lets you filter traces by version in any OTel-compatible backend without a dedicated prompt registry. For the full treatment of prompt drift detection and the metrics that predict quality degradation before it reaches users, see prompt observability: versioning, drift detection, and the metrics that matter.
Chain execution path. For agent systems, the parent_run_id chain reconstructs which chains were executed, in what order, and how long each took. This is the trace you need when a user reports an unexpected response and you need to understand the execution path that produced it. The AI observability data model covers what queries this data supports.
Tool call payloads (with caution). Tool inputs and outputs are high-value debugging data, but they frequently contain PII or sensitive business data. The implementation above truncates at 500 characters. In regulated environments, consider hashing or redacting before attaching to spans — the callback gives you the full string, but you control what goes into the span attribute.
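A minimal sketch of that redaction step, assuming email addresses are the only sensitive pattern (real systems usually need a broader rule set). The function names and the tool.input_sha256 attribute are hypothetical; the fingerprint lets you correlate identical payloads across spans without storing them, though a hash of a short or guessable value should not be treated as anonymization.

```python
import hashlib
import re

# Deliberately simple email pattern for illustration; not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def payload_fingerprint(text: str) -> str:
    """Short SHA-256 prefix of the raw payload, for correlation only."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def safe_tool_attrs(raw: str, max_chars: int = 500) -> dict[str, str]:
    """Span attributes for a tool input: redacted, truncated text plus a fingerprint."""
    return {
        "tool.input": redact_emails(raw)[:max_chars],
        "tool.input_sha256": payload_fingerprint(raw),
    }


attrs = safe_tool_attrs("lookup order for jane.doe@example.com")
```

Redacting before truncating matters: truncating first could cut an address in half and let the fragment slip past the pattern.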
Setting capture_prompts=False in TraceConfig disables prompt capture entirely. For compliance environments, set it to False and log prompts separately to a store with appropriate access controls.
Attaching Per-Request Context
Global handlers capture all invocations, but production systems need per-request context: user ID, session ID, tenant ID, request source. The right pattern is to create a request-scoped handler instance and pass it via the callbacks parameter:
```python
from langchain_core.runnables.config import RunnableConfig


def handle_request(user_id: str, query: str, chain) -> str:
    # Extend base config with request-scoped metadata.
    request_config = TraceConfig(
        prompt_version="v2.1.0",
        service_name="langchain-agent",
        otlp_endpoint="http://otel-collector:4317",
    )
    # Caveat: as written, OTelCallbackHandler.__init__ builds a new TracerProvider
    # (and a BatchSpanProcessor thread) on every construction. For per-request
    # handlers, build the provider once at startup and reuse it.
    handler = OTelCallbackHandler(request_config)

    # Attach user_id as a span attribute by subclassing or monkey-patching,
    # or use baggage propagation if your OTel setup supports it.
    result = chain.invoke(
        {"query": query},
        config=RunnableConfig(callbacks=[handler], metadata={"user_id": user_id}),
    )
    return result
```
The metadata dict on RunnableConfig is passed through to callback hooks via kwargs["metadata"]. You can read it inside on_chain_start or on_llm_start and attach it to the span. This keeps user context in the trace without requiring a global context variable.
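Reading that metadata inside a hook can be as small as the helper sketched below. It is an illustration under assumptions: request_attrs and ALLOWED_KEYS are hypothetical names, and the allowlist exists so that arbitrary metadata does not leak into span attributes. Inside on_chain_start or on_llm_start, the result could be applied with span.set_attributes(request_attrs(kwargs)).

```python
from typing import Any

# Hypothetical allowlist of metadata keys that are safe to copy onto spans.
ALLOWED_KEYS = ("user_id", "session_id", "tenant_id")


def request_attrs(hook_kwargs: dict[str, Any]) -> dict[str, str]:
    """Turn allow-listed metadata entries into flat, string-valued span attributes."""
    metadata = hook_kwargs.get("metadata") or {}
    return {
        f"request.{key}": str(metadata[key])
        for key in ALLOWED_KEYS
        if metadata.get(key) is not None
    }


# Simulated **kwargs as a callback hook might receive them.
attrs = request_attrs(
    {"metadata": {"user_id": "u-123", "debug_blob": {"x": 1}}, "tags": []}
)
```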
Avoiding Vendor Lock-In in Practice
The handler above has zero vendor-specific imports. It depends on opentelemetry-sdk and opentelemetry-exporter-otlp, both part of OpenTelemetry, a CNCF project. The exporter endpoint is configuration: you point it at an OTLP-capable backend such as Jaeger, Grafana Tempo, or Honeycomb, or at a collector that writes to your own ClickHouse table, by changing a URL. Moving from one compatible backend to another should be a configuration change rather than a handler rewrite.
Contrast this with LangChainTracer from the LangSmith SDK: it imports from langsmith, writes to LangSmith’s ingestion endpoint, uses LangSmith’s trace schema, and requires a LangSmith API key. If LangSmith changes their SDK interface, raises prices, or has an outage, your observability pipeline breaks. The LangSmith SDK version pins become a dependency constraint across your team.
This is not an argument against LangSmith during development. The UI is genuinely useful for interactive debugging. The argument is that LangSmith should be an optional consumer of your trace data, not the only store. The pattern: emit OTel spans from your custom handler to a collector, configure a forwarder or exporter that also sends to LangSmith if needed. You get both the vendor’s UI and full data ownership.
The surviving LangChain version upgrades post documents how vendor SDK dependencies compound during framework upgrades — the callback interface itself is stable across versions, but third-party handler implementations frequently break on minor LangChain releases.
Performance Impact of Callbacks
The callback invocation itself is an in-process Python method call. What matters is the handler body. The three patterns that add meaningful latency:
- Synchronous network calls in the handler. Calling a remote API, flushing to a database, or pushing to a queue synchronously in the handler body adds that call’s latency to the chain’s critical path. Use BatchSpanProcessor for OTel, or push to an in-process queue and drain it in a background thread.
- String operations on large payloads. If you are logging 32K-token prompts, the string operations inside the handler can be measurable. The max_prompt_chars truncation in TraceConfig controls this. For context window economics at scale, see the context window economics post.
- Span dict operations on high-frequency chains. If a chain invokes hundreds of sub-chains in rapid succession (e.g., batch processing), the _spans dict grows and shrinks rapidly. The dict operations are O(1) but at sufficient volume the garbage collector pressure is measurable. In extreme cases, profile before assuming the handler is the bottleneck.
In practice, with BatchSpanProcessor and prompt truncation configured, the callback work is usually limited to local method execution and a queue append. For most agent architectures, remote model latency, tool latency, and payload size dominate the callback overhead.
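For handlers that write somewhere other than OTel, the in-process queue pattern mentioned above can be sketched with the standard library alone. BackgroundSink and its write callable are hypothetical; the sketch drops records when the queue is full rather than blocking the chain, which is a design choice you may not want for audit data.

```python
import queue
import threading
from typing import Any, Callable


class BackgroundSink:
    """Hypothetical sink: emit() enqueues and returns; a worker thread does the slow write."""

    _STOP = object()  # sentinel that tells the worker to exit

    def __init__(
        self, write: Callable[[dict[str, Any]], None], maxsize: int = 10_000
    ) -> None:
        self._write = write
        self._queue: queue.Queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def emit(self, record: dict[str, Any]) -> None:
        """Called from the callback hook; never blocks the chain."""
        try:
            self._queue.put_nowait(record)
        except queue.Full:
            self.dropped += 1  # shed load rather than stall the request

    def _drain(self) -> None:
        """Worker loop: write records in arrival order until the sentinel arrives."""
        while True:
            item = self._queue.get()
            if item is self._STOP:
                return
            self._write(item)

    def close(self) -> None:
        """Flush everything already queued, then stop the worker."""
        self._queue.put(self._STOP)
        self._worker.join()


written: list[dict[str, Any]] = []
sink = BackgroundSink(write=written.append)
for i in range(3):
    sink.emit({"event": "llm_end", "seq": i})
sink.close()
```

Calling close() at shutdown matters for the same reason flushing a BatchSpanProcessor does: records still in the queue are otherwise lost when the process exits.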
For debugging failures in agent systems where the trace alone is insufficient, the debugging CrewAI agent failures post covers complementary techniques for reconstructing what happened when the trace is incomplete.
Checklist Before Putting This in Production
- Verify BatchSpanProcessor is configured; confirm there is no SimpleSpanProcessor in the exporter chain
- Set capture_prompts=False, or configure truncation via max_prompt_chars, before enabling prompt logging in production
- Confirm prompt version strings are attached to all LLM spans; verify by querying for spans without the llm.prompt_version attribute
- Test error paths: introduce a deliberate LLM error and verify the span closes with ERROR status and the exception is recorded
- Confirm the _spans dict does not grow unboundedly; add monitoring on its length in high-throughput systems
- Validate parent_run_id linkage by running a multi-step chain and confirming the trace reconstructs correctly in your OTel backend
- Review what data flows through tool spans for PII before enabling tool output logging in environments with user-supplied data
What This Gives You That Vendor Handlers Do Not
A custom handler gives you three things a vendor handler cannot:
First, full control over the data pipeline. You decide what gets captured, what gets redacted, where it goes, and how long it is retained. If your data governance policy requires that prompts containing user data never leave your VPC, you enforce that in the handler body.
Second, composability. You can run multiple handlers simultaneously — one that writes to OTel, one that writes latency metrics to Prometheus, one that publishes to an internal audit log. Vendor handlers are designed for one destination.
Third, a cleaner upgrade boundary across the LangChain release cycle. The BaseCallbackHandler interface is a narrower dependency than a full vendor handler SDK. You still need version-pin tests around callback signatures, but your instrumentation layer is easier to adapt when it only depends on the framework hooks you use directly.
The callback architecture is the right layer to instrument. It fires on the semantically meaningful events in LangChain’s execution model — not at the HTTP request level, not at the Python function call level, but at the points where an LLM was called, a tool was used, a chain completed. That is the data model you need to answer the questions that matter in production.
What is the LangChain callback system and how does it work?
LangChain's callback system is an event hook architecture that fires on every significant event in a chain or agent run: LLM start, LLM end, tool start, tool end, chain start, chain end, and on errors at each level. Callback handlers implement these hooks as Python methods, and LangChain invokes them automatically. You can attach handlers at construction time (on the model, tool, or chain object) or per-run (via the callbacks entry in the invocation config). Multiple handlers can be active simultaneously, so you can ship to OpenTelemetry and a local log store at the same time.
Does building a custom callback handler meaningfully impact production latency?
It depends on what the handler does. The callback invocation itself is an in-process Python method call; the meaningful cost usually comes from the handler body. If you synchronously flush a span to a remote collector on every LLM end event, you add that network round-trip to the chain's critical path. The correct pattern is to use background threads or async exporters so the handler returns immediately. With a BatchSpanProcessor configured correctly, the overhead is usually limited to local method work and a queue append rather than remote I/O.
Can LangChain callbacks capture prompt content, or only metadata?
Callback hooks expose the payloads LangChain passes into each event. The on_llm_start hook receives the list of prompts sent to the model. The on_llm_end hook receives the LLMResult object, which can include generated text, provider-reported token usage, and model metadata. The on_chat_model_start hook receives the list of message objects. You can log, hash, truncate, or redact this content inside the handler, subject to what each model wrapper and provider response makes available.
How do you version prompts in a LangChain callback system without a third-party platform?
Attach prompt version metadata as a tag or attribute on the span. One workable pattern is to define a PromptVersion enum or dataclass in your codebase, pass the version identifier through the metadata dict in the invocation config, and read it inside on_llm_start to attach it to the span. When you change a prompt template, increment the version. This lets you filter traces by prompt version in any OTel-compatible backend (Jaeger, Grafana Tempo, Honeycomb, your own ClickHouse table) without depending on a vendor's prompt registry.
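A minimal sketch of that pattern follows. The enum members, version strings, and helper names are invented for illustration; the only LangChain-facing assumption is that the returned dict is passed as the config argument to chain.invoke and that hooks see the metadata under kwargs["metadata"], as described earlier.

```python
from enum import Enum
from typing import Any


class PromptVersion(str, Enum):
    """Hypothetical registry of prompt template versions kept in the codebase."""
    SUMMARIZE_V1 = "summarize@1.0.0"
    SUMMARIZE_V2 = "summarize@2.0.0"


def run_config(version: PromptVersion, **extra: Any) -> dict[str, Any]:
    """Build the config dict to pass as chain.invoke(inputs, config=...)."""
    return {"metadata": {"prompt_version": version.value, **extra}}


def version_from_hook(hook_kwargs: dict[str, Any]) -> str:
    """What on_llm_start would read before setting llm.prompt_version on the span."""
    return (hook_kwargs.get("metadata") or {}).get("prompt_version", "unversioned")


config = run_config(PromptVersion.SUMMARIZE_V2, user_id="u-123")
seen = version_from_hook({"metadata": config["metadata"]})
```

Falling back to "unversioned" rather than omitting the attribute makes spans that missed the metadata easy to find with a single query.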
The decision rule
If you are building LangChain-based agents for production, instrument custom tracing, token cost attribution, prompt versioning, and vendor-independent trace storage before the system becomes hard to inspect. The Enterprise Agentic Assessment Kit includes an observability instrumentation checklist alongside production-readiness criteria for callback architecture.