LangChain Callback Architecture for Vendor-Free Observability

Most LangChain observability setups lock you into a vendor’s callback handler. LangSmith’s LangChainTracer, Langfuse’s handler, Weights & Biases — each ships a pre-built handler that wires your chain’s events to their platform. The convenience is real, but the dependency is structural: your observability pipeline is now coupled to that vendor’s SDK version, their ingestion limits, their pricing tier, and their data retention policy.

The callback architecture itself does not require this. LangChain’s callback system is a general-purpose event hook interface. It knows nothing about LangSmith. It fires events — LLM start, LLM end, tool call, chain completion, error — and delivers the full payload to any handler you attach. You can implement a handler that writes to OpenTelemetry, to a Postgres table, to a local file, or to all three simultaneously. The vendor handlers are built on the same interface you have access to.

This post covers the callback architecture in enough detail to build a production-grade handler, how to integrate with OpenTelemetry without a vendor intermediary, what to trace and why, and the performance characteristics of callbacks in production.

How the Callback System Actually Works

LangChain’s callback system is implemented as a set of abstract base classes in langchain_core.callbacks. The core interface is BaseCallbackHandler, which defines methods for every event the framework can emit:

on_llm_start — fires before the LLM call, receives prompts and run metadata
on_llm_end — fires after the LLM call, receives LLMResult with generated text and any provider-reported token counts
on_llm_error — fires on LLM exception
on_chat_model_start — fires before a chat model call, receives message list
on_tool_start / on_tool_end / on_tool_error — tool-level hooks
on_chain_start / on_chain_end / on_chain_error — chain-level hooks
on_agent_action / on_agent_finish — agent-level hooks

Each method receives a run_id (a UUID unique to this invocation) and a parent_run_id (the UUID of the parent chain or agent, if any). This parent-child relationship is how you reconstruct execution trees: an agent run spawns a chain, the chain spawns an LLM call, each has a parent_run_id pointing up the tree.

Callbacks are attached in two ways. The first is at construction time, by passing callbacks=[...] when you build a model, tool, or chain; the handlers then fire for every call made through that object. The second is per invocation, through the callbacks key of the config argument to .invoke(), .stream(), or .astream(), which is useful for request-scoped handlers that carry per-request context like a user ID or request trace ID.

Principle: The parent_run_id chain is your distributed trace. Every event in a LangChain execution carries the UUID of its parent, which means you can reconstruct the full execution tree — agent → chain → tool → LLM — from a flat stream of events. This is the same model OpenTelemetry uses with span parent relationships. The callback system gives you the raw material; your handler decides how to store and query it.
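To make the reconstruction concrete, here is a minimal sketch that rebuilds an execution tree from a flat list of events. It uses only the standard library; the RunEvent record and the render_tree helper are hypothetical names for illustration, standing in for whatever your handler actually stores from its start hooks.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4


@dataclass(frozen=True)
class RunEvent:
    """Hypothetical flat record a handler might store from a start hook."""

    run_id: UUID
    parent_run_id: Optional[UUID]
    name: str


def render_tree(events: list[RunEvent]) -> list[str]:
    """Group events by parent, then walk from the roots (parent is None)."""
    children: dict[Optional[UUID], list[RunEvent]] = defaultdict(list)
    for event in events:
        children[event.parent_run_id].append(event)

    lines: list[str] = []

    def walk(parent: Optional[UUID], depth: int) -> None:
        for event in children.get(parent, []):
            lines.append("  " * depth + event.name)
            walk(event.run_id, depth + 1)

    walk(None, 0)
    return lines


agent, chain, tool, llm = uuid4(), uuid4(), uuid4(), uuid4()
events = [
    RunEvent(agent, None, "agent"),
    RunEvent(chain, agent, "chain"),
    RunEvent(tool, chain, "tool"),
    RunEvent(llm, chain, "llm"),
]
tree = render_tree(events)
print("\n".join(tree))
```

The same grouping works as a SQL self-join or a backend trace view; the point is that nothing beyond run_id and parent_run_id is needed to recover the shape of the run.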

Comparing Observability Approaches

Before building, the right architecture depends on your constraints. The table below covers the four realistic options:

Approach	Setup Cost	Vendor Dependency	Data Ownership	Query Flexibility	Best For
LangSmith	Low — SDK + API key	High — SDK version, pricing tier, retention limits	None — data lives on LangSmith servers	LangSmith UI only	Prototype / early development
Custom OpenTelemetry	Medium — handler + collector config	None — OTel is a CNCF standard	Full — you own the backend	Any OTel-compatible backend (Jaeger, Tempo, Honeycomb, ClickHouse)	Production systems with existing infra
Custom Logging	Low — structured log handler	None	Full	Limited to log query tools unless you parse and re-index	Simple systems, early production
Hybrid (OTel + local log)	Medium-high	None for OTel; log backend of your choice	Full	High — spans in OTel backend, full payloads in log store	Systems with compliance requirements or PII in prompts

For most production systems beyond early prototyping, the custom OpenTelemetry path gives you the best combination of flexibility, portability, and no structural vendor dependency.

Building the Custom Callback Handler

The handler below integrates with OpenTelemetry using the standard opentelemetry-sdk package. It uses Pydantic for trace configuration, which keeps the handler testable and separates concerns cleanly.

from __future__ import annotations

import time
from typing import Any, Optional, Union
from uuid import UUID

from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.trace import Status, StatusCode
from pydantic import BaseModel, Field


class TraceConfig(BaseModel):
    """Configuration for what to capture in each span."""

    capture_prompts: bool = Field(
        default=True,
        description="Include full prompt text in span attributes."
    )
    capture_completions: bool = Field(
        default=True,
        description="Include full completion text in span attributes."
    )
    capture_token_usage: bool = Field(
        default=True,
        description="Record prompt_tokens, completion_tokens, total_tokens."
    )
    max_prompt_chars: int = Field(
        default=4000,
        description="Truncate prompt attributes beyond this length."
    )
    prompt_version: Optional[str] = Field(
        default=None,
        description="Prompt version identifier, e.g. 'v1.3.2'. Attached to every LLM span."
    )
    service_name: str = Field(
        default="langchain-agent",
        description="OTel service.name attribute."
    )
    otlp_endpoint: str = Field(
        default="http://localhost:4317",
        description="OTLP gRPC collector endpoint."
    )


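def build_tracer_provider(config: TraceConfig) -> TracerProvider:
    """Configure OTel SDK with a BatchSpanProcessor (non-blocking export)."""
    from opentelemetry.sdk.resources import Resource

    exporter = OTLPSpanExporter(endpoint=config.otlp_endpoint)
    # service.name is a resource attribute: set it on the provider so backends
    # group spans under the right service instead of "unknown_service".
    resource = Resource.create({"service.name": config.service_name})
    provider = TracerProvider(resource=resource)
    # BatchSpanProcessor exports in a background thread, which is critical for latency.
    provider.add_span_processor(BatchSpanProcessor(exporter))
    return provider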


class OTelCallbackHandler(BaseCallbackHandler):
    """
    Custom LangChain callback handler that writes execution traces
    to an OpenTelemetry-compatible backend via OTLP.

    Usage:
        config = TraceConfig(prompt_version="v2.1.0", capture_prompts=False)
        handler = OTelCallbackHandler(config)
        chain.invoke({"query": "..."}, config={"callbacks": [handler]})
    """

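    def __init__(
        self,
        config: TraceConfig,
        tracer_provider: Optional[TracerProvider] = None,
    ) -> None:
        self.config = config
        # Reuse a shared provider when one is passed in; otherwise build one.
        # Each provider owns its own exporter thread, so share it across handlers.
        provider = tracer_provider or build_tracer_provider(config)
        # Ask this provider for the tracer directly: the process-wide provider
        # can only be set once, so later set_tracer_provider() calls are ignored.
        self._tracer = provider.get_tracer(config.service_name)
        # Maps run_id → OTel span, so we can end the span in the matching hook.
        self._spans: dict[UUID, Any] = {}
        self._start_times: dict[UUID, float] = {}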

    # -------------------------------------------------------------------------
    # LLM hooks
    # -------------------------------------------------------------------------

    def on_llm_start(
        self,
        serialized: dict[str, Any],
        prompts: list[str],
        *,
        run_id: UUID,
        parent_run_id: Optional[UUID] = None,
        **kwargs: Any,
    ) -> None:
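        # Parent under the caller's open span so the backend sees a single trace.
        parent_span = self._spans.get(parent_run_id) if parent_run_id else None
        span = self._tracer.start_span(
            name="llm.call",
            context=trace.set_span_in_context(parent_span) if parent_span else None,
            attributes=self._base_attrs(serialized, parent_run_id),
        )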
        if self.config.capture_prompts and prompts:
            prompt_text = prompts[0][: self.config.max_prompt_chars]
            span.set_attribute("llm.prompt", prompt_text)
        if self.config.prompt_version:
            span.set_attribute("llm.prompt_version", self.config.prompt_version)

        model_name = serialized.get("kwargs", {}).get("model_name") or serialized.get("id", ["unknown"])[-1]
        span.set_attribute("llm.model", model_name)

        self._spans[run_id] = span
        self._start_times[run_id] = time.perf_counter()

    def on_llm_end(
        self,
        response: LLMResult,
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span is None:
            return

        elapsed_ms = (time.perf_counter() - self._start_times.pop(run_id, 0)) * 1000
        span.set_attribute("llm.latency_ms", round(elapsed_ms, 2))

        if self.config.capture_token_usage and response.llm_output:
            usage = response.llm_output.get("token_usage", {})
            span.set_attribute("llm.tokens.prompt", usage.get("prompt_tokens", 0))
            span.set_attribute("llm.tokens.completion", usage.get("completion_tokens", 0))
            span.set_attribute("llm.tokens.total", usage.get("total_tokens", 0))

        if self.config.capture_completions and response.generations:
            text = response.generations[0][0].text[: self.config.max_prompt_chars]
            span.set_attribute("llm.completion", text)

        span.set_status(Status(StatusCode.OK))
        span.end()

    def on_llm_error(
        self,
        error: Union[Exception, KeyboardInterrupt],
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            span.set_status(Status(StatusCode.ERROR, str(error)))
            span.record_exception(error)
            span.end()
        self._start_times.pop(run_id, None)

    # -------------------------------------------------------------------------
    # Tool hooks
    # -------------------------------------------------------------------------

    def on_tool_start(
        self,
        serialized: dict[str, Any],
        input_str: str,
        *,
        run_id: UUID,
        parent_run_id: Optional[UUID] = None,
        **kwargs: Any,
    ) -> None:
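        # Parent under the caller's open span so the backend sees a single trace.
        parent_span = self._spans.get(parent_run_id) if parent_run_id else None
        span = self._tracer.start_span(
            name=f"tool.{serialized.get('name', 'unknown')}",
            context=trace.set_span_in_context(parent_span) if parent_span else None,
            attributes={
                **self._base_attrs(serialized, parent_run_id),
                "tool.input": input_str[:500],
            },
        )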
        self._spans[run_id] = span
        self._start_times[run_id] = time.perf_counter()

    def on_tool_end(
        self,
        output: str,
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            elapsed_ms = (time.perf_counter() - self._start_times.pop(run_id, 0)) * 1000
            span.set_attribute("tool.latency_ms", round(elapsed_ms, 2))
            span.set_attribute("tool.output", str(output)[:500])
            span.set_status(Status(StatusCode.OK))
            span.end()

    def on_tool_error(
        self,
        error: Union[Exception, KeyboardInterrupt],
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            span.set_status(Status(StatusCode.ERROR, str(error)))
            span.record_exception(error)
            span.end()
        self._start_times.pop(run_id, None)

    # -------------------------------------------------------------------------
    # Chain hooks
    # -------------------------------------------------------------------------

    def on_chain_start(
        self,
        serialized: dict[str, Any],
        inputs: dict[str, Any],
        *,
        run_id: UUID,
        parent_run_id: Optional[UUID] = None,
        **kwargs: Any,
    ) -> None:
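        # serialized may be None in recent langchain_core; the name then arrives in kwargs.
        chain_name = kwargs.get("name") or (serialized or {}).get("id", ["unknown"])[-1]
        parent_span = self._spans.get(parent_run_id) if parent_run_id else None
        span = self._tracer.start_span(
            name=f"chain.{chain_name}",
            context=trace.set_span_in_context(parent_span) if parent_span else None,
            attributes=self._base_attrs(serialized, parent_run_id),
        )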
        self._spans[run_id] = span
        self._start_times[run_id] = time.perf_counter()

    def on_chain_end(
        self,
        outputs: dict[str, Any],
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            elapsed_ms = (time.perf_counter() - self._start_times.pop(run_id, 0)) * 1000
            span.set_attribute("chain.latency_ms", round(elapsed_ms, 2))
            span.set_status(Status(StatusCode.OK))
            span.end()

    def on_chain_error(
        self,
        error: Union[Exception, KeyboardInterrupt],
        *,
        run_id: UUID,
        **kwargs: Any,
    ) -> None:
        span = self._spans.pop(run_id, None)
        if span:
            span.set_status(Status(StatusCode.ERROR, str(error)))
            span.record_exception(error)
            span.end()
        self._start_times.pop(run_id, None)

    # -------------------------------------------------------------------------
    # Helpers
    # -------------------------------------------------------------------------

    def _base_attrs(
        self, serialized: dict[str, Any], parent_run_id: Optional[UUID]
    ) -> dict[str, Any]:
        attrs: dict[str, Any] = {
            "service.name": self.config.service_name,
        }
        if parent_run_id:
            attrs["langchain.parent_run_id"] = str(parent_run_id)
        return attrs

A few decisions in this implementation deserve explanation. The BatchSpanProcessor is non-negotiable for production: it buffers spans and exports them in a background thread, so the callback method returns immediately without waiting on network I/O. The synchronous SimpleSpanProcessor would add collector round-trip latency to every LLM call — that is the primary performance footgun in callback handler implementations.

The _spans dict maps run_id to an open span. This is the bridge between the start hook (where the span is opened) and the end hook (where it is closed). The pattern is identical across LLM, tool, and chain pairs. Keep this dict and keep it clean: always pop on end or error, never leave spans open.
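One way to keep that invariant easy to enforce is to put the open-run bookkeeping behind a small registry. The sketch below is standard-library Python only; OpenRunRegistry and its method names are hypothetical, not part of LangChain or OpenTelemetry, and the span objects are treated as opaque values. It assumes a periodic sweep is acceptable for runs whose end hook never fires.

```python
from __future__ import annotations

import time
from typing import Any, Callable, Optional
from uuid import UUID, uuid4


class OpenRunRegistry:
    """Tracks open spans by run_id and evicts ones that were never closed."""

    def __init__(
        self, max_age_s: float = 300.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._max_age_s = max_age_s
        self._clock = clock
        self._open: dict[UUID, tuple[Any, float]] = {}

    def open(self, run_id: UUID, span: Any) -> None:
        """Call from a start hook."""
        self._open[run_id] = (span, self._clock())

    def close(self, run_id: UUID) -> Optional[tuple[Any, float]]:
        """Call from an end or error hook; returns (span, elapsed_ms) or None."""
        entry = self._open.pop(run_id, None)
        if entry is None:
            return None
        span, started = entry
        return span, (self._clock() - started) * 1000

    def sweep(self) -> list[Any]:
        """Remove and return spans older than max_age_s so the caller can end them."""
        cutoff = self._clock() - self._max_age_s
        stale = [rid for rid, (_, started) in self._open.items() if started < cutoff]
        return [self._open.pop(rid)[0] for rid in stale]

    def __len__(self) -> int:
        return len(self._open)


# Demonstration with a fake clock so the timing is deterministic.
now = [0.0]
registry = OpenRunRegistry(max_age_s=10.0, clock=lambda: now[0])
run_a, run_b = uuid4(), uuid4()
registry.open(run_a, "span-a")
now[0] = 5.0
registry.open(run_b, "span-b")
now[0] = 12.0
stale_spans = registry.sweep()      # run_a is older than 10 s, run_b is not
closed = registry.close(run_b)
```

Exposing len(registry) as a gauge gives you a direct signal for leaked spans: in a healthy system it returns to zero between requests.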

What to Trace and Why

Not everything is worth capturing. The following signals have direct diagnostic or economic value in production:

Token usage per LLM call. This is the primary cost signal. Without it, you cannot attribute spend to specific chains, users, or prompt versions. When the model wrapper surfaces usage data, LLMResult.llm_output["token_usage"] is where you typically read prompt tokens, completion tokens, and total tokens. Aggregate by chain name, model, and prompt version. For the full cost attribution architecture beyond individual spans, see LLM cost observability: token-level attribution and spend patterns that signal trouble.
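As an illustration of that aggregation, the sketch below sums token counts over exported span records. It assumes the spans have already been pulled from your backend as plain dicts using the attribute names from the handler above; the "chain" key is a hypothetical field you would populate yourself, for example from the parent span's name.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any

Key = tuple[str, str, str]  # (chain, model, prompt_version)


def aggregate_tokens(spans: list[dict[str, Any]]) -> dict[Key, dict[str, int]]:
    """Sum token usage per (chain, model, prompt_version)."""
    totals: dict[Key, dict[str, int]] = defaultdict(
        lambda: {"calls": 0, "prompt": 0, "completion": 0, "total": 0}
    )
    for span in spans:
        key = (
            span.get("chain", "unknown"),
            span.get("llm.model", "unknown"),
            span.get("llm.prompt_version", "unversioned"),
        )
        bucket = totals[key]
        bucket["calls"] += 1
        bucket["prompt"] += int(span.get("llm.tokens.prompt", 0))
        bucket["completion"] += int(span.get("llm.tokens.completion", 0))
        bucket["total"] += int(span.get("llm.tokens.total", 0))
    return dict(totals)


# Made-up sample records for demonstration only.
spans = [
    {"chain": "summarize", "llm.model": "model-a", "llm.prompt_version": "v1",
     "llm.tokens.prompt": 100, "llm.tokens.completion": 20, "llm.tokens.total": 120},
    {"chain": "summarize", "llm.model": "model-a", "llm.prompt_version": "v1",
     "llm.tokens.prompt": 50, "llm.tokens.completion": 10, "llm.tokens.total": 60},
    {"chain": "route", "llm.model": "model-a",
     "llm.tokens.prompt": 10, "llm.tokens.completion": 2, "llm.tokens.total": 12},
]
totals = aggregate_tokens(spans)
```

Multiplying these totals by your provider's per-token prices is deliberately left out, since prices vary by model and change over time.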

Latency at each layer. The callback system gives you start and end events at three levels: LLM, tool, and chain. Measure latency at all three. A slow chain with fast individual LLM calls indicates tool call overhead or inter-step processing. A slow LLM call with normal completion token count indicates model API latency, not your code.
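A rough way to quantify the "slow chain, fast LLM calls" case is to subtract the latency of a span's direct children from its own latency. The sketch below does this over hypothetical SpanRecord rows; it assumes children run sequentially, so for chains that fan out in parallel the result understates the overhead and can even go negative.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    """Hypothetical exported span row; field names are illustrative."""

    run_id: str
    parent_run_id: Optional[str]
    name: str
    latency_ms: float


def unattributed_ms(spans: list[SpanRecord]) -> dict[str, float]:
    """For each span that has children: own latency minus its direct children's latency."""
    child_total: dict[str, float] = defaultdict(float)
    for span in spans:
        if span.parent_run_id is not None:
            child_total[span.parent_run_id] += span.latency_ms
    return {
        span.name: round(span.latency_ms - child_total[span.run_id], 2)
        for span in spans
        if span.run_id in child_total
    }


# Made-up numbers for demonstration only.
records = [
    SpanRecord("r1", None, "chain.qa", 1200.0),
    SpanRecord("r2", "r1", "llm.call", 300.0),
    SpanRecord("r3", "r1", "tool.search", 250.0),
]
overhead = unattributed_ms(records)
```

Here the chain spends 650 ms outside its LLM and tool calls, which points at inter-step processing rather than the model API.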

Prompt version. When you change a prompt template, you need to know whether the change improved or degraded output quality. Attaching a version string to every LLM span, either passed per run through the metadata dict on RunnableConfig or set once in the handler config, lets you filter traces by version in any OTel-compatible backend without a dedicated prompt registry. For the full treatment of prompt drift detection and the metrics that predict quality degradation before it reaches users, see prompt observability: versioning, drift detection, and the metrics that matter.

Chain execution path. For agent systems, the parent_run_id chain reconstructs which chains were executed, in what order, and how long each took. This is the trace you need when a user reports an unexpected response and you need to understand the execution path that produced it. The AI observability data model covers what queries this data supports.

Tool call payloads (with caution). Tool inputs and outputs are high-value debugging data, but they frequently contain PII or sensitive business data. The implementation above truncates at 500 characters. In regulated environments, consider hashing or redacting before attaching to spans — the callback gives you the full string, but you control what goes into the span attribute.
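The redact-or-hash step can be sketched with the standard library alone. The helper names below (redact_emails, fingerprint, safe_tool_attrs) are hypothetical, and the email pattern is intentionally simple; a regulated environment would need a reviewed redaction policy rather than a single regex.

```python
from __future__ import annotations

import hashlib
import re

# Deliberately simple pattern for illustration; it will not catch every address form.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[email]", text)


def fingerprint(text: str, salt: str = "") -> str:
    """Short, stable hash so identical payloads can be correlated without storing them."""
    return hashlib.sha256((salt + text).encode("utf-8")).hexdigest()[:16]


def safe_tool_attrs(raw_input: str, max_chars: int = 500) -> dict[str, str | int]:
    """Build span attributes from a tool input: redact first, then truncate."""
    return {
        "tool.input": redact_emails(raw_input)[:max_chars],
        "tool.input_sha256_16": fingerprint(raw_input),
        "tool.input_len": len(raw_input),
    }


raw = "contact alice@example.com now"
attrs = safe_tool_attrs(raw)
```

Redacting before truncating matters: truncating first can cut an address in half and leave a fragment the pattern no longer matches.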

Warning: Do not capture full prompt text in spans that export to a third-party OTel collector without auditing the prompt content first. Prompts often contain user-supplied data, retrieved documents, or intermediate agent state that may include PII. Setting capture_prompts=False in TraceConfig disables prompt capture entirely. For compliance environments, set it to False and log prompts separately to a store with appropriate access controls.

Attaching Per-Request Context

Handlers attached at construction time see every call made through that object, but production systems need per-request context: user ID, session ID, tenant ID, request source. The right pattern is to create a request-scoped handler instance and pass it through the callbacks key of the run config:

from langchain_core.runnables.config import RunnableConfig

def handle_request(user_id: str, query: str, chain) -> str:
    # One handler per request keeps the run_id -> span map request-scoped.
    # Constructing it also sets up an exporter pipeline, so on a hot path
    # consider building that once and sharing it across requests.
    request_config = TraceConfig(
        prompt_version="v2.1.0",
        service_name="langchain-agent",
        otlp_endpoint="http://otel-collector:4317",
    )
    handler = OTelCallbackHandler(request_config)

    # user_id travels in the run metadata; hooks can read it from
    # kwargs["metadata"] and attach it to the span.
    result = chain.invoke(
        {"query": query},
        config=RunnableConfig(callbacks=[handler], metadata={"user_id": user_id}),
    )
    return result

The metadata dict on RunnableConfig is passed through to callback hooks via kwargs["metadata"]. You can read it inside on_chain_start or on_llm_start and attach it to the span. This keeps user context in the trace without requiring a global context variable.
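A minimal sketch of that read step follows. The helper name metadata_attrs, the allow-list, and the "request." attribute prefix are assumptions for illustration; the only thing taken from the text above is that the hook's kwargs carry a "metadata" dict. Values are coerced to primitives because span attributes only accept strings, booleans, and numbers (or sequences of them).

```python
from __future__ import annotations

from typing import Any, Iterable

_PRIMITIVES = (str, bool, int, float)


def metadata_attrs(
    hook_kwargs: dict[str, Any],
    allowed: Iterable[str] = ("user_id", "session_id", "tenant_id"),
) -> dict[str, Any]:
    """Pick allow-listed keys out of kwargs["metadata"] as span attributes."""
    metadata = hook_kwargs.get("metadata") or {}
    attrs: dict[str, Any] = {}
    for key in allowed:
        value = metadata.get(key)
        if value is None:
            continue
        # Anything that is not a primitive is stringified rather than dropped.
        attrs[f"request.{key}"] = value if isinstance(value, _PRIMITIVES) else str(value)
    return attrs


# Inside a hook such as on_chain_start(self, serialized, inputs, *, run_id, **kwargs)
# you would call: span.set_attributes(metadata_attrs(kwargs))
example = metadata_attrs({"metadata": {"user_id": "u-1", "debug": True, "tenant_id": 42}})
```

The allow-list is the important design choice: metadata dicts tend to accumulate fields over time, and only the keys you have reviewed should reach the trace backend.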

Avoiding Vendor Lock-In in Practice

The handler above has zero vendor-specific imports. It depends on opentelemetry-sdk and opentelemetry-exporter-otlp, both part of the OpenTelemetry project hosted by the CNCF. The exporter endpoint is configuration: you point it at an OpenTelemetry Collector or at any backend that accepts OTLP, such as Jaeger, Grafana Tempo, or Honeycomb, and a collector can forward to stores like ClickHouse. Moving from one compatible backend to another should be a configuration change rather than a handler rewrite.

Contrast this with LangChainTracer from the LangSmith SDK: it imports from langsmith, writes to LangSmith’s ingestion endpoint, uses LangSmith’s trace schema, and requires a LangSmith API key. If LangSmith changes their SDK interface, raises prices, or has an outage, your observability pipeline breaks. The LangSmith SDK version pins become a dependency constraint across your team.

This is not an argument against LangSmith during development. The UI is genuinely useful for interactive debugging. The argument is that LangSmith should be an optional consumer of your trace data, not the only store. The pattern: emit OTel spans from your custom handler to a collector, configure a forwarder or exporter that also sends to LangSmith if needed. You get both the vendor’s UI and full data ownership.

The surviving LangChain version upgrades post documents how vendor SDK dependencies compound during framework upgrades — the callback interface itself is stable across versions, but third-party handler implementations frequently break on minor LangChain releases.

Performance Impact of Callbacks

The callback invocation itself is an in-process Python method call. What matters is the handler body. The three patterns that add meaningful latency:

Synchronous network calls in the handler. Calling a remote API, flushing to a database, or pushing to a queue synchronously in the handler body adds that call’s latency to the chain’s critical path. Use BatchSpanProcessor for OTel, or push to an in-process queue and drain it in a background thread.
String operations on large payloads. If you are logging 32K-token prompts, the string operations inside the handler can be measurable. The max_prompt_chars truncation in TraceConfig controls this. For context window economics at scale, see the context window economics post.
Span dict operations on high-frequency chains. If a chain invokes hundreds of sub-chains in rapid succession (e.g., batch processing), the _spans dict grows and shrinks rapidly. The dict operations are O(1) but at sufficient volume the garbage collector pressure is measurable. In extreme cases, profile before assuming the handler is the bottleneck.
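For sinks that are not OTel, the "in-process queue drained by a background thread" pattern from the first item can be sketched as follows. BackgroundSink is a hypothetical class built on the standard library; the write callable stands in for whatever slow operation (database insert, HTTP post) you want to keep off the chain's critical path. It assumes dropping records under sustained overload is preferable to blocking the chain.

```python
from __future__ import annotations

import queue
import threading
from typing import Any, Callable


class BackgroundSink:
    """Accepts records without blocking and writes them from a worker thread."""

    def __init__(self, write: Callable[[dict[str, Any]], None], max_size: int = 10_000) -> None:
        self._write = write
        self._queue: queue.Queue = queue.Queue(maxsize=max_size)
        self.dropped = 0
        self.errors = 0
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def submit(self, record: dict[str, Any]) -> None:
        """Call from a callback hook; drops the record instead of blocking when full."""
        try:
            self._queue.put_nowait(record)
        except queue.Full:
            self.dropped += 1

    def _drain(self) -> None:
        while True:
            record = self._queue.get()
            if record is None:  # sentinel pushed by close()
                break
            try:
                self._write(record)
            except Exception:
                # A failing sink must not kill the drain thread; count and continue.
                self.errors += 1

    def close(self, timeout: float = 5.0) -> None:
        """Flush everything queued so far, then stop the worker."""
        self._queue.put(None)
        self._thread.join(timeout)


written: list[dict[str, Any]] = []
sink = BackgroundSink(written.append)
for i in range(3):
    sink.submit({"event": "llm_end", "seq": i})
sink.close()
```

Monitoring the dropped and errors counters tells you when the queue bound or the downstream store needs attention, without ever adding the store's latency to a request.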

In practice, with BatchSpanProcessor and prompt truncation configured, the callback work is usually limited to local method execution and a queue append. For most agent architectures, remote model latency, tool latency, and payload size dominate the callback overhead.

For debugging failures in agent systems where the trace alone is insufficient, the debugging CrewAI agent failures post covers complementary techniques for reconstructing what happened when the trace is incomplete.

Checklist Before Putting This in Production

Verify BatchSpanProcessor is configured — confirm no SimpleSpanProcessor in the exporter chain
Set capture_prompts=False, or set max_prompt_chars to a deliberate truncation limit, before enabling prompt logging in production
Confirm prompt version strings are attached to all LLM spans — verify by querying for spans without the llm.prompt_version attribute
Test error paths: introduce a deliberate LLM error and verify the span closes with ERROR status and the exception is recorded
Confirm the _spans dict does not grow unboundedly — add monitoring on its length in high-throughput systems
Validate parent_run_id linkage by running a multi-step chain and confirming the trace reconstructs correctly in your OTel backend
Review what data flows through tool spans for PII before enabling tool output logging in environments with user-supplied data

What This Gives You That Vendor Handlers Do Not

A custom handler gives you three things a vendor handler cannot:

First, full control over the data pipeline. You decide what gets captured, what gets redacted, where it goes, and how long it is retained. If your data governance policy requires that prompts containing user data never leave your VPC, you enforce that in the handler body.

Second, composability. You can run multiple handlers simultaneously — one that writes to OTel, one that writes latency metrics to Prometheus, one that publishes to an internal audit log. Vendor handlers are designed for one destination.

Third, a cleaner upgrade boundary across the LangChain release cycle. The BaseCallbackHandler interface is a narrower dependency than a full vendor handler SDK. You still need version-pin tests around callback signatures, but your instrumentation layer is easier to adapt when it only depends on the framework hooks you use directly.

The callback architecture is the right layer to instrument. It fires on the semantically meaningful events in LangChain’s execution model — not at the HTTP request level, not at the Python function call level, but at the points where an LLM was called, a tool was used, a chain completed. That is the data model you need to answer the questions that matter in production.

What is the LangChain callback system and how does it work?

LangChain's callback system is an event hook architecture that fires on every significant event in a chain or agent run: LLM start, LLM end, tool start, tool end, chain start, chain end, and on errors at each level. Callback handlers implement these hooks as Python methods, and LangChain invokes them automatically. You can attach handlers at construction time (via the callbacks argument on the object) or per run (via the callbacks key of the run config). Multiple handlers can be active simultaneously, so you can ship to OpenTelemetry and a local log store at the same time.

Does building a custom callback handler meaningfully impact production latency?

It depends on what the handler does. The callback invocation itself is an in-process Python method call; the meaningful cost usually comes from the handler body. If you synchronously flush a span to a remote collector on every LLM end event, you add that network round-trip to the chain's critical path. The correct pattern is to use background threads or async exporters so the handler returns immediately. With a BatchSpanProcessor configured correctly, the overhead is usually limited to local method work and a queue append rather than remote I/O.

Can LangChain callbacks capture prompt content, or only metadata?

Callback hooks expose the payloads LangChain passes into each event. The on_llm_start hook receives the list of prompts sent to the model. The on_llm_end hook receives the LLMResult object, which can include generated text, provider-reported token usage, and model metadata. The on_chat_model_start hook receives the list of message objects. You can log, hash, truncate, or redact this content inside the handler, subject to what each model wrapper and provider response makes available.

How do you version prompts in a LangChain callback system without a third-party platform?

Attach prompt version metadata as a tag or attribute on the span. The standard pattern is to define a PromptVersion enum or dataclass in your codebase, pass the version identifier through the metadata dict on the run config when invoking the chain, and read it inside on_llm_start to attach it to the span. When you change a prompt template, increment the version. This lets you filter traces by prompt version in any OTel-compatible backend (Jaeger, Grafana Tempo, Honeycomb, your own ClickHouse table) without depending on a vendor's prompt registry.
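A sketch of the dataclass variant is below. The PromptVersion fields, the metadata key names ("prompt_name", "prompt_version"), and the helper version_from_hook_kwargs are all hypothetical choices for illustration; the only assumption about LangChain is that run metadata reaches the hook in kwargs["metadata"].

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class PromptVersion:
    """Version identifier kept next to the prompt template in your codebase."""

    name: str
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"v{self.major}.{self.minor}.{self.patch}"

    def as_metadata(self) -> dict[str, str]:
        """Dict to merge into the run's metadata when invoking the chain."""
        return {"prompt_name": self.name, "prompt_version": str(self)}


def version_from_hook_kwargs(
    hook_kwargs: dict[str, Any], default: Optional[str] = None
) -> Optional[str]:
    """Read the version inside a hook; falls back to a handler-level default."""
    metadata = hook_kwargs.get("metadata") or {}
    return metadata.get("prompt_version", default)


SUMMARIZE_PROMPT = PromptVersion(name="summarize", major=2, minor=1, patch=0)

# At the call site: chain.invoke(inputs, config={"metadata": SUMMARIZE_PROMPT.as_metadata()})
# Inside on_llm_start: version = version_from_hook_kwargs(kwargs, default=self.config.prompt_version)
seen = version_from_hook_kwargs({"metadata": SUMMARIZE_PROMPT.as_metadata()})
```

Keeping the version object beside the template means a prompt change and its version bump land in the same commit, which is what makes the span attribute trustworthy later.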

The decision rule

If you are building LangChain-based agents for production, instrument custom tracing, token cost attribution, prompt versioning, and vendor-independent trace storage before the system becomes hard to inspect. The Enterprise Agentic Assessment Kit includes an observability instrumentation checklist alongside production-readiness criteria for callback architecture.

Igor Bobriakov

Related Articles

Surviving LangChain Version Upgrades: Migration Patterns for Production Systems

HITL Engineering Patterns: Implementing LangGraph Interrupts for Production Approval Workflows

Context Engineering for Production Agents: The Discipline Replacing Prompt Engineering