The Hidden Duplex Problem in Realtime Voice Agents

2026-05-26 · 7 min read · Igor Bobriakov

A voice agent that speaks still needs to listen.

That sounds obvious until the first live test.

The agent starts answering. A person interrupts. Another person adds context. Someone says “stop.” Someone else changes the question while the agent is halfway through its response.

Now the system has a harder problem than speech generation. It needs a duplex policy: how to listen while speaking, when to yield, when to stop, when to ignore noise, and when to hand the floor back to a person.

Without that policy, voice agents become socially awkward at best and operationally risky at worst.

The Demo Hides The Duplex Problem

Most demos are polite. The user asks. The agent answers. The user waits.

Meetings are not that tidy.

People interrupt because the answer is wrong, because they already got what they needed, because a more urgent point arrived, or because the agent should not be answering at all.

If the system treats speaking as a locked state, it misses those signals. If it treats every interruption as a new command, it becomes chaotic.

The system needs to decide what kind of interruption it is hearing.

Interruption Is Not One Behavior

Interruption can mean several different things:

  • stop talking
  • correct the previous context
  • ask a follow-up
  • change topic
  • revoke consent
  • hand the question to a human
  • add a note without speaking

Those cases should not route through the same handler.

A business voice agent needs a policy layer that maps interruption types to allowed actions.

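from typing import Literal

from pydantic import BaseModel


class InterruptionPolicy(BaseModel):
    interruption_type: Literal["stop", "correction", "follow_up", "topic_shift", "opt_out", "human_handoff"]
    stop_speech: bool      # halt audio output immediately
    update_artifact: bool  # record the interruption in the artifact
    answer_allowed: bool   # the agent may respond on its own
    requires_human: bool   # a human owner must take over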

The important part is not the specific schema. It is the decision to make interruption behavior explicit.
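One way to make that explicit is a lookup table from interruption type to allowed actions. The sketch below is an illustration, not a recommended policy: it reuses the field names from the schema above, swaps in a stdlib dataclass so it runs without dependencies, and every boolean value, the `FALLBACK` rule, and the `policy_for` helper are assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Policy:
    stop_speech: bool
    update_artifact: bool
    answer_allowed: bool
    requires_human: bool


# Hypothetical values, chosen only to show that each type gets its own rule.
POLICY_TABLE: Dict[str, Policy] = {
    "stop":          Policy(stop_speech=True, update_artifact=False, answer_allowed=False, requires_human=False),
    "correction":    Policy(stop_speech=True, update_artifact=True,  answer_allowed=True,  requires_human=False),
    "follow_up":     Policy(stop_speech=True, update_artifact=False, answer_allowed=True,  requires_human=False),
    "topic_shift":   Policy(stop_speech=True, update_artifact=True,  answer_allowed=True,  requires_human=False),
    "opt_out":       Policy(stop_speech=True, update_artifact=True,  answer_allowed=False, requires_human=True),
    "human_handoff": Policy(stop_speech=True, update_artifact=True,  answer_allowed=False, requires_human=True),
}

# Assumed fallback: an unrecognized interruption gets the most conservative rule.
FALLBACK = Policy(stop_speech=True, update_artifact=True, answer_allowed=False, requires_human=True)


def policy_for(interruption_type: str) -> Policy:
    """Look up the rule for an interruption type, falling back to the strict one."""
    return POLICY_TABLE.get(interruption_type, FALLBACK)
```

The point of the fallback is that an interruption the classifier cannot name still stops speech and goes to a person, rather than being answered by default.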

Yield Rules Create Trust

The agent should not compete for the floor.

In human meetings, yielding is a social signal. It shows that the speaker understands the room. For voice agents, yielding is also a safety signal.

The system should yield when:

  • a participant starts speaking over it
  • the request touches a human-owned decision
  • context is incomplete
  • the agent is asked to stop
  • the conversation moves away from the agent’s task

A voice agent that yields cleanly feels controlled. An agent that keeps talking feels like a liability.
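The yield conditions above can be written as an ordered check instead of being left to the model's judgment. This is a minimal sketch: `TurnSignals`, its field names, and the priority order are hypothetical, and how each signal gets detected (voice activity detection, intent classification) is out of scope here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnSignals:
    stop_requested: bool = False        # the agent was asked to stop
    participant_speaking: bool = False  # someone is talking over the agent
    human_owned_decision: bool = False  # the request touches a human-owned decision
    context_incomplete: bool = False    # the agent lacks context to answer
    off_task: bool = False              # the conversation left the agent's task


def yield_reason(signals: TurnSignals) -> Optional[str]:
    """Return the first matching yield reason, or None if the agent may keep the floor."""
    # Assumed priority: explicit stop requests win over everything else.
    ordered_checks = [
        ("stop_requested", signals.stop_requested),
        ("participant_speaking", signals.participant_speaking),
        ("human_owned_decision", signals.human_owned_decision),
        ("context_incomplete", signals.context_incomplete),
        ("off_task", signals.off_task),
    ]
    for reason, triggered in ordered_checks:
        if triggered:
            return reason
    return None
```

Returning a named reason rather than a bare boolean means the same value can be written to the trace, so a reviewer can see why the agent gave up the floor.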

Human-Owned Decisions Need Harder Stops

The duplex problem becomes more serious when the interruption is about authority.

If someone asks the agent to confirm a price, approve scope, interpret contract language, or commit to delivery timing, the system should not improvise.

The correct response may be:

  • stop speaking
  • write the question into the artifact
  • identify the human owner
  • ask that owner to answer
  • mark the point for follow-up
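Those steps can be sketched as a single handler. Everything named here is hypothetical: the topic keys, the owner roles in `DECISION_OWNERS`, and the `Artifact` shape are placeholders for whatever the real deployment uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical mapping from human-owned topics to the role that owns them.
DECISION_OWNERS: Dict[str, str] = {
    "pricing": "account_executive",
    "scope": "delivery_lead",
    "contract": "legal_counsel",
    "delivery_timing": "delivery_lead",
}


@dataclass
class Artifact:
    open_questions: List[dict] = field(default_factory=list)


@dataclass
class HandoffResult:
    stop_speech: bool
    owner: Optional[str]
    prompt_to_owner: Optional[str]


def handle_authority_question(topic: str, question: str, artifact: Artifact) -> HandoffResult:
    """Stop, record the question, name the owner, and mark it for follow-up."""
    owner = DECISION_OWNERS.get(topic)
    if owner is None:
        # Not a human-owned topic: nothing to hand off.
        return HandoffResult(stop_speech=False, owner=None, prompt_to_owner=None)
    artifact.open_questions.append(
        {"topic": topic, "question": question, "owner": owner, "status": "needs_follow_up"}
    )
    return HandoffResult(
        stop_speech=True,
        owner=owner,
        prompt_to_owner=f"This question is for {owner}: {question}",
    )
```

The handler never produces an answer of its own for an owned topic. The only text it emits is the request addressed to the owner.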

This is why voice-agent readiness belongs next to AI agent permission design, not just next to speech model selection.

Duplex Behavior Needs Observability

If the agent behaves poorly, the team needs to know why.

Was it speaking? Did it detect interruption? Did it classify the interruption correctly? Did it stop output? Did it update the artifact? Did it miss an opt-out command?

Those events should be logged as first-class system events, not discovered by watching a recording and guessing.
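A minimal sketch of what such an event could look like, assuming one JSON line per control decision. The `DuplexEvent` fields mirror the questions above. The class names, the in-memory log, and the `missed_stops` query are illustrative assumptions, not an existing logging API.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class DuplexEvent:
    call_id: str
    agent_was_speaking: bool
    interruption_detected: bool
    interruption_type: Optional[str]  # None if nothing was classified
    speech_stopped: bool
    artifact_updated: bool
    timestamp: float = field(default_factory=time.time)


class EventLog:
    """In-memory stand-in for a real log sink; stores one JSON line per event."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def record(self, event: DuplexEvent) -> None:
        self.lines.append(json.dumps(asdict(event), sort_keys=True))

    def missed_stops(self) -> List[dict]:
        """Events where a stop or opt-out was classified but speech kept going."""
        missed = []
        for line in self.lines:
            event = json.loads(line)
            if event["interruption_type"] in ("stop", "opt_out") and not event["speech_stopped"]:
                missed.append(event)
        return missed
```

With events in this shape, "did it miss an opt-out command?" becomes a query over the log instead of a guess from a recording.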

The same logic applies to broader agent observability for production audits: if the system can affect a workflow, its control decisions need traces.

What To Test

A real duplex test should include:

  • interrupting the agent mid-answer
  • correcting a fact while it speaks
  • asking an unrelated follow-up
  • asking it to stop
  • asking it to leave the call
  • changing from a safe topic to a human-owned decision
  • resuming after interruption without losing the artifact thread

One smooth demo call does not prove this behavior. The system needs repeatable tests that create pressure on the turn-state layer.
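As an illustration of what "repeatable" can mean, the sketch below drives a toy turn-state machine through three of the scenarios listed above. The states, method names, and transition rules are assumptions made for the example. A real harness would replay audio or events against the actual system.

```python
from enum import Enum
from typing import List


class TurnState(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"
    YIELDED = "yielded"
    LEFT_CALL = "left_call"


class TurnStateMachine:
    def __init__(self) -> None:
        self.state = TurnState.LISTENING
        self.artifact_notes: List[str] = []

    def start_answer(self) -> None:
        # Assumed rule: the agent may only start speaking from LISTENING.
        if self.state is not TurnState.LISTENING:
            raise RuntimeError(f"cannot speak from state {self.state.value}")
        self.state = TurnState.SPEAKING

    def interrupt(self, kind: str, note: str = "") -> None:
        if self.state is TurnState.LEFT_CALL:
            return
        if kind == "opt_out":
            self.state = TurnState.LEFT_CALL
            return
        if note:
            self.artifact_notes.append(note)
        self.state = TurnState.YIELDED

    def resume(self) -> None:
        if self.state is TurnState.YIELDED:
            self.state = TurnState.LISTENING


def check_interrupt_mid_answer() -> None:
    m = TurnStateMachine()
    m.start_answer()
    m.interrupt("stop")
    assert m.state is TurnState.YIELDED


def check_resume_keeps_artifact_thread() -> None:
    m = TurnStateMachine()
    m.start_answer()
    m.interrupt("correction", note="Budget is 40k, not 30k")
    m.resume()
    m.start_answer()
    assert m.artifact_notes == ["Budget is 40k, not 30k"]
    assert m.state is TurnState.SPEAKING


def check_leave_call_is_final() -> None:
    m = TurnStateMachine()
    m.start_answer()
    m.interrupt("opt_out")
    try:
        m.start_answer()
    except RuntimeError:
        return
    raise AssertionError("agent spoke after leaving the call")


ALL_CHECKS = [check_interrupt_mid_answer, check_resume_keeps_artifact_thread, check_leave_call_is_final]


def run_all() -> int:
    """Run every check; returns the number of checks that passed."""
    for check in ALL_CHECKS:
        check()
    return len(ALL_CHECKS)
```

Each check is a scripted sequence, so it can run on every build rather than depending on someone remembering to interrupt the agent during a demo.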

The Architecture Smell

The warning sign is simple: if duplex behavior is described only in the prompt, the architecture is probably too soft.

The agent needs policy outside the prompt:

  • stop conditions
  • interruption classes
  • allowed response modes
  • opt-out handling
  • human-handoff rules
  • artifact update rules

That is what makes the system governable.
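One hedged way to keep that policy outside the prompt is a versioned config file that is validated at startup. The loader below assumes a JSON object keyed by interruption class, with the four boolean fields used earlier in this article. The file format and the `load_policy` name are assumptions for illustration.

```python
import json
from typing import Dict

REQUIRED_CLASSES = {"stop", "correction", "follow_up", "topic_shift", "opt_out", "human_handoff"}
REQUIRED_FIELDS = {"stop_speech", "update_artifact", "answer_allowed", "requires_human"}


def load_policy(raw: str) -> Dict[str, Dict[str, bool]]:
    """Parse a JSON policy and refuse to start if any class or field is missing."""
    policy = json.loads(raw)
    if not isinstance(policy, dict):
        raise ValueError("policy must be a JSON object keyed by interruption class")
    missing = REQUIRED_CLASSES - set(policy)
    if missing:
        raise ValueError(f"no rule for interruption classes: {sorted(missing)}")
    for name, rule in policy.items():
        if not isinstance(rule, dict) or set(rule) != REQUIRED_FIELDS:
            raise ValueError(f"rule '{name}' must define exactly: {sorted(REQUIRED_FIELDS)}")
        if not all(isinstance(value, bool) for value in rule.values()):
            raise ValueError(f"rule '{name}' must use boolean values")
    return policy
```

Because the policy is data, a change to opt-out handling shows up in a diff and a review, which a prompt edit does not guarantee.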

The Decision Rule

Do not test a realtime voice agent only on clean turn-taking. Test interruption policy, yield rules, opt-out handling, and reviewable traces before real calls. Duplex behavior is where a smooth demo becomes an operational system or a liability.

About the author

Igor Bobriakov

AI Architect. Author of Production-Ready AI Agents. 15 years deploying production AI platforms and agentic systems for enterprise clients and deep-tech startups.