Build a Real-Time Transcription Pipeline with TEN
Need a quick way to turn live speech into readable transcripts? The TEN transcription example pairs Agora RTC, a speech-to-text add-on, and a lightweight LLM polish pass to deliver clean captions in a single flow. This post breaks down the moving parts and shows how to run—and extend—the sample yourself.
Why TEN? The runtime abstracts graph orchestration, so you can swap providers or add post-processing without rewriting your whole stack.
Architecture at a Glance
- Agora ingress captures microphone audio and pushes PCM frames and stream-message data into the graph.
- Stream ID adapter normalises Agora payload IDs so downstream nodes stay in sync with the UI.
- Speech-to-Text (STT) converts audio frames into partial and final transcripts.
- Main control fans out ASR events, queues LLM prompts for optional clean-up, and forwards results to the collector.
- Message collector batches payloads and emits them over Agora's data channel.
- Next.js UI starts the graph, joins the channel, and renders raw vs polished text in real time.
Alongside the runtime graph you also get:
- ten_packages/extension/main_python: the orchestration extension that implements the transcription flow.
- web/: a minimal Next.js client that starts the graph, joins Agora, and displays transcripts.
- manifest.json: the app manifest that pins runtime and extension versions.
Graph Wiring
The graph is declared in ai_agents/agents/examples/transcription/property.json
and wires the whole loop together. A trimmed view highlights the core nodes:
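The snippet below is an illustrative trim, not a copy of the shipped file: node names follow the prose above, and addon identifiers such as deepgram_asr_python are assumptions you should check against the real property.json.

```json
{
  "nodes": [
    { "type": "extension", "name": "agora_rtc", "addon": "agora_rtc" },
    { "type": "extension", "name": "streamid_adapter", "addon": "streamid_adapter" },
    { "type": "extension", "name": "stt", "addon": "deepgram_asr_python" },
    {
      "type": "extension",
      "name": "main_control",
      "addon": "main_python",
      "property": { "enable_llm_correction": true }
    },
    { "type": "extension", "name": "message_collector2", "addon": "message_collector2" }
  ]
}
```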
Connections bind each stage: Agora PCM frames flow through the streamid_adapter into the STT node; the resulting asr_result data stream triggers main_control; and the message collector publishes concatenated transcripts back into Agora so the UI can subscribe without extra transports.
Inside main_control
The Python extension (ten_packages/extension/main_python/extension.py) is where ASR and LLM events are orchestrated. It registers handlers for each agent event, forwards partial transcripts immediately, and optionally calls the LLM when an utterance is final:
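The real handlers are registered through TEN's runtime APIs inside extension.py; the stand-alone sketch below strips that plumbing and keeps only the decision flow, so names like AsrResult and TranscriptionControl are illustrative.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AsrResult:
    text: str
    is_final: bool
    stream_id: int


class TranscriptionControl:
    """Illustrative stand-in for the main_python orchestration logic."""

    def __init__(self, send_transcript, llm_queue: asyncio.Queue,
                 enable_llm_correction: bool = True):
        self._send_transcript = send_transcript        # forwards text to the collector
        self._llm_queue = llm_queue                    # consumed by the LLM worker
        self._enable_llm_correction = enable_llm_correction

    async def on_asr_result(self, result: AsrResult) -> None:
        # Partial and final transcripts go out immediately so captions feel live.
        await self._send_transcript(result.text, role="user", final=result.is_final)

        # Only final utterances get the optional LLM clean-up pass.
        if result.is_final and self._enable_llm_correction:
            await self._llm_queue.put(result.text)
```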
When the LLM streams back, _send_transcript republishes the polished text under the assistant role. Setting enable_llm_correction to false in the graph properties skips the clean-up pass and surfaces raw ASR output only.
Event Queues and Backpressure
main_control delegates most orchestration to the lightweight Agent (agent/agent.py). Two asyncio queues guarantee ordered delivery and isolate ASR vs LLM work:
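A minimal sketch of that two-queue pattern follows; the class and method names are illustrative rather than the exact agent/agent.py API.

```python
import asyncio


class Agent:
    """Two queues: one for ASR/tool events, one for LLM work, each with its own consumer."""

    def __init__(self, handle_event, handle_llm):
        self._event_queue: asyncio.Queue = asyncio.Queue()   # ASR results, tool_register, ...
        self._llm_queue: asyncio.Queue = asyncio.Queue()     # prompts awaiting correction
        self._handle_event = handle_event
        self._handle_llm = handle_llm

    async def run(self) -> None:
        # Separate consumer tasks keep slow LLM calls from blocking new ASR events.
        await asyncio.gather(
            self._consume(self._event_queue, self._handle_event),
            self._consume(self._llm_queue, self._handle_llm),
        )

    async def _consume(self, queue: asyncio.Queue, handler) -> None:
        while True:
            item = await queue.get()
            await handler(item)
            queue.task_done()

    def flush_llm(self) -> None:
        # Drop stale prompts when a newer final transcript supersedes them.
        while not self._llm_queue.empty():
            self._llm_queue.get_nowait()
            self._llm_queue.task_done()
```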
Because LLM responses are processed in their own queue, the extension can cancel or flush in-flight prompts whenever a new final transcript arrives. Tool registration events (tool_register) also pass through the same dispatcher, so the graph can expose extra functions without touching the core logic.
LLM Correction Loop
agent/llm_exec.py wraps streaming chat completions and handles reasoning traces. Each transcript chunk is queued as a user message, and the class manages cancellation plus context updates:
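Here is a condensed sketch of that pattern using the OpenAI Python client; the class name, the reasoning_content delta field, and the system prompt are assumptions for illustration rather than the exact llm_exec.py code.

```python
import asyncio
from typing import Optional

from openai import AsyncOpenAI  # standard openai package


class LLMExec:
    """Condensed illustration of the streaming-correction loop."""

    def __init__(self, model: str = "gpt-4o-mini"):
        self._client = AsyncOpenAI()        # reads OPENAI_API_KEY from the environment
        self._model = model
        self._messages = [{"role": "system",
                           "content": "Clean up ASR transcripts without changing their meaning."}]
        self._task: Optional[asyncio.Task] = None

    def cancel(self) -> None:
        if self._task and not self._task.done():
            self._task.cancel()

    async def correct(self, transcript: str, on_delta, on_reasoning=None) -> None:
        self.cancel()                        # a newer final utterance supersedes older prompts
        self._messages.append({"role": "user", "content": transcript})
        self._task = asyncio.create_task(self._stream(on_delta, on_reasoning))
        try:
            await self._task
        except asyncio.CancelledError:
            pass                             # flushed by a newer transcript

    async def _stream(self, on_delta, on_reasoning) -> None:
        stream = await self._client.chat.completions.create(
            model=self._model, messages=self._messages, stream=True
        )
        chunks = []
        async for event in stream:
            delta = event.choices[0].delta
            # Some providers expose reasoning traces as a separate delta field.
            reasoning = getattr(delta, "reasoning_content", None)
            if reasoning and on_reasoning:
                await on_reasoning(reasoning)
            if delta.content:
                chunks.append(delta.content)
                await on_delta(delta.content)
        # Keep the polished output in context for the next turn.
        self._messages.append({"role": "assistant", "content": "".join(chunks)})
```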
Reasoning deltas stream back as a separate channel, letting the UI present chain-of-thought data or hide it depending on the audience. If your model already returns clean text, disable correction and the loop simply passes through raw ASR events.
Message Flow to the Browser
message_collector2 emits segmented payloads over Agora stream-message. The Next.js client (web/src/app/page.tsx) reconstructs them, decodes the base64 payload, and pushes messages into local state:
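The actual client is TypeScript, but the reassembly logic is small enough to sketch in Python; the "id|part|total|base64" wire format assumed here is illustrative, so check page.tsx for the exact framing.

```python
import base64
from collections import defaultdict

# Assumed wire format per stream message: "message_id|part_index|total_parts|base64_chunk".
# The real parsing lives in web/src/app/page.tsx; this sketch only shows the reassembly idea.
_pending: dict[str, dict[int, str]] = defaultdict(dict)


def on_stream_message(raw: bytes) -> str | None:
    msg_id, index, total, chunk = raw.decode("utf-8").split("|", 3)
    parts = _pending[msg_id]
    parts[int(index)] = chunk
    if len(parts) < int(total):
        return None                                   # still waiting for more segments
    _pending.pop(msg_id)
    joined = "".join(parts[i] for i in sorted(parts))
    return base64.b64decode(joined).decode("utf-8")   # JSON payload ready for the UI state
```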
Because everything flows through Agora, you can ship both audio and transcript data across the same connection, which is ideal when the client can't open an extra WebSocket connection just for captions.
Running the Example
- Configure credentials in the TEN repo root .env:
  - AGORA_APP_ID and AGORA_APP_CERTIFICATE (if tokens are required)
  - DEEPGRAM_API_KEY (or swap in another STT addon and its credentials)
  - OPENAI_API_KEY, plus optional OPENAI_MODEL and OPENAI_PROXY_URL
- Select the transcription example as the active agent.
- Start the runtime and leave it running.
- Launch the web UI.
- Visit http://localhost:3000, click Start, allow microphone access, and watch transcripts stream in.
When the graph spins up it stays idle until the Agora client joins, so you’re only billed for STT/LLM usage while someone is actively publishing audio.
Where to Take It Next
- Swap providers: point the STT node at Azure, Deepgram multilingual, or a self-hosted model—the graph contract stays intact.
- Add summarisation: queue an extra LLM node that occasionally summarises the conversation and stream it back as a different message type.
- Persist transcripts: mirror the collector payloads into a database or send them to a webhook alongside Agora stream metadata.
- Expose tools: register custom LLM tools (via tool_register) so corrections can call domain-specific APIs before presenting text.
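To make the persist-transcripts idea above concrete, here is a minimal sketch that mirrors each payload to a webhook from the Python side; the MIRROR_WEBHOOK_URL variable and mirror_transcript helper are hypothetical, not part of the example.

```python
import asyncio
import json
import os
import urllib.request

# Hypothetical helper: call it wherever the extension already sends transcripts
# (e.g. alongside _send_transcript). MIRROR_WEBHOOK_URL is not part of the example.
WEBHOOK_URL = os.environ.get("MIRROR_WEBHOOK_URL", "")


def _post(payload: dict) -> None:
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5).close()


async def mirror_transcript(payload: dict) -> None:
    if not WEBHOOK_URL:
        return
    # Run the blocking HTTP call off the event loop so transcription latency is unaffected.
    await asyncio.to_thread(_post, payload)
```

Calling such a helper right next to the existing transcript-publishing step keeps the mirrored copy in step with what the UI sees.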
The transcription example is intentionally lean, but every piece is production ready: the runtime graph keeps latency predictable, the Python control extension deals with post-processing, and the browser client demonstrates a simple, vendor-agnostic UI surface. Use it as a template for live captioning, meeting notes, or any workflow where polished text needs to trail the spoken word by only a heartbeat.