Diarization is the art of answering a simple question: "who said what?"
In this post we will wire up TEN Framework, Speechmatics, and a lightweight Next.js UI so that every utterance carries both a speaker ID and its transcript.
## Architecture at a Glance
```mermaid
sequenceDiagram
    participant Browser
    participant TEN_Agent as TEN Agent Runtime
    participant Speechmatics
    Browser->>TEN_Agent: /api/agents/start (channel, user UID)
    TEN_Agent->>Speechmatics: Streaming audio (Agora RTC upstream)
    Speechmatics-->>TEN_Agent: Diarized transcript chunks
    TEN_Agent-->>Browser: Agora stream-message (chunk_id|part|total|payload)
    Browser->>Browser: Reassemble chunks → decode text → render speaker cards
```

Key pieces:
- TEN agent - orchestrates audio capture, Speechmatics API calls, and downstream formatting.
- Speechmatics diarization - returns transcripts with speaker labels (`{"speaker": "S1"}`-style metadata).
- Agora RTC - carries mic audio upstream and synthesized speech downstream.
- Next.js UI - listens to TEN `stream-message` events, reconstructs UTF-8 payloads, and surfaces speakers with avatars.
## Speechmatics Setup
You will need an API key and a diarization-enabled task.
```bash
# .env (server root)
SPEECHMATICS_API_KEY=sk_your_key
```

The diarization task in `graph.json` (simplified):
```json
{
  "type": "speechmatics_diarization",
  "config": {
    "language": "en",
    "diarization": {
      "max_speakers": 4
    }
  }
}
```

The TEN Speechmatics extension already handles authentication; you just need to expose the key.
## Agent Flow: From Audio to Diarized Text
### ASR Event Handler
Inside `llm_exec.py` (or the companion orchestrator) we listen for `ASRResultEvent`:
```python
@agent_event_handler(ASRResultEvent)
async def _on_asr_result(self, event: ASRResultEvent):
    speaker = event.metadata.get("speaker", "")
    text = event.text or ""
    final = event.final
    label = self._format_speaker_label(speaker)
    self.ten_env.log_info(f"[ASR] {label}: {text}")
    await self._send_transcript(
        role="assistant",
        text=f"[{label}] {text}",
        is_final=final,
        timestamp=int(time.time() * 1000),
    )
```

Speechmatics returns the diarization tag in `event.metadata` (`speaker`, `channel`); we convert it into consistent labels (for example, `[S1]` or `[Elliot]`).
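`_format_speaker_label` itself isn't shown here. A minimal sketch of the mapping it performs, written in TypeScript to match the later frontend examples (the name map and the fallbacks are hypothetical):

```ts
// Hypothetical mapping from raw Speechmatics tags to display names.
const speakerNames: Record<string, string> = { S1: "Elliot" };

function formatSpeakerLabel(rawTag: string): string {
  if (!rawTag) return "Unknown"; // no diarization metadata arrived
  return speakerNames[rawTag] ?? rawTag; // fall back to the raw tag, e.g. "S1"
}
```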
### Packaging Chunks for the Frontend
The payload is base64-encoded and split into chunks that stream over Agora data channels:
```python
chunk = f"{message_id}|{part_index}|{total_parts}|{payload_b64}"
await self._send_to_client(chunk)
```

This keeps large payloads arriving in ordered slices without breaking real-time delivery.
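The agent builds these slices in Python, but the wire format is easy to pin down in a sketch. Here it is in TypeScript for consistency with the rest of the examples; the helper name and the `maxLen` limit are illustrative, not Agora's documented maximum:

```ts
// Split a base64 payload into ordered `id|part|total|content` slices.
// maxLen bounds the content portion so each message stays small enough
// for the data channel (illustrative value).
function toChunks(messageId: string, payloadB64: string, maxLen = 900): string[] {
  const total = Math.max(1, Math.ceil(payloadB64.length / maxLen));
  const chunks: string[] = [];
  for (let part = 0; part < total; part++) {
    const content = payloadB64.slice(part * maxLen, (part + 1) * maxLen);
    chunks.push(`${messageId}|${part}|${total}|${content}`);
  }
  return chunks;
}
```

Because the header always has exactly three `|` separators, the receiver can parse it without any escaping inside the payload.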
## Frontend: Reassembling & Rendering
### Chunk Cache
In `src/app/page.tsx`, store each message's parts in a `Map` keyed by `part_index`; duplicate parts overwrite the existing entry:
```tsx
const cacheRef = useRef<Map<string, Map<number, TextChunk>>>(new Map());

// Inside the stream-message handler:
const cache = cacheRef.current;
const messageCache = cache.get(message_id) ?? new Map<number, TextChunk>();
cache.set(message_id, messageCache);

messageCache.set(part_index, chunk); // duplicate parts overwrite in place
if (messageCache.size === total_parts) {
  const payloadRaw = reconstructMessage(Array.from(messageCache.values()));
  const payload = JSON.parse(base64ToUtf8(payloadRaw));
  // ...
  cache.delete(message_id);
}
```

### Decoding the Payload
Each chunk contains the raw transcript plus diarization hints:
```ts
const decoded = streamDecoder.decode(stream);
const [messageId, partStr, totalStr, content] = parseHeader(decoded);
```

`parseHeader` locates the first three `|` characters, so payloads can safely include additional delimiters.
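`parseHeader`, `reconstructMessage`, and `base64ToUtf8` aren't shown in this post. A minimal sketch of all three, assuming a `TextChunk` carries its `part_index` and `content` (field names assumed):

```ts
type TextChunk = { part_index: number; content: string };

// Split on the first three '|' only, so the payload itself may contain '|'.
function parseHeader(decoded: string): [string, string, string, string] {
  const first = decoded.indexOf("|");
  const second = decoded.indexOf("|", first + 1);
  const third = decoded.indexOf("|", second + 1);
  if (first < 0 || second < 0 || third < 0) {
    throw new Error("Malformed chunk header");
  }
  return [
    decoded.slice(0, first),
    decoded.slice(first + 1, second),
    decoded.slice(second + 1, third),
    decoded.slice(third + 1),
  ];
}

// Join chunk contents in part order to recover the base64 payload.
function reconstructMessage(chunks: TextChunk[]): string {
  return [...chunks]
    .sort((a, b) => a.part_index - b.part_index)
    .map((c) => c.content)
    .join("");
}

// Decode base64 into UTF-8 text using browser APIs.
function base64ToUtf8(b64: string): string {
  const bytes = Uint8Array.from(atob(b64), (ch) => ch.charCodeAt(0));
  return new TextDecoder("utf-8").decode(bytes);
}
```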
### Speaker Extraction
Enforce the `[Speaker] Utterance` format and split it in the UI:
```ts
const extractSpeaker = (raw: string) => {
  const match = raw.match(/^\[([^\]]+)\]\s*/);
  return {
    speaker: match ? match[1] : undefined,
    text: match ? raw.slice(match[0].length) : raw,
  };
};
```

The UI then color-codes cards, highlights the active speaker, and displays "...listening" states while interim transcripts arrive.
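A quick check of how the helper behaves:

```ts
extractSpeaker("[S1] Good morning, everyone.");
// → { speaker: "S1", text: "Good morning, everyone." }

extractSpeaker("no label at all");
// → { speaker: undefined, text: "no label at all" }
```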
## Environment & Routing
Next.js middleware rewrites API calls to the TEN agent without losing query strings:
```ts
if (pathname.startsWith("/api/agents/")) {
  const target = new URL(AGENT_SERVER_URL);
  target.pathname = pathname.replace("/api/agents/", "/");
  target.search = req.nextUrl.search;
  return NextResponse.rewrite(target);
}
```

The frontend loads its configuration from `.env` (generated from `.env.example`) and uses `genUUID()` for request IDs to keep logs consistent.
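`genUUID()` isn't defined in the post; presumably it is little more than a wrapper over the platform UUID API. A sketch, with the query parameter name as an assumption:

```ts
// Hypothetical: stable request IDs make client and agent logs easy to correlate.
const genUUID = (): string => crypto.randomUUID();

fetch(`/api/agents/start?requestId=${genUUID()}`, { method: "POST" });
```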
## UX Touches
- Rolling gallery shows a 3D carousel of recognized speakers and updates their status to "Speaking" or "Waiting."
- Spotlight animations use CSS entry transitions to draw attention to new speaker cards.
- Active speaker tracking drives accent colors and keeps the latest speaker visible with `scrollIntoView`, as sketched below.
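A minimal sketch of that last touch, assuming each speaker card exposes a `data-speaker` attribute (the attribute and scroll options are hypothetical):

```ts
// Scroll the newest active speaker's card into view when it changes.
function focusActiveSpeaker(speaker: string): void {
  const card = document.querySelector<HTMLElement>(`[data-speaker="${speaker}"]`);
  card?.scrollIntoView({ behavior: "smooth", block: "nearest" });
}
```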
## Hardening Tips
| Concern | Mitigation |
|---|---|
| Duplicate chunks | `Map<number, chunk>` deduplicates by `part_index`. |
| Partial payload flush | The cache clears on stop, and messages are removed after full reconstruction. |
| Unknown speaker labels | `_normalize_label()` collapses variants (for example, "Musk" versus "mask"). |
| Query string loss in middleware | Rewrites use `URL` objects and preserve the search string. |
| Inconsistent API responses | `normalizeApiResponse()` centralizes the ok/code/msg/data logic for consistent error handling. |
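`normalizeApiResponse()` is only named here, not shown. A minimal sketch of the idea, with the envelope fields taken from the table above and everything else assumed:

```ts
// Hypothetical envelope returned by the agent's HTTP endpoints.
interface ApiResponse<T> {
  code: number | string;
  msg?: string;
  data?: T;
}

// Collapse the envelope into a single ok/error result so every caller
// shares the same error-handling path.
function normalizeApiResponse<T>(
  res: ApiResponse<T>
): { ok: true; data?: T } | { ok: false; error: string } {
  if (res.code === 0 || res.code === "0") {
    return { ok: true, data: res.data };
  }
  return { ok: false, error: res.msg ?? `request failed with code ${res.code}` };
}
```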
## Running Everything
```bash
# Backend: start the TEN agent
task use AGENT=speechmatics-diarization
task run

# Frontend: configure and run
cd agents/examples/speechmatics-diarization/frontend
cp .env.example .env
pnpm install
pnpm dev
```

Visit http://localhost:3000, click Start, and speak into the mic. Speechmatics will diarize the audio, the TEN agent will relay speaker labels, and the UI will show each person's words in context.
## Wrapping Up
With TEN, Speechmatics, and Agora working together, diarization becomes a straightforward pipeline:
- Capture real-time audio.
- Hand it off to a diarization-capable ASR service.
- Normalize speaker metadata.
- Stream structured chunks back to the browser.
- Reconstruct and surface those chunks beautifully.
The result is a responsive assistant that not only hears the conversation but also remembers who said what.