Realtime V2V / MLLM Agent (`extension.py`)

The file extension.py is the entry point of the realtime voice-to-voice / multimodal-LLM app. It consumes Server→Client (S2C) events (transcripts, tool calls, interrupts, user join/leave), normalizes them into typed events, and routes them to the agent. It also provides Client→Server (C2S) primitives to set context, send messages, return tool results, and trigger responses.

Quick File Layout

.
├── extension.py      → Main extension: routing + context + trigger + interrupt
├── config.py         → Runtime config (e.g., greeting)
├── helper.py         → Cmd/Data send helpers
└── agent/
    ├── agent.py      → Event queue, function call handling
    └── events.py     → Typed events (UserJoined, Transcripts, FunctionCall, Interrupt, etc.)

Audio data is not handled here. RTC audio is sent directly to MLLM via TEN graph connections. extension.py only reacts to the S2C transcript events generated by the MLLM server.

Architecture Overview

Event Routing

Unlike decorator-based handlers, this extension uses a single on_data method that dispatches events with a match/case statement.

async def on_data(self, ten_env: AsyncTenEnv, data: Data):
    event = parse_event(data)  # converted to a typed AgentEvent

    match event:
        case UserJoinedEvent():
            self._rtc_user_count += 1
            await self._greeting_if_ready()

        case UserLeftEvent():
            self._rtc_user_count -= 1

        case ToolRegisterEvent():
            await self.agent.register_tool(event.tool, event.source)

        case FunctionCallEvent():
            await self.agent.call_tool(event.call_id, event.function_name, event.arguments)

        case InputTranscriptEvent():
            self.current_metadata = {"session_id": event.metadata.get("session_id", "100")}
            self.session_ready = True
            await self._greeting_if_ready()

        case OutputTranscriptEvent():
            await self._send_transcript("assistant", event.text, event.is_final, event.stream_id)

        case ServerInterruptEvent():
            await self._interrupt()

        case _:
            self.ten_env.log_warn(f"[MainControlExtension] Unhandled event: {event}")

S2C events handled: UserJoinedEvent and UserLeftEvent track how many users are in the channel. InputTranscriptEvent carries the user's speech text, transcribed from RTC audio. OutputTranscriptEvent carries the assistant's response text, which the extension forwards as a transcript. ToolRegisterEvent registers a new tool. FunctionCallEvent carries a tool request from the model. ServerInterruptEvent stops ongoing output.

C2S Primitives (Sending to MLLM)

extension.py provides simple methods for sending instructions to the MLLM server:

Set context

async def _set_context_messages(self, messages: list[MLLMClientMessageItem]):
    await _send_data(
        self.ten_env,
        DATA_MLLM_IN_SET_MESSAGE_CONTEXT,
        "v2v",
        MLLMClientSetMessageContext(messages=messages).model_dump(),
    )

Send a message

async def _send_message_item(self, message: MLLMClientMessageItem):
    await _send_data(
        self.ten_env,
        DATA_MLLM_IN_SEND_MESSAGE_ITEM,
        "v2v",
        MLLMClientSendMessageItem(item=message).model_dump(),
    )

Trigger a response

async def _send_create_response(self):
    await _send_data(
        self.ten_env,
        DATA_MLLM_IN_CREATE_RESPONSE,
        "v2v",
        MLLMClientCreateResponse().model_dump(),
    )

Send function output

(Sent from agent.py after handling a FunctionCallEvent)

await _send_data(
    self.ten_env,
    DATA_MLLM_IN_FUNCTION_CALL_OUTPUT,
    "v2v",
    MLLMClientFunctionCallOutput(
        output=result,
        call_id=call_id,
    ).model_dump(),
)
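To see how these primitives combine, the sketch below strings three of them together in the same send-then-trigger order the greeting uses. It is an illustration under stated assumptions: `RecordingSender` replaces `_send_data` and only records calls, `MessageItem` models just the `role` and `content` fields of `MLLMClientMessageItem`, and the channel name strings are hypothetical placeholders for the real `DATA_MLLM_IN_*` constants, whose values are not shown in this document.

```python
import asyncio
from dataclasses import asdict, dataclass

# Hypothetical channel names standing in for the DATA_MLLM_IN_* constants.
SET_CONTEXT = "set_message_context"
SEND_ITEM = "send_message_item"
CREATE_RESPONSE = "create_response"


@dataclass
class MessageItem:
    """Stand-in for MLLMClientMessageItem (role and content only)."""

    role: str
    content: str


class RecordingSender:
    """Stand-in for _send_data: records (name, dest, payload) tuples."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str, dict]] = []

    async def send(self, name: str, dest: str, payload: dict) -> None:
        self.sent.append((name, dest, payload))


async def start_turn(
    sender: RecordingSender, history: list[MessageItem], prompt: str
) -> None:
    # 1. Seed the conversation context.
    await sender.send(
        SET_CONTEXT, "v2v", {"messages": [asdict(m) for m in history]}
    )
    # 2. Append the new user message.
    await sender.send(
        SEND_ITEM, "v2v", {"item": asdict(MessageItem("user", prompt))}
    )
    # 3. Explicitly ask the model to produce a response.
    await sender.send(CREATE_RESPONSE, "v2v", {})


sender = RecordingSender()
asyncio.run(
    start_turn(sender, [MessageItem("system", "Be brief.")], "What time is it?")
)
```

Recording the calls instead of sending them makes the ordering easy to check in a unit test before wiring the real helpers back in.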

Common Implementation Patterns

Greeting Recipe

The greeting is handled in _greeting_if_ready(), which sends the greeting when exactly one user is in the channel, a greeting is configured, and the session is ready:

async def _greeting_if_ready(self):
    if self._rtc_user_count == 1 and self.config.greeting and self.session_ready:
        await self._send_message_item(
            MLLMClientMessageItem(
                role="user",
                content=f"say {self.config.greeting} to me",
            )
        )
        await self._send_create_response()

This ensures the assistant greets the user automatically.
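Note that _greeting_if_ready() is called from both the UserJoinedEvent and InputTranscriptEvent branches, so its condition can hold on more than one call. If a repeated greeting is not wanted, a one-shot flag is one way to guard it. The following is a minimal sketch: `GreetingGate`, `should_greet`, and the `greeted` flag are hypothetical names that do not appear in the original code, and the real extension may already handle this elsewhere.

```python
from dataclasses import dataclass


@dataclass
class GreetingGate:
    """Hypothetical helper that decides whether to greet right now."""

    greeting: str
    user_count: int = 0
    session_ready: bool = False
    greeted: bool = False  # assumption: not part of the original extension

    def should_greet(self) -> bool:
        ready = (
            self.user_count == 1
            and bool(self.greeting)
            and self.session_ready
        )
        if ready and not self.greeted:
            self.greeted = True  # latch so later calls return False
            return True
        return False


gate = GreetingGate(greeting="Hello there")
gate.user_count += 1          # a user joins
first = gate.should_greet()   # session is not ready yet
gate.session_ready = True     # the session becomes ready
second = gate.should_greet()  # all conditions hold for the first time
third = gate.should_greet()   # conditions still hold, but already greeted
```

In the extension itself, the same idea would amount to checking and setting such a flag inside _greeting_if_ready() before sending the message item.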

Function Call Handling

1. S2C: a FunctionCallEvent arrives, meaning the model requests a tool.
2. Agent: executes the tool logic.
3. C2S: returns the result with DATA_MLLM_IN_FUNCTION_CALL_OUTPUT and the same call_id.

The model may continue its response afterwards.
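The round trip described above can be sketched without the framework. This is an illustration, not the code in agent.py: `ToolRunner`, its `outputs` list (standing in for the function call output channel), and the `get_weather` tool are hypothetical, and treating `arguments` as a JSON string is an assumption.

```python
import asyncio
import json
from typing import Awaitable, Callable

ToolFn = Callable[[dict], Awaitable[str]]


class ToolRunner:
    """Hypothetical stand-in for the agent's tool handling."""

    def __init__(self) -> None:
        self.tools: dict[str, ToolFn] = {}
        # Stands in for sending on the function call output channel.
        self.outputs: list[dict] = []

    def register_tool(self, name: str, fn: ToolFn) -> None:
        self.tools[name] = fn

    async def call_tool(
        self, call_id: str, function_name: str, arguments: str
    ) -> None:
        fn = self.tools.get(function_name)
        if fn is None:
            # Report unknown tools back instead of raising.
            result = json.dumps({"error": f"unknown tool: {function_name}"})
        else:
            # Assumption: arguments arrive as a JSON-encoded string.
            result = await fn(json.loads(arguments or "{}"))
        # Echo the call_id so the output can be matched to the request.
        self.outputs.append({"call_id": call_id, "output": result})


async def get_weather(args: dict) -> str:
    """Hypothetical tool used only for this example."""
    return json.dumps({"city": args["city"], "forecast": "sunny"})


async def demo() -> ToolRunner:
    runner = ToolRunner()
    runner.register_tool("get_weather", get_weather)
    await runner.call_tool("call_1", "get_weather", '{"city": "Paris"}')
    await runner.call_tool("call_2", "missing_tool", "{}")
    return runner


runner = asyncio.run(demo())
```

Returning an error payload for an unknown tool, rather than dropping the call, means every call_id still receives an output; whether the real agent does this is not stated in this document.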

Interruption

async def _interrupt(self):
    await _send_cmd(self.ten_env, "flush", "agora_rtc")

Triggered on ServerInterruptEvent. Stops RTC playback/streaming.

Supporting Files

agent.py queues events, executes tools, and sends function outputs. events.py defines the typed events: UserJoinedEvent, UserLeftEvent, InputTranscriptEvent, OutputTranscriptEvent, ToolRegisterEvent, FunctionCallEvent, ServerInterruptEvent. helper.py provides the _send_cmd and _send_data helpers. config.py holds runtime configuration such as the greeting. The graph (property.json) wires RTC audio → MLLM.

Events/API Summary

Direction	Event / Channel	Purpose
S2C	`UserJoinedEvent`	Track users; trigger greeting
S2C	`UserLeftEvent`	Track users leaving
S2C	`InputTranscriptEvent`	User speech text from RTC audio
S2C	`OutputTranscriptEvent`	Assistant response text
S2C	`ToolRegisterEvent`	Register new tool
S2C	`FunctionCallEvent`	Model requests tool
S2C	`ServerInterruptEvent`	Stop ongoing output
C2S	`_set_context_messages([...])`	Provide system/dev/user context
C2S	`_send_message_item(...)`	Send user/developer message
C2S	`_send_create_response()`	Trigger assistant response
C2S	Function output (from agent)	Return tool results

For detailed event documentation and parameters, see the API Reference.
