Build AI Phone Call Systems with Twilio and TEN Framework
The TEN Framework makes it possible to build real-time, low-latency AI phone call systems that can handle both inbound and outbound calls with natural voice conversations — all orchestrated through a single, unified pipeline.
In this tutorial, we'll show you how to create an AI assistant that can make and receive phone calls using Twilio's Voice API and TEN Framework. The best part? You get real-time speech recognition, intelligent responses, and natural text-to-speech — all working together seamlessly.
Project Structure
You don't need to set everything up from scratch — TEN Framework already provides a ready-to-use Twilio phone call example in the repository.
👉 You can find it here: voice-assistant-sip-twilio example on GitHub
The folder layout looks like this (mirroring what you'll find on GitHub):
.
├── tenapp/
│   └── ten_packages/
│       └── extension/
│           └── main_python/
│               ├── extension.py → MainControlExtension (call orchestration)
│               ├── server.py → Twilio call management
│               └── config.py → Configuration management
├── frontend/ → Next.js web interface
├── server/ → Standalone server that manages the TEN process
└── property.json → TEN Framework configuration

This example shows how to implement both inbound and outbound call handling while reusing existing ASR, LLM, and TTS extensions.
Getting Started
We recommend following the official Getting Started guide for the basic setup steps (installations, API keys, environment, Docker, etc.).
cd agents/examples/voice-assistant-sip-twilio
task install

This ensures you're running the Twilio-integrated version with full phone call capabilities.
Required API Keys
Before you start, you'll need accounts and API keys from these services:
Twilio Account
Get credentials from Twilio Console:
- TWILIO_ACCOUNT_SID - Your Twilio Account SID
- TWILIO_AUTH_TOKEN - Your Twilio Auth Token
- TWILIO_FROM_NUMBER - Your Twilio phone number
- TWILIO_PUBLIC_SERVER_URL - Your public server URL for webhooks
Deepgram Account
Get credentials from Deepgram Console:
- DEEPGRAM_API_KEY - Your Deepgram API key for speech-to-text
OpenAI Account
Get credentials from OpenAI Platform:
- OPENAI_API_KEY - Your OpenAI API key for the language model
ElevenLabs Account
Get credentials from ElevenLabs:
- ELEVENLABS_TTS_KEY - Your ElevenLabs API key for text-to-speech
Test It Out
Now that you've set everything up:
- Follow the Getting Started guide
- Prepare your environment variables by creating a .env file with all the required API keys:
# Twilio (required for call handling)
TWILIO_ACCOUNT_SID=your_twilio_account_sid_here
TWILIO_AUTH_TOKEN=your_twilio_auth_token_here
TWILIO_FROM_NUMBER=+1234567890
TWILIO_PUBLIC_SERVER_URL=https://your-domain.com
# Deepgram (required for speech-to-text)
DEEPGRAM_API_KEY=your_deepgram_api_key_here
# OpenAI (required for language model)
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-4
# ElevenLabs (required for text-to-speech)
ELEVENLABS_TTS_KEY=your_elevenlabs_api_key_here
# Ngrok (required for local development)
NGROK_AUTHTOKEN=your_ngrok_auth_token_here

- Set up ngrok for local development (required for Twilio webhooks):
# Install ngrok if you haven't already
# Download from https://ngrok.com/download or use package manager
# Authenticate ngrok with your auth token (from your .env file)
ngrok config add-authtoken $NGROK_AUTHTOKEN

- Use the Twilio voice assistant:
cd agents/examples/voice-assistant-sip-twilio
task install

- Start the system:
task run

- Access the web interface at http://localhost:3000
- Use the frontend to make outbound calls or handle inbound calls
- Start talking — your AI phone system will handle the entire conversation!
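Before starting the stack, it can help to confirm that every key is actually set. Here is a minimal stdlib-only sketch (the variable names mirror the .env file above; real projects often load .env with python-dotenv instead):

```python
import os

# Required keys, matching the .env file shown earlier.
REQUIRED = [
    "TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN", "TWILIO_FROM_NUMBER",
    "TWILIO_PUBLIC_SERVER_URL", "DEEPGRAM_API_KEY", "OPENAI_API_KEY",
    "ELEVENLABS_TTS_KEY", "NGROK_AUTHTOKEN",
]

def check_env(env: dict) -> list:
    """Return the names of any required keys that are missing or empty."""
    return [name for name in REQUIRED if not env.get(name)]

missing = check_env(dict(os.environ))
print("missing keys:", missing or "none")
```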
✨ That's it — you now have a working AI phone call system powered by Twilio and TEN Framework!
Outbound Call Flow: From API to Audio
Let's trace through a complete outbound call to understand how the system works end-to-end:
API Request to Create Call
When you make an outbound call through the frontend or API:
# From: tenapp/ten_packages/extension/main_python/server.py
@self.app.post("/api/call")
async def create_call(request: Request):
    """Create a new outbound call"""
    body = await request.json()
    phone_number = body.get("phone_number")
    message = body.get("message", "Hello from Twilio!")

    # Create TwiML response with media stream
    twiml_response = VoiceResponse()

    # Configure WebSocket URL for real-time audio streaming
    ws_protocol = "wss" if self.config.twilio_use_wss else "ws"
    media_ws_url = f"{ws_protocol}://{self.config.twilio_public_server_url}/media"
    connect = twiml_response.connect()
    connect.stream(url=media_ws_url)  # This tells Twilio to connect to our WebSocket
    twiml_response.append(connect)

    # Create the call via Twilio API
    call = self.twilio_client.calls.create(
        to=phone_number,
        from_=self.config.twilio_from_number,
        twiml=str(twiml_response)
    )

👉 Key Point: The TwiML includes a <Stream> instruction that tells Twilio to establish a WebSocket connection to our server for real-time audio streaming.
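For reference, the TwiML built above serializes to a small XML document. The stdlib-only sketch below hand-builds the same shape, so it runs without the twilio package (the real handler uses VoiceResponse, as shown):

```python
from xml.etree import ElementTree as ET

# Hand-build the <Response><Connect><Stream/></Connect></Response>
# document that the VoiceResponse/Connect/Stream calls produce.
def build_stream_twiml(ws_url: str) -> str:
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", url=ws_url)
    return ET.tostring(response, encoding="unicode")

print(build_stream_twiml("wss://your-domain.com/media"))
```

Twilio reads this document when the call connects and opens the WebSocket to the given URL.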
WebSocket Connection Establishment
When Twilio receives the call, it immediately tries to connect to our WebSocket endpoint:
# From: tenapp/ten_packages/extension/main_python/server.py
@self.app.websocket("/media")
async def websocket_endpoint(websocket: WebSocket):
    """WebSocket endpoint for Twilio media streaming"""
    # Accept the connection immediately
    await websocket.accept()

    # Send confirmation message
    await websocket.send_text(
        '{"type": "connected", "message": "WebSocket connection established"}'
    )

    # Store the WebSocket connection for this call
    call_sid = None
    while True:
        data = await websocket.receive_text()
        message = json.loads(data)

        if message.get("event") == "start":
            # Extract call SID and store WebSocket connection
            call_sid = message.get("start", {}).get("callSid", "")
            self.active_call_sessions[call_sid]["websocket"] = websocket

👉 Key Point: Twilio establishes a persistent WebSocket connection that will carry all audio data in both directions.
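The "start" event handled above looks roughly like this on the wire (all field values here are fabricated for illustration; the mediaFormat block reflects Twilio's documented media-stream format):

```python
import json

# A fabricated "start" event in the shape Twilio sends over the media WebSocket.
raw = json.dumps({
    "event": "start",
    "streamSid": "MZ0123456789abcdef0123456789abcdef",
    "start": {
        "callSid": "CA0123456789abcdef0123456789abcdef",
        "mediaFormat": {"encoding": "audio/x-mulaw", "sampleRate": 8000, "channels": 1},
    },
})

# Same extraction logic as the endpoint above.
message = json.loads(raw)
call_sid = ""
if message.get("event") == "start":
    call_sid = message.get("start", {}).get("callSid", "")
print(call_sid)  # CA0123456789abcdef0123456789abcdef
```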
Receiving Phone Audio
When the person answers the phone and starts speaking, Twilio sends audio data through the WebSocket:
# From: tenapp/ten_packages/extension/main_python/server.py
if message.get("event") == "media":
    # Extract base64-encoded audio payload
    audio_payload = message.get("media", {}).get("payload", "")
    stream_sid = message.get("streamSid", "")

    if audio_payload and call_sid:
        # Forward audio to TEN framework for processing
        await self.extension_instance._forward_audio_to_ten(
            audio_payload, stream_sid
        )

The audio then flows through the TEN framework:
# From: tenapp/ten_packages/extension/main_python/extension.py
async def _forward_audio_to_ten(self, audio_payload: str, call_sid: str):
    """Forward audio data to TEN framework and dump PCM audio"""
    # Decode base64 audio data from Twilio
    audio_data = base64.b64decode(audio_payload)

    # Send to ASR (speech-to-text) extension
    await _send_data(self.ten_env, "audio", "stt", {
        "audio_data": audio_data,
        "session_id": call_sid,
        "stream_id": 100
    })

👉 Key Point: Phone audio → Twilio WebSocket → Base64 decode → TEN Framework → ASR processing.
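Note that the decoded bytes are still 8 kHz G.711 mu-law, one byte per sample (Twilio's media-stream encoding), while ASR engines typically expect 16-bit linear PCM. If you ever need the conversion without the audioop module, the standard mu-law expansion is short enough to write by hand; a sketch:

```python
import base64

def ulaw_byte_to_pcm16(b: int) -> int:
    # Standard G.711 mu-law expansion: invert the byte, split into
    # sign / exponent / mantissa, rebuild the 16-bit linear sample.
    u = ~b & 0xFF
    t = ((u & 0x0F) << 3) + 0x84
    t <<= (u & 0x70) >> 4
    return (0x84 - t) if (u & 0x80) else (t - 0x84)

# Decode a (fabricated) base64 payload the way the handler above does,
# then expand each mu-law byte to a linear sample.
payload = base64.b64encode(bytes([0xFF, 0x7F, 0x00, 0x80])).decode("ascii")
audio_data = base64.b64decode(payload)
samples = [ulaw_byte_to_pcm16(b) for b in audio_data]
print(samples)  # [0, 0, -32124, 32124]
```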
Sending Response Audio Back
When the AI generates a response, it flows back through the system:
# From: tenapp/ten_packages/extension/main_python/extension.py
async def send_audio_to_twilio(self, audio_data: bytes, call_sid: str):
    """Send audio data to Twilio via WebSocket"""
    # Get the WebSocket connection for this call
    websocket = self.server_instance.active_call_sessions[call_sid].get("websocket")

    # Downsample audio from 16000 Hz to 8000 Hz for Twilio
    source_rate = 16000  # TTS generated audio sample rate
    target_rate = 8000   # Twilio required sample rate
    downsampled_audio = audioop.ratecv(
        audio_data, 2, 1, source_rate, target_rate, None
    )[0]

    # Encode as base64 for Twilio
    encoded_audio = base64.b64encode(downsampled_audio).decode('utf-8')

    # Send to Twilio via WebSocket
    message = {
        "event": "media",
        "streamSid": call_sid,
        "media": {
            "payload": encoded_audio
        }
    }
    await websocket.send_text(json.dumps(message))

👉 Key Point: TTS audio → Downsample to 8kHz → Base64 encode → WebSocket → Twilio → Phone speaker.
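One caveat: audioop was deprecated in Python 3.11 and removed in 3.13 (PEP 594). Because 16 kHz to 8 kHz is an exact 2:1 ratio, a naive decimation can stand in for ratecv in a pinch; this sketch averages each pair of 16-bit samples (a proper resampler would also low-pass filter first to avoid aliasing):

```python
import struct

def downsample_2to1(pcm16: bytes) -> bytes:
    # Naive 2:1 decimation for 16-bit little-endian mono PCM:
    # average each adjacent pair of samples.
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    out = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples) - 1, 2)]
    return struct.pack(f"<{len(out)}h", *out)

src = struct.pack("<4h", 100, 300, -200, -400)  # four samples at 16 kHz
print(len(downsample_2to1(src)))  # 4 bytes: two samples at 8 kHz
```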
Complete Audio Loop
The complete flow creates a real-time conversation loop:
- Phone → Twilio: Person speaks into phone
- Twilio → WebSocket: Audio sent as base64-encoded chunks
- WebSocket → TEN: Audio forwarded to ASR extension
- ASR → LLM: Speech converted to text, sent to AI
- LLM → TTS: AI response converted to speech
- TTS → WebSocket: Audio sent back to Twilio
- WebSocket → Twilio → Phone: AI voice plays through phone
All of this happens in real-time with minimal latency, creating natural phone conversations.
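To make the loop concrete, here is a toy, self-contained version with stub ASR/LLM/TTS stages (everything here is fabricated; the real system wires Deepgram, OpenAI, and ElevenLabs together through TEN extensions):

```python
import asyncio
import base64

# Stub stages standing in for the real ASR -> LLM -> TTS extensions.
async def asr(audio: bytes) -> str:
    return "hello"  # pretend speech-to-text

async def llm(text: str) -> str:
    return f"You said: {text}"  # pretend language model

async def tts(text: str) -> bytes:
    return text.encode()  # pretend text-to-speech

async def handle_media_event(payload_b64: str) -> str:
    audio = base64.b64decode(payload_b64)        # Twilio -> raw audio
    text = await asr(audio)                      # ASR: speech to text
    reply = await llm(text)                      # LLM: text to response
    audio_out = await tts(reply)                 # TTS: response to audio
    return base64.b64encode(audio_out).decode()  # audio -> back to Twilio

payload = base64.b64encode(b"\x7f" * 160).decode()  # fake 20 ms audio chunk
result = asyncio.run(handle_media_event(payload))
print(base64.b64decode(result).decode())  # You said: hello
```

Each step is awaited in sequence here for clarity; in the real pipeline the extensions stream partial results so a reply can start playing before the caller's sentence is fully processed.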