How it works
A voice session is a streaming connection between your client and the Theo voice runtime. Your app:
- Captures microphone audio in the browser (or on device).
- Requests a short-lived session token from Theo.
- Opens the streaming voice connection using that token and plays back Theo’s audio response as it arrives.
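The three steps above can be sketched as follows. This is a minimal illustration, not the documented client API: the endpoint paths (`/voice/token`, `/voice/session`) and the response field names are assumptions for the sketch.

```typescript
// Hypothetical shape of the token endpoint's response (field names assumed).
interface SessionToken {
  token: string;
  systemInstruction: string;
  tools: unknown[];
}

// Step 2: request a short-lived session token.
// The path /voice/token is an assumption for this sketch.
async function getSessionToken(baseUrl: string): Promise<SessionToken> {
  const res = await fetch(`${baseUrl}/voice/token`, { method: "POST" });
  if (!res.ok) throw new Error(`token request failed: ${res.status}`);
  return res.json();
}

// Step 3: derive the streaming URL that opens the live voice connection.
// Turns http(s):// into ws(s):// and carries the token as a query parameter.
function buildSessionUrl(baseUrl: string, token: string): string {
  const wsBase = baseUrl.replace(/^http/, "ws");
  return `${wsBase}/voice/session?token=${encodeURIComponent(token)}`;
}
```

In a browser client you would then pass the built URL to `new WebSocket(...)` and feed microphone audio in while playing back Theo's audio frames as they arrive.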
Getting a Voice Token
Before starting a voice session, request a session token.
The token endpoint returns systemInstruction and tools alongside the
token as a Theo platform contract. Your client MUST attach both when it
opens the live voice connection; otherwise the session starts without
the Theo persona and with no tool access.
systemInstruction carries the Theo voice persona, platform context,
runtime protocol, and tool-usage guidance (plus memory and skill context
when present). The tools array includes every declared voice tool
(generate_image, generate_video, deep_research, generate_code,
generate_document, save_memory) and any installed skill tools
namespaced as skill_{slug}_{toolName}.
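A sketch of honoring that contract on the client. The exact shape of the live-session setup message is an assumption here; the point is that both `systemInstruction` and `tools` from the token response are forwarded unchanged.

```typescript
// Shape of the token endpoint's response, per the contract described above
// (tool entry shape beyond `name` is an assumption for this sketch).
interface TokenResponse {
  token: string;
  systemInstruction: string;
  tools: { name: string }[];
}

// Build the setup payload for the live voice connection.
// Both fields MUST be forwarded: dropping either starts the session
// without the Theo persona or without tool access.
function buildSessionSetup(r: TokenResponse) {
  return {
    systemInstruction: r.systemInstruction,
    tools: r.tools,
  };
}
```

Note that `tools` includes the built-in voice tools (generate_image, generate_video, deep_research, generate_code, generate_document, save_memory) as well as any installed skill tools, so forwarding the array verbatim is all the client needs to do.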
Voice selection is optional. Available voice identifiers are listed in
the dashboard under Settings → Voice.
Transcription (STT)
Send audio for transcription via the STT endpoint. The response is { "text": "transcribed text" }.
Supported formats: WebM, MP3, WAV, M4A, OGG. Max file size: 25MB.
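A hedged sketch of an STT upload, with a client-side check of the documented format and size limits. The endpoint path (`/voice/stt`) and multipart field name (`audio`) are assumptions; the format list and 25MB cap come from the text above.

```typescript
// Documented constraints: WebM, MP3, WAV, M4A, OGG; max 25MB.
const SUPPORTED_EXTENSIONS = ["webm", "mp3", "wav", "m4a", "ogg"];
const MAX_BYTES = 25 * 1024 * 1024;

// Client-side pre-check before uploading.
function validateAudio(filename: string, sizeBytes: number): boolean {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  return SUPPORTED_EXTENSIONS.includes(ext) && sizeBytes <= MAX_BYTES;
}

// Upload audio for transcription (endpoint path and field name assumed).
async function transcribe(baseUrl: string, audio: Blob, filename: string): Promise<string> {
  const form = new FormData();
  form.append("audio", audio, filename);
  const res = await fetch(`${baseUrl}/voice/stt`, { method: "POST", body: form });
  if (!res.ok) throw new Error(`STT request failed: ${res.status}`);
  const { text } = await res.json();
  return text;
}
```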
Text-to-Speech (TTS)
Convert text to audio.
Skill Tools in Voice
E.V.I. skills with tool definitions are automatically available in voice sessions. The skill bridge exposes each skill’s tools to the voice runtime as function declarations:
- Tool names are namespaced as
skill_{slug}_{toolName} to avoid collisions.
- Input schemas declared in the skill manifest are forwarded to the voice runtime.
- When the voice session calls a skill tool, Theo routes it to your skill’s tool executor and feeds the result back into the live session.
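The namespacing convention above can be illustrated with two small helpers: one builds the namespaced voice-tool name, the other recovers the skill slug and tool name when routing a tool call back to the skill's executor. The parsing logic here assumes slugs contain no underscores; that is a simplification for the sketch, not a documented guarantee.

```typescript
// Build the namespaced voice-tool name, per skill_{slug}_{toolName}.
function voiceToolName(slug: string, toolName: string): string {
  return `skill_${slug}_${toolName}`;
}

// Recover slug and tool name from a namespaced voice-tool name so a call
// can be routed to the right skill executor. Assumes the slug itself
// contains no underscores (simplification for this sketch).
function parseVoiceToolName(
  name: string
): { slug: string; toolName: string } | null {
  const m = name.match(/^skill_([^_]+)_(.+)$/);
  return m ? { slug: m[1], toolName: m[2] } : null;
}
```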
