Real-Time Voice AI Backend (Custom LLM over WebSocket)
Overview
A FastAPI backend that serves a custom LLM to a real-time conversational-voice platform (Retell) over a WebSocket, with optional inbound and outbound phone calls via Twilio. It streams OpenAI completions back to the voice agent incrementally to keep response latency low, and it supports LLM function calling.
Why It Exists
To explore building voice AI agents on our own LLM logic rather than a black box, by plugging a custom model endpoint into a managed speech and telephony layer. The project validates this architecture for latency-sensitive, tool-using voice assistants.
What We Built
A WebSocket server (server.py/server2.py) exposing an /llm-websocket endpoint that Retell connects to; LLM orchestration modules (llm.py, llm_in_connection.py, llm_with_func_calling.py) covering plain streaming and function-calling variants; and a Twilio integration (twilio_server.py, call.py) for placing and receiving phone calls. Per the README, the local server is exposed to the internet through ngrok. The pinned stack includes fastapi, uvicorn, openai, retell-sdk, and twilio.
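The request/response contract behind the /llm-websocket endpoint can be sketched without any framework code. What follows is a minimal sketch under stated assumptions, not the repository's implementation: the field names (interaction_type, response_id, content, content_complete, end_call) are assumptions modeled on Retell's custom-LLM protocol, and fake_token_stream, response_frames, and collect are hypothetical names standing in for the real OpenAI stream and WebSocket handler.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List


async def fake_token_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM call; yields one word at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the loop, as a network stream would
        yield word + " "


async def response_frames(
    request: Dict[str, Any], tokens: AsyncIterator[str]
) -> AsyncIterator[Dict[str, Any]]:
    """Turn streamed text deltas into response frames for the voice platform.

    Field names are assumptions modeled on Retell's custom-LLM protocol.
    """
    # Only answer when the platform asks for a response; transcript-only
    # updates produce no frames.
    if request.get("interaction_type") != "response_required":
        return
    response_id = request["response_id"]
    async for delta in tokens:
        yield {
            "response_id": response_id,
            "content": delta,
            "content_complete": False,
            "end_call": False,
        }
    # A final empty frame marks the end of this turn.
    yield {
        "response_id": response_id,
        "content": "",
        "content_complete": True,
        "end_call": False,
    }


async def collect(request: Dict[str, Any], text: str) -> List[Dict[str, Any]]:
    """Gather all frames for one request (for demonstration only)."""
    return [frame async for frame in response_frames(request, fake_token_stream(text))]


if __name__ == "__main__":
    demo = {"interaction_type": "response_required", "response_id": 1}
    for frame in asyncio.run(collect(demo, "How can I help?")):
        print(frame)
```

In a FastAPI handler, each frame would be sent with `await websocket.send_json(frame)` as soon as it is produced, so the voice agent can start speaking before the completion finishes.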
Technologies & Approach
FastAPI + Uvicorn for an async WebSocket server, the OpenAI SDK for streaming completions, the Retell SDK for the voice agent contract, and Twilio for telephony. Streaming and function-calling code paths are separated to compare approaches.
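The function-calling path differs from plain streaming mainly in that tool-call arguments arrive as JSON fragments that have to be reassembled before anything can be executed. The sketch below illustrates that step under simplifying assumptions: the deltas are plain dicts with index, name, and arguments keys (a flattened stand-in for the tool-call deltas the OpenAI SDK streams), and accumulate_tool_calls, dispatch, and the end_call tool are hypothetical names, not taken from llm_with_func_calling.py.

```python
import json
from typing import Any, Callable, Dict, Iterable, List


def accumulate_tool_calls(deltas: Iterable[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls: Dict[int, Dict[str, str]] = {}
    for delta in deltas:
        slot = calls.setdefault(delta["index"], {"name": "", "arguments": ""})
        # The name usually arrives once; argument JSON arrives in pieces.
        slot["name"] += delta.get("name") or ""
        slot["arguments"] += delta.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


def dispatch(call: Dict[str, str], registry: Dict[str, Callable[..., Any]]) -> Any:
    """Parse the assembled JSON arguments and invoke the matching local function."""
    func = registry[call["name"]]
    kwargs = json.loads(call["arguments"] or "{}")
    return func(**kwargs)


def end_call(message: str) -> str:
    """Hypothetical tool: say a closing message, then hang up."""
    return f"ending call: {message}"


if __name__ == "__main__":
    streamed = [
        {"index": 0, "name": "end_call", "arguments": ""},
        {"index": 0, "arguments": '{"mess'},
        {"index": 0, "arguments": 'age": "bye"}'},
    ]
    for call in accumulate_tool_calls(streamed):
        print(dispatch(call, {"end_call": end_call}))
```

Keeping reassembly separate from dispatch means the plain-streaming path can skip both, which matches the idea of comparing the two approaches side by side.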
Outcome / Impact
Proved that a custom, function-calling LLM backend can be driven end-to-end by a managed voice platform and a telephony provider. The README is candid that using the OpenAI endpoint rather than Azure OpenAI introduces variable latency, a useful finding for anyone productionizing voice AI. Archived as R&D.
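One way to quantify the latency variability noted above is to record time to first token for each response. The helper below is an illustrative sketch only; time_to_first_token and delayed_stream are hypothetical names, and delayed_stream merely simulates a slow upstream endpoint rather than calling OpenAI.

```python
import asyncio
import time
from typing import AsyncIterator, List, Optional


async def delayed_stream(delay: float, tokens: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an LLM stream that stalls before its first token."""
    await asyncio.sleep(delay)
    for token in tokens:
        yield token


async def time_to_first_token(stream: AsyncIterator[str]) -> Optional[float]:
    """Return seconds elapsed until the first item arrives, or None if the stream is empty."""
    start = time.perf_counter()
    async for _ in stream:
        return time.perf_counter() - start
    return None


if __name__ == "__main__":
    ttft = asyncio.run(time_to_first_token(delayed_stream(0.05, ["hi", " there"])))
    print(f"time to first token: {ttft:.3f}s")
```

Logging this value per call for each endpoint would turn the README's qualitative observation into a distribution that can be compared directly.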
Capabilities Demonstrated
- Real-time streaming voice AI over WebSockets
- Custom LLM integration with managed conversational-voice platforms
- LLM function/tool calling in a live agent loop
- Telephony integration (Twilio) for inbound/outbound calls