Build customizable, real-time voice agents with Fireworks Voice Agent Platform (Beta)

Announcing Voice Agent Platform Beta

Background - Voice AI Components

Voice agents represent one of the most exciting frontiers in AI. However, they have been difficult to productionize, owing to high costs, robotic-sounding output, and unreliable performance.

At Fireworks, we’ve had a front-row seat to voice agent development by offering the market’s fastest LLM inference and audio transcription (as measured by Artificial Analysis). We observe customers manually constructing voice agents from three or more components, typically speech recognition (ASR), an LLM, and text-to-speech (TTS). Customers use these “multi-component” or “cascaded” approaches, as opposed to end-to-end voice solutions like OpenAI’s Realtime API, for improved customizability and reliability.
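
For illustration, a hand-built cascaded agent looks roughly like the sketch below. Nothing here is a real vendor SDK; the stub functions and sleep durations are invented stand-ins that simply show how the three stages chain together, one full round trip after another.

```python
import time

# Invented stand-ins for three separate vendor calls; the sleeps are
# placeholders for network + inference time, not measured numbers.
def asr(audio: bytes) -> str:
    time.sleep(0.3)          # speech-to-text round trip
    return "what's the weather in Boston?"

def llm(transcript: str) -> str:
    time.sleep(0.5)          # LLM round trip
    return "It's 72°F and sunny in Boston."

def tts(text: str) -> bytes:
    time.sleep(0.4)          # text-to-speech round trip
    return b"<audio bytes>"

def handle_turn(audio: bytes) -> bytes:
    # Each stage waits for the previous one to finish and return to the
    # client before the next one starts, so latencies add up end to end.
    return tts(llm(asr(audio)))

start = time.time()
handle_turn(b"<caller audio>")
print(f"turn latency: {time.time() - start:.2f}s")  # ~1.2s with these stand-ins
```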

Artificial Analysis comparison of the fastest transcription services

However, these cascaded voice agents struggle with:

  • Quality - The ASR must intelligently handle accents and custom vocabulary, like product names, and the LLM must be able to manage tools and follow complex instructions
  • Latency - Typical cascaded systems require 1+ seconds to respond to inputs
  • Cost and complexity - Developers manually string together disparate solutions, adding engineering complexity while also paying 3+ separate bills

To address these issues, Fireworks is creating a platform for co-located, vertically integrated voice agents. Fireworks voice agents, now in beta, feature:

  • Blazing-fast speeds (<500 ms): Fireworks co-locates and applies optimization techniques to the market’s fastest LLM and audio components to make them even faster together
  • Simple management: Instead of dealing with 3+ vendors, Fireworks provides a single place to coordinate and scale
  • Configurable, comprehensive quality: Fireworks built customizable voice agent infrastructure from end to end, including our own streaming ASR and TTS. Any LLM on the Fireworks platform can be used to power voice agents, including fine-tuned LLMs.
Comparing cascaded versus Fireworks voice agents

Read more to see how our approach can elevate your voice agent experience.

Blazing-fast (<500 ms) speeds

Typical cascaded approaches require calls to 3 separate services, leading to long, multi-second response times. Each call traditionally happens sequentially: you wait for each part of the pipeline to finish and return to the client before proceeding.
In contrast, Fireworks co-locates all components and streams information server-side, minimizing the time between separate calls. We also use techniques like pre-emptively generating responses and proactively streaming between components so that computation occurs in parallel. Together, these techniques lead to sub-500 ms response times for human-like interactions.
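
To make the difference concrete, here is a toy sketch of the streaming idea, again using invented stand-ins rather than Fireworks’ actual implementation: when synthesis consumes LLM tokens as they are produced instead of waiting for the full reply, the time to first audio shrinks to roughly the cost of the first token.

```python
import time
from typing import Iterator

# Toy stand-ins: the LLM yields tokens as they are produced and TTS
# consumes them immediately, so synthesis overlaps generation.
def llm_stream(transcript: str) -> Iterator[str]:
    for token in ["It's ", "72°F ", "and ", "sunny ", "in ", "Boston."]:
        time.sleep(0.08)                 # per-token generation time (illustrative)
        yield token

def tts_stream(tokens: Iterator[str]) -> Iterator[bytes]:
    for token in tokens:
        yield f"<audio for {token!r}>".encode()

start = time.time()
first_audio = next(tts_stream(llm_stream("what's the weather in Boston?")))
print(f"time to first audio: {time.time() - start:.2f}s")  # ~0.08s here, vs. waiting for the whole reply
```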

Comparing multiple round trips to co-located services

Accurate ASR

We’re powering our voice agent platform with the new V2 version of Fireworks’ streaming speech transcription, which delivers higher accuracy, especially in the presence of background noise. We tested on several benchmarks, including Microsoft’s Scalable Noisy Speech Dataset and the LibriSpeech dataset. Fireworks streaming transcription outperforms Deepgram Nova across these benchmarks, as shown in the results below. The new ASR service is also available standalone; you can try it via the docs or our UI playground.
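
For reference, a minimal sketch of calling the standalone transcription service over HTTP is shown below. The endpoint URL and model id are assumptions included for illustration only, and the streaming (real-time) variant has its own interface; check the docs for the current details.

```python
import requests

# Minimal sketch of a one-shot transcription request. The URL and model id
# below are assumed for illustration; see the Fireworks docs for the current
# values and for the streaming (real-time) interface.
ENDPOINT = "https://audio-prod.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions"  # assumed

with open("call_recording.wav", "rb") as f:
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": "Bearer <FIREWORKS_API_KEY>"},
        files={"file": f},
        data={"model": "whisper-v3"},  # assumed model id
    )
resp.raise_for_status()
print(resp.json()["text"])
```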

Crisp, steerable TTS

Conversations require realistic voices that accurately pronounce custom vocabulary like product names. Fireworks TTS enables users to simply prompt a voice agent to use a specific pronunciation, specified via the International Phonetic Alphabet (IPA). For example, to get the British pronunciation of “Nike”, we add the prompt:

“Every time you would have said Nike, do not say Nike, instead use the IPA representation of the pronunciation. <ipa>nˈaɪk</ipa>.”
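
As a small sketch, rules like this can be folded into the agent’s system prompt programmatically. The helper below is ours for illustration, not a platform API; it just builds instruction strings in the format shown above, and the GIF pronunciation is an example of our own.

```python
def ipa_rule(word: str, ipa: str) -> str:
    # Build a pronunciation instruction in the prompt format shown above.
    return (
        f"Every time you would have said {word}, do not say {word}, instead "
        f"use the IPA representation of the pronunciation. <ipa>{ipa}</ipa>."
    )

system_prompt = (
    "You are a friendly product specialist. "
    + ipa_rule("Nike", "nˈaɪk")   # British pronunciation, per the example above
    + " "
    + ipa_rule("GIF", "dʒɪf")     # e.g. steer the agent to say "jif"
)
print(system_prompt)
```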

Check out this demo to see how Fireworks TTS can switch between pronouncing “GIF” like “jif” and like “gif”.

Fireworks AI Voice Agent Platform (Beta) - Pronunciation Demo

Beyond pronunciation, the right voice depends on the individual use case. Fireworks provides 40 different voice options with our TTS model and 48 voices across 8 languages through the open-source Kokoro model. We also enable customers to fine-tune voice models for their specific needs. Both models are fully commercially usable: Fireworks’ TTS has been trained exclusively on commercially usable data, and Kokoro is licensed under Apache 2.0.

Intelligent LLM and conversational experience

Finally, voice agents require advanced LLM capabilities to follow complex instructions and take action with tools. Any LLM on the Fireworks platform can be used to power voice agents, including LLMs fine-tuned with reinforcement learning and/or supervised fine-tuning.

Fireworks has also built a medley of other conversation quality improvements, like end-of-utterance detection and interruption handling, for natural conversations out of the box. Function calling is also natively supported: our system can dispatch a function call and use the result to respond naturally to the user. Check out this demo of a voice agent using tool calling in action.
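
As an illustration, the tool definition below uses the OpenAI-style function-calling schema that Fireworks’ chat completions API accepts. The order-status tool itself is a made-up example, and how tools get wired into a voice agent is covered in the docs.

```python
# A made-up order-status tool in the OpenAI-style function-calling schema
# accepted by Fireworks chat models; the voice agent's LLM can decide to
# call it and use the result to answer the caller naturally.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the shipping status of a customer's order.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order number the caller read out.",
                    }
                },
                "required": ["order_id"],
            },
        },
    }
]

def get_order_status(order_id: str) -> str:
    # Placeholder implementation; in practice this would query your backend.
    return f"Order {order_id} shipped yesterday and should arrive tomorrow."
```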

Fireworks AI Voice Agent Platform - Function Calling Demo

What’s more, our stack is fully customizable end to end. Each of these components can be tuned and adjusted for your use case, and larger changes, like model tuning, model substitution, or architecture changes, are also possible.

Get started

You can try out an initial prompt in our beta UI at fireworks.fm and view the docs here. The provided endpoint is free and rate-limited; it’s intended to demo Fireworks’ voice agent capabilities. To get full access and customization ability, or to apply to be a design partner, fill out this <1 minute form. The Fireworks team will work directly with design partners to tune a voice agent stack for the best quality and speed. We can’t wait to see what you build!