Fireworks Streaming Transcription: 300ms with Whisper-v3-large-quality
By Fireworks AI Team | 1/23/2025
Today, we’re launching a new streaming speech-to-text API for real-time use cases like voice agents and live captioning!
Speed: Achieves latency of 300ms end-to-end for 16kHz mono PCM audio
Accuracy: Near-Whisper-level quality, within 3% WER of Whisper-v3-large
Cost-efficiency: Customers can try it free for 2 weeks on the serverless tier, then pricing will be $0.0032 per audio minute, less than half the price of competing services. Transcribe a day of audio for less than $5!
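As a quick sanity check on the "day of audio for less than $5" claim, here's the arithmetic at the quoted serverless rate (a sketch using the prices stated above):

```python
RATE_PER_MINUTE = 0.0032  # serverless price quoted above, in dollars per audio minute

def transcription_cost(minutes: float) -> float:
    """Estimated transcription cost in dollars at the serverless rate."""
    return minutes * RATE_PER_MINUTE

day_cost = transcription_cost(24 * 60)  # a full day of audio = 1440 minutes
print(f"${day_cost:.2f}")               # $4.61 -- under $5
```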
Why Streaming Transcription
Audio is a critical way to connect with AI applications. Much of the world’s information exists as audio, and speaking is a natural way to interact with AI. We heard numerous requests from customers for ways to transcribe audio so that LLMs can reason over data like lecture notes and voice chats.
This led us to recently release the market’s fastest Whisper (as measured by Artificial Analysis), transcribing 1 hour of audio in 4 seconds. This service was only available for async transcription, where you upload an audio file and get back a complete transcription. Async transcription is great for use cases like summarizing podcasts or meeting recordings.
However, some use cases need transcription as soon as words are spoken. Call center agents need immediate transcripts to guide ongoing calls. Live broadcasts benefit from up-to-the-second captions that keep viewers engaged. Delayed responses can limit automation and slow user interactions. We’ve built a faster, streaming solution to unlock these use cases.
Streaming Audio API
How it works: Customers establish a WebSocket connection that streams audio chunks of 50-500ms intervals. The API receives these chunks, transcribes them in near real time, and returns incremental text segments. Applications can display intermediate transcripts, drive automation, and integrate data into downstream processes. This continuous feedback loop keeps conversations fluid and ensures minimal disruption for users.
Check out this live demo of our streaming UI to see it in action.
Speed: The new streaming API achieves up to 300ms end-to-end user-facing latency and handles raw PCM audio at 16kHz in mono. We’ve built a custom audio serving stack from the ground up, drawing on years of experience building PyTorch. Our audio stack uses optimizations like built-in voice activity detection (VAD) to efficiently handle sparse speech audio and optimized workload handling to manage numerous concurrent connections at low latency.
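For a sense of scale, the chunks involved are small: a PCM chunk's byte size is sample rate × bytes per sample × channels × duration. A quick sketch (the function and parameter names are ours, for illustration):

```python
def pcm_chunk_bytes(
    duration_ms: int,
    sample_rate: int = 16_000,   # 16kHz
    bytes_per_sample: int = 2,   # 16-bit samples
    channels: int = 1,           # mono
) -> int:
    """Size in bytes of a raw PCM chunk of the given duration."""
    return sample_rate * bytes_per_sample * channels * duration_ms // 1000

print(pcm_chunk_bytes(50))   # 1600 bytes for a 50ms chunk
print(pcm_chunk_bytes(500))  # 16000 bytes for a 500ms chunk
```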
Quality: Our service achieves accuracy within 3% WER of Whisper-v3-large. We use a Whisper-compatible model, so you can contact us to serve your fine-tuned Whisper model. Users can also pair streaming audio with a voice activity detection (VAD) model to remove silent intervals and improve transcription accuracy.
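As a rough illustration of what client-side silence trimming looks like, here is a naive energy-threshold gate over 16-bit PCM samples (a toy sketch with a made-up threshold, not the VAD model Fireworks runs):

```python
import array

def is_speech(chunk: bytes, threshold: float = 500.0) -> bool:
    """Crude energy gate over 16-bit PCM: True if mean absolute amplitude
    exceeds the threshold. A real VAD model is far more robust."""
    samples = array.array("h", chunk)  # 'h' = signed 16-bit samples
    if not samples:
        return False
    energy = sum(abs(s) for s in samples) / len(samples)
    return energy > threshold

silence = bytes(1600)  # 50ms of zeros at 16kHz mono
print(is_speech(silence))  # False -- a client could skip sending this chunk
```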
Production-readiness: Fireworks handles production AI serving at scale for companies like Cursor, Uber, and DoorDash. Handle your production traffic with Fireworks - serverless customers receive a quota of 50 concurrent streams (10 concurrent streams during the free trial period).
Cost-efficiency: Customers can try it free for 2 weeks on the serverless tier, then pricing will be $0.0032 per audio minute, significantly cheaper than other streaming audio services that charge $0.0050 to $0.0070 per minute.
We have options for upgraded serverless endpoints or dedicated deployments. Contact us if you need higher rate limits, SLAs, lower bulk pricing or single-tenant serving.
Get Started
Get started with the serverless streaming Fireworks audio endpoint today in code through docs or through our realtime transcription streaming notebook. Alternatively, get started with a button click in our UI playground. The playground lets you directly record or upload audio and run it through Fireworks’ audio APIs. Get code to recreate your UI call in one click.
Using the endpoint in code is simple - check out the example below: it streams short audio chunks (50-400ms) as binary frames of 16-bit little-endian PCM at a 16kHz sample rate, single channel (mono), while receiving transcriptions from the WebSocket in parallel.
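A minimal client sketch using the third-party `websockets` package is below. The endpoint URL, auth header name, and JSON response shape are placeholders, not the real API contract - see the docs and notebook linked above for the actual values.

```python
import asyncio
import json

CHUNK_MS = 200                                        # within the 50-400ms guidance
SAMPLE_RATE = 16_000                                  # 16kHz mono
BYTES_PER_CHUNK = SAMPLE_RATE * 2 * CHUNK_MS // 1000  # 16-bit samples -> 6400 bytes

async def stream_audio(pcm_path: str, url: str, api_key: str) -> None:
    # Requires `pip install websockets`; imported here so the constants above
    # can be inspected without the dependency installed.
    import websockets

    # Note: newer websockets releases name this parameter `additional_headers`.
    async with websockets.connect(
        url, extra_headers={"Authorization": f"Bearer {api_key}"}
    ) as ws:

        async def sender() -> None:
            with open(pcm_path, "rb") as f:
                while chunk := f.read(BYTES_PER_CHUNK):
                    await ws.send(chunk)                  # binary frame of raw PCM
                    await asyncio.sleep(CHUNK_MS / 1000)  # pace at real time
            await ws.close()

        async def receiver() -> None:
            async for message in ws:  # incremental transcript segments
                print(json.loads(message))

        await asyncio.gather(sender(), receiver())

# asyncio.run(stream_audio("audio.pcm", "wss://<endpoint>", "<FIREWORKS_API_KEY>"))
```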
Streaming Audio with your Compound AI system
Our streaming audio service pairs perfectly with services like our text inference to power use cases like live voice agents. Streaming audio can be one component of a broader compound AI approach that integrates speech, text, image, and specialized models.
Fireworks makes it easy to build compound AI systems by providing one place for:
Inference: Run all types of models and components fast and cost-effectively
Models and modalities: Get all the models you need for your system in one place, across modalities like text, audio, image and vision understanding
Adaptability: Tune and optimize models for quality and speed to suit your use case
Compound AI: Coordinate and run components together by using Fireworks frameworks and tools like function calling and JSON mode
Keep in touch with us on Discord or Twitter. Stay tuned for more updates coming soon about Fireworks and audio!