Today we’re excited to announce Streaming Transcription V2 and Streaming Speaker Diarization! Streaming Transcription V2 is our new real-time speech-to-text transcription API, and Streaming Speaker Diarization determines who spoke when in an audio stream, in real time. This release extends our existing streaming transcription offering and powers use cases like voice agents, live captioning, customer support automation, and more!
Audio is one of the most natural ways to connect with AI. Much of the world’s information is spoken, and users increasingly expect AI applications to reason over audio - whether that’s transcribing lecture notes, analyzing voice chats, or powering conversational agents.
But some applications can’t wait until an audio file is finished.
Delays in transcription can limit automation and frustrate users. That’s why earlier this year we introduced Fireworks Streaming Transcription. Today, we’re taking it a step further with the launch of Streaming Transcription V2, a faster, lower-latency, cross-region API, alongside Streaming Speaker Diarization.
Speed: Our existing Streaming API achieves end-to-end user-facing latency as low as 300ms and handles raw PCM audio at 16kHz in mono. Streaming Transcription V2 pushes this further, delivering up to 25% lower latency than V1 in our benchmarks.
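If your source audio isn’t already raw 16kHz mono PCM, converting it first is straightforward. A minimal sketch, assuming ffmpeg is installed and using placeholder file names:

```python
# Convert any input file to raw 16 kHz mono 16-bit PCM with ffmpeg.
# Assumes ffmpeg is on PATH; "input.wav" and "output.pcm" are placeholders.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-i", "input.wav",  # any format ffmpeg can read
        "-ar", "16000",               # resample to 16 kHz
        "-ac", "1",                   # downmix to mono
        "-f", "s16le",                # raw little-endian 16-bit PCM
        "output.pcm",
    ],
    check=True,
)
```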
Quality: In our evaluation, Streaming Transcription V2 outperforms V1 on our internal noisy meeting dataset with a lower word error rate (WER) (see chart below). As with Streaming Transcription V1, users have the option to pair the audio stream with a voice activity detection (VAD) model to remove silent intervals and improve transcription accuracy.
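Client-side, this can be as simple as filtering frames through an open-source VAD before sending them. The sketch below uses the webrtcvad package (pip install webrtcvad); the frame size and aggressiveness setting are illustrative choices, not requirements of our API.

```python
# Drop silent frames client-side with a VAD before streaming.
import webrtcvad

SAMPLE_RATE = 16000                               # matches the API's expected input
FRAME_MS = 30                                     # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit samples -> 2 bytes each

vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) to 3 (strict)

def speech_frames(pcm: bytes):
    """Yield only the 30 ms frames that contain speech."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i : i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```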
Cost-efficiency: Pricing will be $0.0035 per audio minute, significantly cheaper than other streaming audio services that charge $0.0050 to $0.0070 per minute. Transcribing a full day of audio costs a little more than $5!
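The arithmetic behind that figure:

```python
# Back-of-the-envelope check on the daily cost claim.
PRICE_PER_MINUTE = 0.0035             # dollars per audio minute
minutes_per_day = 24 * 60             # 1,440 minutes in a day
print(f"${PRICE_PER_MINUTE * minutes_per_day:.2f}")  # -> $5.04
```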
Both the existing streaming transcription service (V1) and the new Streaming Transcription V2 are available and will continue to be supported.
Speaker diarization is the process of determining who spoke when in an audio stream. It’s essential for making transcripts easy to follow, powering per-speaker analytics, and creating clear captions for multi-speaker conversations.
Earlier this year, we introduced pre-recorded speaker diarization, which helps analyze past conversations with high accuracy and scalability. Since then, one of the most common requests we’ve heard is: “Can you do this live?”
We’re excited to introduce streaming speaker diarization, now available in closed beta. This brings the same diarization capabilities from our pre-recorded pipeline to streaming audio, allowing you to continuously identify and track speakers over the course of a live session.
Streaming diarization is built with:
Together, streaming transcription and streaming diarization unlock powerful real-time use cases - from call center analytics and agent assist, to live captions for broadcasts, to interactive voice agents that understand not just what was said, but also who said it.
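As an illustration of what this enables, the sketch below shows how an application might merge speaker-labeled events into live captions. The event schema here is hypothetical - the closed-beta wire format isn’t public - so treat the field names as placeholders.

```python
# Illustrative only: assumes hypothetical diarization events shaped like
#   {"speaker": "spk_0", "start": 1.2, "end": 2.7, "text": "hello"}
# and merges consecutive turns by the same speaker into one caption line.
from typing import Iterable

def render_live_captions(events: Iterable[dict]) -> None:
    current_speaker, buffer = None, []
    for event in events:
        if event["speaker"] != current_speaker and buffer:
            print(f"{current_speaker}: {' '.join(buffer)}")
            buffer = []
        current_speaker = event["speaker"]
        buffer.append(event["text"])
    if buffer:  # flush the final speaker turn
        print(f"{current_speaker}: {' '.join(buffer)}")
```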
See streaming diarization in action: speakers are identified live and remain consistent across the conversation (source)
Using Streaming Transcription V2 is simple. It’s fully compatible with the existing streaming transcription API - you only need to switch to the new endpoint.
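Here’s a minimal client sketch, assuming the Python websockets package. The endpoint URL is a placeholder and authentication is omitted; consult the API docs for the actual V2 values.

```python
# Minimal async client sketch using `websockets` (pip install websockets).
# The endpoint URL is a placeholder; authentication is omitted.
import asyncio
import websockets

ENDPOINT = "wss://<your-fireworks-v2-streaming-endpoint>"  # placeholder
CHUNK_BYTES = 3200  # 100 ms of 16 kHz mono 16-bit PCM

async def stream_file(path: str) -> None:
    async with websockets.connect(ENDPOINT) as ws:

        async def send_audio() -> None:
            # Send raw PCM in 100 ms chunks, paced at real time.
            with open(path, "rb") as f:
                while chunk := f.read(CHUNK_BYTES):
                    await ws.send(chunk)
                    await asyncio.sleep(0.1)

        async def receive_transcripts() -> None:
            # Print incremental results until the server closes the stream.
            async for message in ws:
                print(message)

        await asyncio.gather(send_audio(), receive_transcripts())

asyncio.run(stream_file("output.pcm"))
```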
You can start today in a few ways:
Streaming diarization is currently available in closed beta. If you’re interested in trying it out, contact us to request access.