
Audio September Release - Streaming Transcription V2 and Streaming Speaker Diarization


Today we’re excited to announce Streaming Transcription V2 and Streaming Speaker Diarization! Streaming Transcription V2 is our new real-time speech-to-text transcription API, and Streaming Speaker Diarization determines who spoke when in an audio stream in real time. This release extends our existing streaming transcription offering and powers use cases like voice agents, live captioning, customer support automation, and more!

Streaming Transcription

Audio is one of the most natural ways to connect with AI. Much of the world’s information is spoken, and users increasingly expect AI applications to reason over audio - whether that’s transcribing lecture notes, analyzing voice chats, or powering conversational agents.

But some applications can’t wait until an audio file is finished.

  • Call centers need immediate transcripts to guide ongoing conversations.
  • Live broadcasts rely on real-time captions to keep viewers engaged.
  • Voice agents require instant transcription to deliver timely responses.

Delays in transcription can limit automation and frustrate users. That’s why, earlier this year, we introduced Fireworks Streaming Transcription. Today, we’re taking it a step further with the launch of Streaming Transcription V2 - a faster, lower-latency, cross-region API - alongside Streaming Speaker Diarization.

What’s new in Streaming Transcription V2?

Speed: Our existing streaming API achieves end-to-end, user-facing latency as low as 300 ms and handles raw PCM audio at 16 kHz mono. Streaming Transcription V2 pushes this further, delivering up to 25% lower latency than V1 in our benchmarks.
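If your source audio isn’t already 16 kHz mono PCM, a small amount of client-side preprocessing goes a long way. Below is a minimal sketch assuming the pydub package and a hypothetical input file; any resampling tool (ffmpeg, soundfile, etc.) works equally well:

```python
# Convert source audio to 16 kHz mono 16-bit PCM before streaming.
# pydub is just one option here; ffmpeg or soundfile work as well.
from pydub import AudioSegment

audio = (
    AudioSegment.from_file("meeting.mp3")  # any format ffmpeg can decode
    .set_frame_rate(16000)                 # resample to 16 kHz
    .set_channels(1)                       # downmix to mono
    .set_sample_width(2)                   # 16-bit samples
)
pcm_bytes = audio.raw_data                 # raw PCM ready to stream

# Split into small chunks (e.g. 50 ms) so partial results arrive quickly.
CHUNK_MS = 50
chunk_size = 16000 * 2 * CHUNK_MS // 1000  # bytes per chunk: rate * width * duration
chunks = [pcm_bytes[i:i + chunk_size] for i in range(0, len(pcm_bytes), chunk_size)]
```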

Quality: In our evaluation, Streaming Transcription V2 outperforms V1 on our internal noisy-meeting dataset with a lower word error rate (WER; see the chart below). As with Streaming Transcription V1, you can optionally pair the audio stream with a voice activity detection (VAD) model to remove silent intervals and improve transcription accuracy.

[Chart: WER comparison between Streaming Transcription V1 and V2 on the internal noisy meeting dataset]
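This post doesn’t specify how the VAD model is wired in, so here is one possible client-side arrangement, sketched with the open-source webrtcvad package as a stand-in; the frame size and aggressiveness level are illustrative:

```python
# Filter out silent frames before sending audio, using WebRTC VAD as an example.
# webrtcvad expects 16-bit mono PCM and frame lengths of 10, 20, or 30 ms.
import webrtcvad

SAMPLE_RATE = 16000                       # Hz
FRAME_MS = 30                             # allowed: 10, 20, or 30 ms
FRAME_BYTES = SAMPLE_RATE * 2 * FRAME_MS // 1000

vad = webrtcvad.Vad(2)                    # aggressiveness: 0 (lenient) .. 3 (strict)

def voiced_frames(pcm: bytes):
    """Yield only the frames the VAD classifies as speech."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```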

Cost-efficiency

Pricing will be $0.0035 per audio minute - significantly cheaper than other streaming audio services, which charge $0.0050 to $0.0070 per minute. Transcribing a full day of audio costs just over $5 (1,440 minutes × $0.0035 ≈ $5.04)!

When to use V2 over V1

Both the existing streaming transcription service (V1) and the new Streaming Transcription V2 are available, and both will continue to be supported.

  • V1: The original release and the current default endpoint. Recommended if you need more cost-efficient streaming transcription or do not have strict latency requirements.
  • V2: Recommended for use cases that require lower latency or better accuracy in noisy conditions.

Streaming Speaker Diarization

Why diarization?

Speaker diarization is the process of determining who spoke when in an audio stream. It’s essential for making transcripts easy to follow, powering per-speaker analytics, and creating clear captions for multi-speaker conversations.

Earlier this year, we introduced pre-recorded speaker diarization, which helps analyze past conversations with high accuracy and scalability. Since then, one of the most common requests we’ve heard is: “Can you do this live?”

What’s new?

We’re excited to introduce streaming speaker diarization, now available in closed beta. This brings the same diarization capabilities from our pre-recorded pipeline to streaming audio, allowing you to continuously identify and track speakers over the course of a live session.

Streaming diarization is built with:

  • Speaker continuity - Consistent speaker IDs across the session for easier tracking.
  • Flexible integration - Diarization tags arrive alongside transcription results at the word level (see the sketch after this list).
  • Scalability - Built on the same infrastructure that powers our streaming transcription API, supporting thousands of concurrent live streams with the reliability and cost-efficiency customers already expect from Fireworks.
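The exact response schema isn’t published in this post, so the snippet below is only a sketch of how word-level speaker tags might be consumed; the `words`, `speaker_id`, and `text` field names are assumptions, not the documented API:

```python
from collections import defaultdict

def group_words_by_speaker(message: dict) -> dict[str, list[str]]:
    """Bucket word-level results by speaker ID.

    `message` is assumed to carry a list of word objects, e.g.
    {"words": [{"text": "hello", "speaker_id": "spk_0", "start": 0.1, "end": 0.3}, ...]}.
    Field names are illustrative; check the API docs for the real schema.
    """
    by_speaker: dict[str, list[str]] = defaultdict(list)
    for word in message.get("words", []):
        by_speaker[word["speaker_id"]].append(word["text"])
    return dict(by_speaker)

# Example: print a speaker-attributed transcript for one incoming message.
message = {"words": [
    {"text": "Hi",    "speaker_id": "spk_0", "start": 0.0, "end": 0.2},
    {"text": "there", "speaker_id": "spk_0", "start": 0.2, "end": 0.5},
    {"text": "Hello", "speaker_id": "spk_1", "start": 0.6, "end": 0.9},
]}
for speaker, words in group_words_by_speaker(message).items():
    print(f"{speaker}: {' '.join(words)}")   # spk_0: Hi there / spk_1: Hello
```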

Together, streaming transcription and streaming diarization unlock powerful real-time use cases - from call center analytics and agent assist, to live captions for broadcasts, to interactive voice agents that understand not just what was said, but also who said it.

See streaming diarization in action: speakers are identified live and remain consistent across the conversation (source)

Get Started

Streaming Transcription V2

Using Streaming Transcription V2 is simple: it’s fully compatible with the existing streaming transcription API - you only need to switch to the new endpoint.
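As a rough sketch of what switching endpoints looks like in practice (the WebSocket URL, header, and message handling below are placeholders rather than documented values; take the real ones from the Fireworks docs):

```python
# Minimal streaming-client sketch using the third-party `websockets` package.
# ENDPOINT is a placeholder; substitute the Streaming Transcription V2 URL from the docs.
import asyncio
import os
import websockets

ENDPOINT = "wss://<streaming-transcription-v2-endpoint>"   # placeholder URL
HEADERS = {"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"}

async def stream_pcm(chunks):
    """Send 16 kHz mono PCM chunks and print transcript messages as they arrive."""
    # Note: older versions of `websockets` call this kwarg `extra_headers`.
    async with websockets.connect(ENDPOINT, additional_headers=HEADERS) as ws:
        async def sender():
            for chunk in chunks:
                await ws.send(chunk)        # raw PCM bytes
                await asyncio.sleep(0.05)   # pace the stream roughly in real time

        async def receiver():
            async for message in ws:
                print(message)              # incremental transcript payloads

        await asyncio.gather(sender(), receiver())

# asyncio.run(stream_pcm(chunks))  # `chunks` as produced in the preprocessing sketch above
```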


Streaming Diarization

Streaming diarization is currently available in closed beta. If you’re interested in trying it out, contact us to request access.