20x faster Whisper than OpenAI - Fireworks audio transcribes 1 Hour in 4 Seconds
By Fireworks AI |12/9/2024
Today, Fireworks is thrilled to announce the beta release of our speech-to-text APIs that support the Whisper v3-large models. Use it free for the next 2 weeks! Key features include:
Why it matters: Audio transcription and translation use cases are exploding in importance. Fireworks audio excels at real-world, production use cases. Fireworks’ speed enables unmatched user experiences and our complete feature slate makes it easy to get the best quality and production-readiness.
The compound AI audio opportunity
At Fireworks, we believe we’re entering a new era of multi-modal, audio-driven AI. Products like NotebookLM (and open variants built on Fireworks) demonstrate how audio and text AI can combine to create magical user experiences. Fireworks customers like Cresta are innovating to create audio-first assistants while other customers create audio-based language learning assistants, tutors or call summarizers.
These combined audio and text experiences are a marquee example of the power of compound AI. Compound AI describes AI systems built from multiple components like models, API calls, data stores and/or retrievers. These systems stand in contrast to relying on a single call to one large foundation model.
Why speed matters
Low latency is critical for engaging audio experiences. If users have to wait multiple seconds for an audio agent to respond, or minutes for an audio file to be processed, product immersion quickly breaks and users become frustrated. Today, users expect generative AI to respond immediately, not in minutes or hours.
Fireworks offers 2 main audio API options:
Besides simply providing these models, Fireworks provides tooling for improving audio quality and usability for your application. Across the 3 APIs, we support the following options:
These APIs come in 2 deployment methods:
The fastest Whisper
We’ve built the fastest Whisper endpoint as measured by Artificial Analysis. Our speed improvements come from serving-stack innovations drawn from years of experience building on PyTorch.
The asynchronous speech-to-text API can transcribe or translate 1 hour of audio in 4 seconds (900x real-time). Dedicated endpoints deliver:
We also observe large speed wins with our real-time speech API: audio chunks are transcribed with ~200ms latency, enabling natural, human-feeling experiences.
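To feed a streaming API at that kind of latency, a client typically splits its audio into small fixed-duration chunks. The sketch below is an illustrative helper (not Fireworks client code) that slices raw 16 kHz mono 16-bit PCM into 100 ms chunks:

```python
def chunk_pcm(audio: bytes, chunk_ms: int = 100,
              sample_rate: int = 16_000, bytes_per_sample: int = 2) -> list[bytes]:
    """Split raw mono PCM audio into fixed-duration chunks for streaming."""
    chunk_bytes = sample_rate * bytes_per_sample * chunk_ms // 1000
    return [audio[i:i + chunk_bytes] for i in range(0, len(audio), chunk_bytes)]

# 1 second of silent 16 kHz mono 16-bit PCM
one_second = bytes(16_000 * 2)
chunks = chunk_pcm(one_second)
print(len(chunks), len(chunks[0]))  # 10 chunks of 3200 bytes each
```

Each 100 ms chunk can then be sent to the real-time endpoint as it is captured, so transcription latency is bounded by chunk size plus the ~200ms model latency rather than by the length of the whole recording.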
Async Transcription Measurements
Tested using FLAC-encoded mono audio at a 16,000 Hz sample rate, on a dedicated endpoint with 1 FAU.
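As a quick sanity check, the claimed real-time factor follows directly from the measurement above: one hour of audio processed in four seconds of wall-clock time.

```python
# Back-of-envelope check of the claimed 900x real-time factor.
audio_seconds = 60 * 60      # 1 hour of input audio
processing_seconds = 4       # reported async transcription time

real_time_factor = audio_seconds / processing_seconds
print(real_time_factor)  # 900.0
```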
Maintaining quality
While making these speed optimizations, we ensured that quality was not compromised. The Word Error Rate (WER) metric is commonly used to evaluate speech recognition quality. We report a WER of 2.00% for greedy inference with whisper-v3-large on the LibriSpeech Clean dataset (see leaderboard), matching the original open-sourced whisper-v3-large.
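For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal, self-contained implementation (practical evaluations typically use a library such as jiwer, with text normalization applied first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic-programming table over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```

A 2.00% WER therefore means roughly 2 word-level errors (insertions, deletions, or substitutions) per 100 reference words.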
Many transcription use cases need more than just text transcriptions: they require information on when words, sentences and silent intervals occurred. For example:
Fireworks Audio APIs support text alignment out-of-the-box, providing start and end timestamps for each word and sentence. We support 3 aligner models: Gentle (the industry’s leading text aligner), mms_fa and tdnn_ffn. See below for an example of aligner output, generated from this example notebook.
To provide the fastest, most efficient transcription, Fireworks innovated on audio deployment methodology by offering Gentle on GPU compute, instead of CPU compute. This optimization enables up to 10x more efficient compute for Gentle-based alignment, with no accuracy impact. Fireworks can align 1 hour of audio in 10 seconds. Alignment is included at no extra charge.
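Once you have per-word timestamps, downstream tasks like finding silent intervals become simple post-processing. The sketch below assumes a hypothetical aligner output shape (a list of word dicts with `start`/`end` in seconds); the actual Fireworks response format may differ, so check the aligner output in the example notebook:

```python
# Hypothetical aligner output: each word with start/end timestamps in seconds.
words = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
    {"word": "again", "start": 2.1, "end": 2.6},
]

def silent_intervals(words: list[dict], min_gap: float = 0.5) -> list[tuple]:
    """Return (start, end) pairs where the gap between consecutive words exceeds min_gap."""
    gaps = []
    for prev, cur in zip(words, words[1:]):
        if cur["start"] - prev["end"] > min_gap:
            gaps.append((prev["end"], cur["start"]))
    return gaps

print(silent_intervals(words))  # [(0.9, 2.1)]
```

The same timestamps drive subtitle generation, word-level highlighting, and clip extraction.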
Beyond offering this aligner, we also offer other transcription optimizations and tools:
Get started with the serverless Fireworks audio endpoints today. These endpoints are offered free for the next two weeks. Users will be notified before pricing is enabled or if they run into rate limits.
Try the endpoints in code, or try them in our UI playground, which lets you record or upload audio directly and run it through Fireworks’ audio APIs. Get code to recreate your UI call in one click.
For inspiration, try out this cookbook. For the best performance, contact us at [email protected] for a dedicated endpoint.
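A typical client call is a multipart POST of the audio file plus a model name. The endpoint URL, field names, and model identifier below are illustrative assumptions, not the documented Fireworks API surface; consult the official docs or the cookbook above before use.

```python
import os

def build_transcription_request(audio_path: str, model: str = "whisper-v3"):
    """Assemble arguments for a requests.post() call to a speech-to-text endpoint.

    NOTE: the URL, field names, and model id here are illustrative assumptions,
    not the documented Fireworks API -- verify against the official docs.
    """
    url = "https://api.fireworks.ai/inference/v1/audio/transcriptions"  # assumed endpoint
    headers = {"Authorization": f"Bearer {os.environ.get('FIREWORKS_API_KEY', '')}"}
    files = {"file": open(audio_path, "rb")}
    data = {"model": model}
    return url, headers, files, data

# Usage sketch (requires the third-party `requests` package and a valid API key):
#   import requests
#   url, headers, files, data = build_transcription_request("meeting.flac")
#   response = requests.post(url, headers=headers, files=files, data=data)
#   print(response.json())
```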
Audio adds to a growing list of modalities on Fireworks, including text, image, vision understanding and embeddings models. Fireworks makes it easy to build compound AI systems, by providing one place for:
Keep in touch with us on Discord or Twitter. Stay tuned for more updates coming soon about Fireworks and audio!