Whisper V3
ServerlessAudio
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.
Whisper V3 is available via Fireworks' Speech-to-Text APIs, where you are billed based on the duration of the transcribed audio. The API supports multiple languages and additional features, including forced alignment.
You can call the Fireworks Speech-to-Text API using HTTP requests from any language. See the API reference here:
Generate a model response using the speech-transcription endpoint of whisper-v3. API reference
import requests

# Open the audio file and send it as a multipart upload, along with the
# model name and decoding options.
with open("audio.mp3", "rb") as f:
    response = requests.post(
        "https://audio-prod.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions",
        headers={"Authorization": "Bearer <YOUR_API_KEY>"},
        files={"file": f},
        data={
            "model": "accounts/fireworks/models/whisper-v3",
            "temperature": "0",
            "vad_model": "silero",
        },
    )

if response.status_code == 200:
    print(response.json())
else:
    print(f"Error: {response.status_code}", response.text)
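Once a request succeeds, `response.json()` returns the transcription as a dictionary. As a minimal sketch of working with that result, the snippet below parses a hypothetical response: the top-level `"text"` field and the per-segment `start`/`end`/`text` fields shown here are assumptions modeled on common Whisper-style output, not a guaranteed schema, so check the API reference for the exact shape your request returns.

```python
import json

# Hypothetical transcription response; field names are assumptions
# modeled on typical Whisper-style output.
raw = json.dumps({
    "text": "Hello world. How are you?",
    "segments": [
        {"start": 0.0, "end": 1.2, "text": "Hello world."},
        {"start": 1.2, "end": 2.5, "text": "How are you?"},
    ],
})

response_body = json.loads(raw)

# Full transcript in one string.
transcript = response_body.get("text", "")

# One "[start-end] text" line per segment, useful for subtitles or review.
lines = [
    f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}"
    for seg in response_body.get("segments", [])
]
print(transcript)
print("\n".join(lines))
```

If your request only returns the plain transcript, the `segments` list is simply absent and the loop yields no lines.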