Whisper V3
Serverless, Audio
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.
Whisper V3 is available via Fireworks' Speech-to-Text APIs, where you are billed based on the duration of the transcribed audio. The API supports multiple languages and additional features, including forced alignment.
You can call the Fireworks Speech-to-Text API using HTTP requests from any language. You can see the API references here:
Transcribe audio to text in the language in which the audio was spoken.
import requests

with open("<AUDIO_FILE_PATH>", "rb") as f:
    response = requests.post(
        "https://api.fireworks.ai/inference/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer <YOUR_API_KEY>"},
        files={"file": f},
        data={
            "model": "whisper-v3",
            "vad_model": "silero",
            "alignment_model": "tdnn_ffn",
            "preprocessing": "none",
            "temperature": "0",
            "timestamp_granularities": "segment"
        },
    )

if response.status_code == 200:
    print(response.json())
else:
    print(f"Error: {response.status_code}", response.text)
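With `timestamp_granularities` set to `segment`, the response may include per-segment timing. The sketch below turns such a response into readable timestamped lines. It assumes an OpenAI-style payload with a `segments` list whose entries carry `start` and `end` (in seconds) and `text`; that shape is not confirmed on this page, and the helper names `format_timestamp` and `format_segments` are hypothetical, so check the API reference for the actual fields before relying on it.

```python
def format_timestamp(seconds):
    """Render a number of seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(payload):
    """Return one '[start -> end] text' line per segment.

    Assumes payload["segments"] is a list of dicts with "start", "end"
    and "text" keys (an assumed response shape). Returns an empty list
    when the payload has no segments.
    """
    lines = []
    for seg in payload.get("segments") or []:
        start = format_timestamp(float(seg["start"]))
        end = format_timestamp(float(seg["end"]))
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return lines
```

If the assumed shape holds, `format_segments(response.json())` could be printed line by line after a successful request.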
Translate audio from many languages into English text.
import requests

with open("<AUDIO_FILE_PATH>", "rb") as f:
    response = requests.post(
        "https://api.fireworks.ai/inference/v1/audio/translations",
        headers={"Authorization": f"Bearer <YOUR_API_KEY>"},
        files={"file": f},
        data={
            "model": "whisper-v3",
            "vad_model": "silero",
            "alignment_model": "tdnn_ffn",
            "preprocessing": "none",
            "temperature": "0",
            "timestamp_granularities": "segment"
        },
    )

if response.status_code == 200:
    print(response.json())
else:
    print(f"Error: {response.status_code}", response.text)
Run forced alignment over audio and a transcript. That is, compute the start and end timestamps of each word in the transcript and return them.
import requests

with open("<AUDIO_FILE_PATH>", "rb") as f:
    response = requests.post(
        "https://api.fireworks.ai/inference/v1/audio/alignments",
        headers={"Authorization": f"Bearer <YOUR_API_KEY>"},
        files={"file": f},
        data={
            "text": "<TEXT_TO_ALIGN>",
            "vad_model": "silero",
            "alignment_model": "tdnn_ffn",
            "preprocessing": "none",
            "temperature": "0",
            "timestamp_granularities": "segment"
        },
    )

if response.status_code == 200:
    print(response.json())
else:
    print(f"Error: {response.status_code}", response.text)
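The transcription, translation, and alignment requests differ only in the endpoint path and in whether they send a `model` field or a `text` field. A minimal sketch of a shared helper follows. The names `build_request` and `call_audio_api` are hypothetical, the option values simply mirror the examples on this page, and the 300-second timeout is an arbitrary choice rather than a documented limit.

```python
BASE_URL = "https://api.fireworks.ai/inference/v1/audio"

# Form fields shared by all three example requests on this page.
COMMON_OPTIONS = {
    "vad_model": "silero",
    "alignment_model": "tdnn_ffn",
    "preprocessing": "none",
    "temperature": "0",
    "timestamp_granularities": "segment",
}

TASKS = ("transcriptions", "translations", "alignments")


def build_request(task, text=None):
    """Return (url, form_data) for one of the three audio endpoints."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    data = dict(COMMON_OPTIONS)
    if task == "alignments":
        # Alignment takes the transcript to align instead of a model name.
        if not text:
            raise ValueError("alignment requires the transcript text")
        data["text"] = text
    else:
        data["model"] = "whisper-v3"
    return f"{BASE_URL}/{task}", data


def call_audio_api(task, audio_path, api_key, text=None):
    """Upload an audio file and return the parsed JSON response."""
    # Imported here so build_request stays usable without the dependency.
    import requests

    url, data = build_request(task, text)
    with open(audio_path, "rb") as f:
        response = requests.post(
            url,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data=data,
            timeout=300,  # arbitrary; long audio may need more
        )
    response.raise_for_status()  # raise on non-2xx instead of printing
    return response.json()
```

For example, `call_audio_api("translations", "<AUDIO_FILE_PATH>", "<YOUR_API_KEY>")` would send the same request as the translation example, raising an exception on an error status instead of printing it.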