1. Open a WebSocket

Streaming transcription is performed over a WebSocket. Provide the transcription parameters and establish a WebSocket connection to the endpoint.

2. Stream audio and receive transcriptions

Stream short audio chunks (50-400ms) in binary frames of PCM 16-bit little-endian at 16kHz sample rate and single channel (mono). In parallel, receive transcription from the WebSocket.

URL

Please use the following serverless endpoint:

wss://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming

Headers

Authorization
string
required

Your Fireworks API key, e.g. Authorization=API_KEY. Alternatively, it can be provided as a query param.

Query Parameters

response_format
string
default:"verbose_json"

The format in which to return the response. Currently only verbose_json is recommended for streaming.

language
string | null

The target language for transcription. See the Supported Languages section below for a complete list of available languages.

prompt
string | null

The input prompt that the model will use when generating the transcription. It can be used to specify custom words or the style of the transcription. For example, the prompt Um, here's, uh, what was recorded. will encourage the model to include filler words in the transcription.

temperature
float
default:"0"

Sampling temperature to use when decoding text tokens during transcription.
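Putting the endpoint and query parameters together, the connection URL can be composed with the standard library. This is a minimal sketch; the parameter values shown are illustrative, not required:

```python
import urllib.parse

BASE_URL = "wss://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming"

# Illustrative parameter values; all three are optional.
params = urllib.parse.urlencode({
    "response_format": "verbose_json",
    "language": "en",
    "temperature": "0",
})
url = f"{BASE_URL}?{params}"
print(url)
```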

Streaming Audio

As noted in step 2, stream short audio chunks (50-400ms) in binary frames of PCM 16-bit little-endian at 16kHz sample rate and single channel (mono). Typically, you will:

  1. Resample your audio to 16 kHz if it is not already.
  2. Convert it to mono.
  3. Send 50ms chunks (16,000 Hz * 0.05s = 800 samples) of audio in 16-bit PCM (signed, little-endian) format.
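The chunking step above can be sketched with only the standard library, assuming the audio has already been resampled to 16 kHz mono floats in [-1.0, 1.0]:

```python
import struct

SAMPLE_RATE = 16000
CHUNK_MS = 50
# 800 samples per chunk, 2 bytes per 16-bit sample
BYTES_PER_CHUNK = (SAMPLE_RATE * CHUNK_MS // 1000) * 2

def pcm16_chunks(samples):
    """Yield 50 ms binary frames from mono float samples in [-1.0, 1.0]."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    raw = struct.pack(f"<{len(ints)}h", *ints)  # "<h" = little-endian int16
    for i in range(0, len(raw), BYTES_PER_CHUNK):
        yield raw[i:i + BYTES_PER_CHUNK]

# One second of silence -> 20 chunks of 1,600 bytes each
chunks = list(pcm16_chunks([0.0] * SAMPLE_RATE))
```

Each yielded chunk can be sent directly as a binary WebSocket frame.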

Handling Responses

The client maintains a state dictionary, starting with an empty dictionary {}. When the server sends the first transcription message, it contains a list of segments. Each segment has an id and text:

# Server initial message:
{
    "segments": [
        {"id": "0", "text": "This is the first sentence"},
        {"id": "1", "text": "This is the second sentence"}
    ]
}

# Client initial state:
{
    "0": "This is the first sentence",
    "1": "This is the second sentence",
}

When the server sends the next updates to the transcription, the client updates the state dictionary based on the segment id:

# Server continuous message:
{
    "segments": [
        {"id": "1", "text": "This is the second sentence modified"},
        {"id": "2", "text": "This is the third sentence"}
    ]
}

# Client updated state:
{
    "0": "This is the first sentence",
    "1": "This is the second sentence modified",   # overwritten
    "2": "This is the third sentence",             # new
}
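The merge logic described above amounts to overwriting segments by id. A minimal sketch, using the example messages from this section:

```python
def apply_update(state: dict, message: dict) -> dict:
    """Merge a server message into the client's segment state, keyed by id."""
    for segment in message.get("segments", []):
        state[segment["id"]] = segment["text"]  # overwrite existing, add new
    return state

state = {}
apply_update(state, {"segments": [
    {"id": "0", "text": "This is the first sentence"},
    {"id": "1", "text": "This is the second sentence"},
]})
apply_update(state, {"segments": [
    {"id": "1", "text": "This is the second sentence modified"},
    {"id": "2", "text": "This is the third sentence"},
]})
```

Joining the values of `state` in id order yields the current full transcript at any point in the stream.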

Example Usage

Check out the brief Python example below, or see the full example sources:

!pip3 install torch torchaudio websocket-client

import time
import json
import torch
import torchaudio
import threading
import websocket
import urllib.parse

# Prepare audio: load a local file, downmix to mono, resample to 16 kHz,
# and split into 50 ms chunks of 16-bit little-endian PCM.
# "audio.wav" is a placeholder path; substitute your own recording.
chunk_size_ms = 50
waveform, sample_rate = torchaudio.load("audio.wav")
waveform = torch.mean(waveform, dim=0, keepdim=True)  # stereo -> mono
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
pcm16 = (waveform[0].clamp(-1.0, 1.0) * 32767).to(torch.int16)
audio_bytes = pcm16.numpy().tobytes()
bytes_per_chunk = (16000 * chunk_size_ms // 1000) * 2  # 800 samples * 2 bytes
audio_chunk_bytes = [
    audio_bytes[i : i + bytes_per_chunk]
    for i in range(0, len(audio_bytes), bytes_per_chunk)
]

lock = threading.Lock()
state = {}

def on_open(ws):
    def send_audio_chunks():
        for chunk in audio_chunk_bytes:
            ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
            time.sleep(chunk_size_ms / 1000)

        final_checkpoint = json.dumps({"checkpoint_id": "final"})
        ws.send(final_checkpoint, opcode=websocket.ABNF.OPCODE_TEXT)

    threading.Thread(target=send_audio_chunks).start()

def on_message(ws, message):
    message = json.loads(message)
    if message.get("checkpoint_id") == "final":
        ws.close()
        return

    update = {s["id"]: s["text"] for s in message["segments"]}
    with lock:
        state.update(update)
        print("\n".join(f" - {k}: {v}" for k, v in state.items()))

def on_error(ws, error):
    print(f"WebSocket error: {error}")

# Open a connection URL with query params
url = "wss://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming"
params = urllib.parse.urlencode({
    "language": "en",
})
ws = websocket.WebSocketApp(
    f"{url}?{params}",
    header={"Authorization": "<FIREWORKS_API_KEY>"},
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
)
ws.run_forever()

Dedicated endpoint

For fixed throughput and predictable SLAs, you may request a dedicated endpoint for streaming transcription at [email protected] or on Discord.

Supported Languages

The following languages are supported for transcription:

Language Code   Language Name
en              English
zh              Chinese
de              German
es              Spanish
ru              Russian
ko              Korean
fr              French
ja              Japanese
pt              Portuguese
tr              Turkish
pl              Polish
ca              Catalan
nl              Dutch
ar              Arabic
sv              Swedish
it              Italian
id              Indonesian
hi              Hindi
fi              Finnish
vi              Vietnamese
he              Hebrew
uk              Ukrainian
el              Greek
ms              Malay
cs              Czech
ro              Romanian
da              Danish
hu              Hungarian
ta              Tamil
no              Norwegian
th              Thai
ur              Urdu
hr              Croatian
bg              Bulgarian
lt              Lithuanian
la              Latin
mi              Maori
ml              Malayalam
cy              Welsh
sk              Slovak
te              Telugu
fa              Persian
lv              Latvian
bn              Bengali
sr              Serbian
az              Azerbaijani
sl              Slovenian
kn              Kannada
et              Estonian
mk              Macedonian
br              Breton
eu              Basque
is              Icelandic
hy              Armenian
ne              Nepali
mn              Mongolian
bs              Bosnian
kk              Kazakh
sq              Albanian
sw              Swahili
gl              Galician
mr              Marathi
pa              Punjabi
si              Sinhala
km              Khmer
sn              Shona
yo              Yoruba
so              Somali
af              Afrikaans
oc              Occitan
ka              Georgian
be              Belarusian
tg              Tajik
sd              Sindhi
gu              Gujarati
am              Amharic
yi              Yiddish
lo              Lao
uz              Uzbek
fo              Faroese
ht              Haitian Creole
ps              Pashto
tk              Turkmen
nn              Nynorsk
mt              Maltese
sa              Sanskrit
lb              Luxembourgish
my              Myanmar
bo              Tibetan
tl              Tagalog
mg              Malagasy
as              Assamese
tt              Tatar
haw             Hawaiian
ln              Lingala
ha              Hausa
ba              Bashkir
jw              Javanese
su              Sundanese
yue             Cantonese