1. Open a WebSocket

Streaming transcription is performed over a WebSocket. Provide the transcription parameters and establish a WebSocket connection to the endpoint.

2. Stream audio and receive transcriptions

Stream short audio chunks (50-400ms) in binary frames of PCM 16-bit little-endian at 16kHz sample rate and single channel (mono). In parallel, receive transcription from the WebSocket.

URL

Please use the following serverless endpoint:

wss://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming

Headers

Authorization
string
required

Your Fireworks API key, e.g. Authorization=API_KEY. Alternatively, it can be provided as a query param.

Query Parameters

response_format
string
default:"verbose_json"

The format in which to return the response. Currently only verbose_json is recommended for streaming.

language
string | null

The target language for transcription. See the Supported Languages section below for a complete list of available languages.

prompt
string | null

The input prompt that the model will use when generating the transcription. It can be used to specify custom words or the style of the transcription. For example, the prompt Um, here's, uh, what was recorded. will encourage the model to include filler words in the transcription.

temperature
float
default:"0"

Sampling temperature to use when decoding text tokens during transcription.
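Putting the endpoint and query parameters together, the connection URL can be composed with the standard library. This is a minimal sketch; the parameter values shown are illustrative, not required:

```python
import urllib.parse

BASE_URL = "wss://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming"

# Illustrative parameter values; all three are optional.
params = urllib.parse.urlencode({
    "response_format": "verbose_json",
    "language": "en",
    "temperature": "0",
})
url = f"{BASE_URL}?{params}"
print(url)
```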

Streaming Audio

As noted in step 2, stream short audio chunks (50-400ms) in binary frames of PCM 16-bit little-endian at 16kHz sample rate and single channel (mono). Typically, you will:

  1. Resample your audio to 16 kHz if it is not already.
  2. Convert it to mono.
  3. Send 50ms chunks (16,000 Hz * 0.05s = 800 samples) of audio in 16-bit PCM (signed, little-endian) format.
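The chunking step above can be sketched with only the standard library, assuming the audio has already been resampled to 16 kHz mono floats in [-1.0, 1.0]:

```python
import struct

SAMPLE_RATE = 16000
CHUNK_MS = 50
# 800 samples per chunk, 2 bytes per 16-bit sample
BYTES_PER_CHUNK = (SAMPLE_RATE * CHUNK_MS // 1000) * 2

def pcm16_chunks(samples):
    """Yield 50 ms binary frames from mono float samples in [-1.0, 1.0]."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    raw = struct.pack(f"<{len(ints)}h", *ints)  # "<h" = little-endian int16
    for i in range(0, len(raw), BYTES_PER_CHUNK):
        yield raw[i:i + BYTES_PER_CHUNK]

# One second of silence -> 20 chunks of 1,600 bytes each
chunks = list(pcm16_chunks([0.0] * SAMPLE_RATE))
```

Each yielded chunk can be sent directly as a binary WebSocket frame.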

Handling Responses

The client maintains a state dictionary, starting with an empty dictionary {}. When the server sends the first transcription message, it contains a list of segments. Each segment has an id and text:

# Server initial message:
{
    "segments": [
        {"id": "0", "text": "This is the first sentence"},
        {"id": "1", "text": "This is the second sentence"}
    ]
}

# Client initial state:
{
    "0": "This is the first sentence",
    "1": "This is the second sentence",
}

When the server sends the next updates to the transcription, the client updates the state dictionary based on the segment id:

# Server continuous message:
{
    "segments": [
        {"id": "1", "text": "This is the second sentence modified"},
        {"id": "2", "text": "This is the third sentence"}
    ]
}

# Client updated state:
{
    "0": "This is the first sentence",
    "1": "This is the second sentence modified",   # overwritten
    "2": "This is the third sentence",             # new
}
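The merge logic described above amounts to overwriting segments by id. A minimal sketch, using the example messages from this section:

```python
def apply_update(state: dict, message: dict) -> dict:
    """Merge a server message into the client's segment state, keyed by id."""
    for segment in message.get("segments", []):
        state[segment["id"]] = segment["text"]  # overwrite existing, add new
    return state

state = {}
apply_update(state, {"segments": [
    {"id": "0", "text": "This is the first sentence"},
    {"id": "1", "text": "This is the second sentence"},
]})
apply_update(state, {"segments": [
    {"id": "1", "text": "This is the second sentence modified"},
    {"id": "2", "text": "This is the third sentence"},
]})
```

Joining the values of `state` in id order yields the current full transcript at any point in the stream.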

Example Usage

Check out the brief Python example below, or see the full example sources:

!pip3 install torch torchaudio websocket-client

import time
import json
import torch
import torchaudio
import threading
import websocket
import urllib.parse

# Prepare audio: load a local file, downmix to mono, resample to 16 kHz,
# and split into 50 ms chunks of 16-bit little-endian PCM.
# "audio.wav" is a placeholder path; substitute your own recording.
chunk_size_ms = 50
waveform, sample_rate = torchaudio.load("audio.wav")
waveform = torch.mean(waveform, dim=0, keepdim=True)  # stereo -> mono
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
pcm16 = (waveform[0].clamp(-1.0, 1.0) * 32767).to(torch.int16)
audio_bytes = pcm16.numpy().tobytes()
bytes_per_chunk = (16000 * chunk_size_ms // 1000) * 2  # 800 samples * 2 bytes
audio_chunk_bytes = [
    audio_bytes[i : i + bytes_per_chunk]
    for i in range(0, len(audio_bytes), bytes_per_chunk)
]

lock = threading.Lock()
state = {}

def on_open(ws):
    def send_audio_chunks():
        for chunk in audio_chunk_bytes:
            ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
            time.sleep(chunk_size_ms / 1000)

        final_checkpoint = json.dumps({"checkpoint_id": "final"})
        ws.send(final_checkpoint, opcode=websocket.ABNF.OPCODE_TEXT)

    threading.Thread(target=send_audio_chunks).start()

def on_message(ws, message):
    message = json.loads(message)
    if message.get("checkpoint_id") == "final":
        ws.close()
        return

    update = {s["id"]: s["text"] for s in message["segments"]}
    with lock:
        state.update(update)
        print("\n".join(f" - {k}: {v}" for k, v in state.items()))

def on_error(ws, error):
    print(f"WebSocket error: {error}")

# Open a connection URL with query params
url = "wss://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming"
params = urllib.parse.urlencode({
    "language": "en",
})
ws = websocket.WebSocketApp(
    f"{url}?{params}",
    header={"Authorization": "<FIREWORKS_API_KEY>"},
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
)
ws.run_forever()

Dedicated endpoint

For fixed throughput and predictable SLAs, you may request a dedicated endpoint for streaming transcription at [email protected] or on Discord.

Supported Languages

The following languages are supported for transcription:

Language Code   Language Name
en              English
zh              Chinese
de              German
es              Spanish
ru              Russian
ko              Korean
fr              French
ja              Japanese
pt              Portuguese
tr              Turkish
pl              Polish
ca              Catalan
nl              Dutch
ar              Arabic
sv              Swedish
it              Italian
id              Indonesian
hi              Hindi
fi              Finnish
vi              Vietnamese
he              Hebrew
uk              Ukrainian
el              Greek
ms              Malay
cs              Czech
ro              Romanian
da              Danish
hu              Hungarian
ta              Tamil
no              Norwegian
th              Thai
ur              Urdu
hr              Croatian
bg              Bulgarian
lt              Lithuanian
la              Latin
mi              Maori
ml              Malayalam
cy              Welsh
sk              Slovak
te              Telugu
fa              Persian
lv              Latvian
bn              Bengali
sr              Serbian
az              Azerbaijani
sl              Slovenian
kn              Kannada
et              Estonian
mk              Macedonian
br              Breton
eu              Basque
is              Icelandic
hy              Armenian
ne              Nepali
mn              Mongolian
bs              Bosnian
kk              Kazakh
sq              Albanian
sw              Swahili
gl              Galician
mr              Marathi
pa              Punjabi
si              Sinhala
km              Khmer
sn              Shona
yo              Yoruba
so              Somali
af              Afrikaans
oc              Occitan
ka              Georgian
be              Belarusian
tg              Tajik
sd              Sindhi
gu              Gujarati
am              Amharic
yi              Yiddish
lo              Lao
uz              Uzbek
fo              Faroese
ht              Haitian Creole
ps              Pashto
tk              Turkmen
nn              Nynorsk
mt              Maltese
sa              Sanskrit
lb              Luxembourgish
my              Myanmar
bo              Tibetan
tl              Tagalog
mg              Malagasy
as              Assamese
tt              Tatar
haw             Hawaiian
ln              Lingala
ha              Hausa
ba              Bashkir
jw              Javanese
su              Sundanese
yue             Cantonese