New to Fireworks? Start with the Serverless Quickstart for a step-by-step guide to making your first API call.
Fireworks provides fast, cost-effective access to leading open-source text models through OpenAI-compatible APIs. Query models via serverless inference or dedicated deployments using the chat completions API (recommended), completions API, or responses API. Browse 100+ available models →

Chat Completions API

  • Python
  • JavaScript
  • curl
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{
        "role": "user",
        "content": "Explain quantum computing in simple terms"
    }]
)

print(response.choices[0].message.content)
Most models automatically format your messages with the correct template. To verify the exact prompt used, enable the echo parameter.

Alternative query methods

Fireworks also supports Completions API and Responses API.
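For example, a raw-prompt request through the Completions API using the same client might look like this sketch (the model and prompt are illustrative; see the Completions API reference for full details):
# A sketch of the Completions API: it takes a raw prompt string instead of chat messages.
response = client.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    prompt="Explain quantum computing in simple terms",
    max_tokens=200,
)

print(response.choices[0].text)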

Querying dedicated deployments

For consistent performance, guaranteed capacity, or higher throughput, you can query on-demand deployments instead of serverless models. Deployments use the same APIs with a deployment-specific model identifier:
<MODEL_NAME>#<DEPLOYMENT_NAME>
For example:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    messages=[{"role": "user", "content": "Hello"}]
)

Common patterns

Multi-turn conversations

Maintain conversation history by including all previous messages:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's its population?"}
]

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=messages
)

print(response.choices[0].message.content)
The model uses the full conversation history to provide contextually relevant responses.

System prompts

Override the default system prompt by setting the first message with role: "system":
messages = [
    {"role": "system", "content": "You are a helpful Python expert who provides concise code examples."},
    {"role": "user", "content": "How do I read a CSV file?"}
]

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=messages
)
To completely omit the system prompt, set the first message’s content to an empty string.
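For example:
messages = [
    {"role": "system", "content": ""},  # Empty system prompt: no default is applied
    {"role": "user", "content": "What's the capital of France?"}
]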

Streaming responses

Stream tokens as they’re generated for a real-time, interactive experience. Covered in detail in the Serverless Quickstart.
stream = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Aborting streams: Close the connection to stop generation and avoid billing for ungenerated tokens:
for chunk in stream:
    print(chunk.choices[0].delta.content, end="")
    if some_condition:
        stream.close()
        break

Async requests

Use async clients to make multiple concurrent requests for better throughput:
  • Python
  • JavaScript
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

async def main():
    response = await client.chat.completions.create(
        model="accounts/fireworks/models/deepseek-v3p1",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(response.choices[0].message.content)

asyncio.run(main())
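The example above issues a single request. To realize the throughput benefit, fan several requests out concurrently, for example with asyncio.gather (the prompts below are illustrative; this reuses the client defined above):
async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="accounts/fireworks/models/deepseek-v3p1",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main():
    prompts = ["Define latency", "Define throughput", "Define tail latency"]
    # gather() runs the requests concurrently instead of one after another
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for prompt, answer in zip(prompts, answers):
        print(f"{prompt}: {answer}")

asyncio.run(main())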

Usage & performance tracking

Every response includes token usage information and performance metrics for debugging and observability. For aggregate metrics over time, see the usage dashboard.
  • Token usage (prompt, completion, total tokens) is included in the response body for all requests.
  • Performance metrics (latency, time-to-first-token, etc.) are included in response headers for non-streaming requests. For streaming requests, use the perf_metrics_in_response parameter to include all metrics in the response body.
  • Non-streaming
  • Streaming (usage only)
  • Streaming (with performance metrics)
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}]
)

# Token usage (always included)
print(response.usage.prompt_tokens)      # Tokens in your prompt
print(response.usage.completion_tokens)  # Tokens generated
print(response.usage.total_tokens)       # Total tokens billed

# Performance metrics are in response headers:
# fireworks-prompt-tokens, fireworks-server-time-to-first-token, etc.
Usage information is automatically included in the final chunk for streaming responses (the chunk with finish_reason set). This is a Fireworks extension; the OpenAI SDK doesn’t return usage for streaming by default.
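A streaming request that opts into body-level performance metrics might look like the following sketch. Because perf_metrics_in_response is Fireworks-specific, it is passed through the OpenAI SDK’s extra_body; the exact metric field names are listed in the API reference.
stream = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    extra_body={"perf_metrics_in_response": True},  # Fireworks-specific parameter
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    if chunk.usage:  # Final chunk: usage and the requested metrics arrive in the body
        print(f"\nTotal tokens: {chunk.usage.total_tokens}")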
For all available metrics and details, see the API reference documentation.
If you encounter errors during inference, see Inference Error Codes for common issues and resolutions.

Advanced capabilities

Extend text models with additional features for structured outputs, tool integration, and performance optimization.

Configuration & debugging

Control how the model generates text. Fireworks automatically uses recommended sampling parameters from each model’s HuggingFace generation_config.json when you don’t specify them explicitly, ensuring optimal performance out of the box. We pull temperature, top_k, top_p, min_p, and typical_p from the model’s configuration when not explicitly provided.

Temperature

Adjust randomness (0 = deterministic, higher = more creative):
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=0.7  # Override model default
)

Max tokens

Control the maximum number of tokens in the generated completion:
max_tokens=100  # Generate at most 100 tokens
Important notes:
  • Default value is 2048 tokens if not specified
  • Most models support up to their full context window (e.g., 128K for DeepSeek R1)
  • When the limit is reached, you’ll see "finish_reason": "length" in the response
Set max_tokens appropriately for your use case to avoid truncated responses. Check the model’s context window in the Model Library.
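For example, you can detect a truncated response by checking finish_reason:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Summarize the history of computing"}],
    max_tokens=100,
)

if response.choices[0].finish_reason == "length":
    print("Hit the max_tokens limit; the output may be truncated")
print(response.choices[0].message.content)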

Top-p (nucleus sampling)

Consider only the most probable tokens summing to top_p probability mass:
top_p=0.9  # Consider top 90% probability mass

Top-k

Consider only the k most probable tokens:
top_k=50  # Consider top 50 tokens

Min-p

Exclude tokens below a probability threshold:
min_p=0.05  # Exclude tokens with <5% probability

Typical-p

Use typical sampling to select tokens with probability close to the entropy of the distribution:
typical_p=0.95  # Consider tokens with typical probability

Repetition penalties

Reduce repetitive text with frequency_penalty, presence_penalty, or repetition_penalty:
frequency_penalty=0.5,   # Penalize frequent tokens (OpenAI compatible)
presence_penalty=0.5,    # Penalize any repeated token (OpenAI compatible)
repetition_penalty=1.1   # Exponential penalty from prompt + output
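To combine several of these controls in a single call, one possible sketch (the values are illustrative; parameters outside the OpenAI SDK signature are passed through extra_body):
# Sampling parameters not in the OpenAI SDK signature (top_k, min_p,
# repetition_penalty) go through extra_body.
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=0.7,
    top_p=0.9,
    frequency_penalty=0.5,
    extra_body={"top_k": 50, "min_p": 0.05, "repetition_penalty": 1.1},
)

print(response.choices[0].message.content)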

Sampling options header

The fireworks-sampling-options header contains the actual default sampling parameters used for the model, including values from the model’s HuggingFace generation_config.json:
  • Python
  • JavaScript
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

response = client.chat.completions.with_raw_response.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}]
)

# Access headers from the raw response
sampling_options = response.headers.get('fireworks-sampling-options')
print(sampling_options)  # e.g., '{"temperature": 0.7, "top_p": 0.9}'

completion = response.parse()  # get the parsed response object
print(completion.choices[0].message.content)
See the API reference for detailed parameter descriptions.
Generate multiple completions in one request:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    n=3  # Generate 3 different jokes
)

for choice in response.choices:
    print(choice.message.content)
Inspect token probabilities for debugging or analysis:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    logprobs=True,
    top_logprobs=5  # Show top 5 alternatives per token
)

for content in response.choices[0].logprobs.content:
    print(f"Token: {content.token}, Logprob: {content.logprob}")
Verify how your prompt was formatted.
Echo: Return the prompt along with the generation:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    echo=True
)
Raw output: See raw token IDs and prompt fragments:
Experimental API - may change without notice.
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    raw_output=True
)

print(response.raw_output.prompt_token_ids)  # Token IDs
print(response.raw_output.completion)        # Raw completion
Force generation to continue past the end-of-sequence token (useful for benchmarking):
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    ignore_eos=True,
    max_tokens=100  # Will always generate exactly 100 tokens
)
Output quality may degrade when ignoring EOS. This API is experimental and should not be relied upon for production use cases.
Modify token probabilities to encourage or discourage specific tokens:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    logit_bias={
        123: 10.0,   # Strongly encourage token ID 123
        456: -50.0   # Strongly discourage token ID 456
    }
)
Control perplexity dynamically using the Mirostat algorithm:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    mirostat_target=5.0,  # Target perplexity
    mirostat_lr=0.1       # Learning rate for adjustments
)

Understanding tokens

Language models process text in chunks called tokens. In English, a token can be as short as one character or as long as one word. Different model families use different tokenizers, so the same text may translate to different token counts depending on the model.
Why tokens matter:
  • Models have maximum context lengths measured in tokens
  • Pricing is based on token usage (prompt + completion)
  • Token count affects response time
For Llama models, use this tokenizer tool to estimate token counts. Actual usage is returned in the usage field of every API response.
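If you want a rough offline estimate, a tokenizer library can count tokens locally. The sketch below assumes the Hugging Face transformers package and a Llama tokenizer; both are assumptions for illustration, not part of the Fireworks API:
# Illustrative only: counting tokens locally with a Hugging Face tokenizer.
# The tokenizer ID is an assumption; pick the one matching your target model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
text = "Explain quantum computing in simple terms"
print(len(tokenizer.encode(text)))  # Approximate token count for this text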

OpenAI SDK Migration

Fireworks provides an OpenAI-compatible API, making migration straightforward. However, there are some minor differences to be aware of:

Behavioral differences

stop parameter:
  • Fireworks: Returns text including the stop word
  • OpenAI: Omits the stop word
  • You can easily truncate it client-side if needed (see the sketch below)
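For example, assuming the request was made with stop=["END"], a minimal client-side cleanup might be (the stop string is illustrative):
# Fireworks includes the stop sequence in the returned text; strip it if needed.
stop = "END"  # illustrative stop sequence
text = response.choices[0].message.content
if text.endswith(stop):
    text = text[: -len(stop)]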
max_tokens with context limits:
  • Fireworks: Automatically adjusts max_tokens lower if prompt + max_tokens exceeds the model’s context window
  • OpenAI: Returns an invalid request error
  • Control this behavior with the context_length_exceeded_behavior parameter (a sketch follows this list)
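A minimal sketch of setting this behavior: the parameter is Fireworks-specific, so it goes through the OpenAI SDK’s extra_body, and the value shown here ("error", requesting OpenAI-style rejection instead of automatic adjustment) is an assumption; check the API reference for the accepted values.
# context_length_exceeded_behavior is Fireworks-specific; the value "error" is an
# assumed option that rejects over-length requests instead of adjusting max_tokens.
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=4096,
    extra_body={"context_length_exceeded_behavior": "error"},
)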
Streaming usage stats:
  • Fireworks: Returns usage field in the final chunk (where finish_reason is set) for both streaming and non-streaming
  • OpenAI: Only returns usage for non-streaming responses
Example accessing streaming usage:
for chunk in client.chat.completions.create(stream=True, ...):
    if chunk.usage:  # Available in final chunk
        print(f"Tokens: {chunk.usage.total_tokens}")

Unsupported parameters

The following OpenAI parameters are not yet supported:
  • best_of (use n instead)
  • functions (deprecated - use Tool Calling with the tools parameter instead)
Have a use case requiring one of these? Join our Discord to discuss.

Next steps