New to Fireworks? Start with the Serverless Quickstart for a step-by-step guide to making your first API call.
Fireworks provides fast, cost-effective access to leading open-source text models through OpenAI-compatible APIs. Query models via serverless inference or dedicated deployments using the chat completions API (recommended), completions API, or responses API. Browse 100+ available models →

Chat Completions API

  • Python
  • JavaScript
  • curl
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{
        "role": "user",
        "content": "Explain quantum computing in simple terms"
    }]
)

print(response.choices[0].message.content)
Most models automatically format your messages with the correct template. To verify the exact prompt used, enable the echo parameter.

Alternative query methods

Fireworks also supports Completions API and Responses API.
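For example, a raw-prompt request through the Completions API using the same client might look like this sketch (the model and prompt are illustrative; see the Completions API reference for full details):
# A sketch of the Completions API: it takes a raw prompt string instead of chat messages.
response = client.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    prompt="Explain quantum computing in simple terms",
    max_tokens=200,
)

print(response.choices[0].text)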

Querying dedicated deployments

For consistent performance, guaranteed capacity, or higher throughput, you can query on-demand deployments instead of serverless models. Deployments use the same APIs with a deployment-specific model identifier:
<MODEL_NAME>#<DEPLOYMENT_NAME>
For example:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    messages=[{"role": "user", "content": "Hello"}]
)

Common patterns

Multi-turn conversations

Maintain conversation history by including all previous messages:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's its population?"}
]

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=messages
)

print(response.choices[0].message.content)
The model uses the full conversation history to provide contextually relevant responses.

System prompts

Override the default system prompt by setting the first message with role: "system":
messages = [
    {"role": "system", "content": "You are a helpful Python expert who provides concise code examples."},
    {"role": "user", "content": "How do I read a CSV file?"}
]

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=messages
)
To completely omit the system prompt, set the first message’s content to an empty string.
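For example:
messages = [
    {"role": "system", "content": ""},  # Empty system prompt: no default is applied
    {"role": "user", "content": "What's the capital of France?"}
]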

Streaming responses

Stream tokens as they’re generated for a real-time, interactive experience. Covered in detail in the Serverless Quickstart.
stream = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Aborting streams: Close the connection to stop generation and avoid billing for ungenerated tokens:
for chunk in stream:
    print(chunk.choices[0].delta.content, end="")
    if some_condition:
        stream.close()
        break

Async requests

Use async clients to make multiple concurrent requests for better throughput:
  • Python
  • JavaScript
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

async def main():
    response = await client.chat.completions.create(
        model="accounts/fireworks/models/deepseek-v3p1",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(response.choices[0].message.content)

asyncio.run(main())
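The example above issues a single request. To realize the throughput benefit, fan several requests out concurrently, for example with asyncio.gather (the prompts below are illustrative; this reuses the client defined above):
async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="accounts/fireworks/models/deepseek-v3p1",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main():
    prompts = ["Define latency", "Define throughput", "Define tail latency"]
    # gather() runs the requests concurrently instead of one after another
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for prompt, answer in zip(prompts, answers):
        print(f"{prompt}: {answer}")

asyncio.run(main())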

Usage & performance tracking

Every response includes token usage information and performance metrics for debugging and observability. For aggregate metrics over time, see the usage dashboard.
  • Token usage (prompt, completion, total tokens) is included in the response body for all requests.
  • Performance metrics (latency, time-to-first-token, etc.) are included in response headers for non-streaming requests. For streaming requests, use the perf_metrics_in_response parameter to include all metrics in the response body.
  • Non-streaming
  • Streaming (usage only)
  • Streaming (with performance metrics)
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}]
)

# Token usage (always included)
print(response.usage.prompt_tokens)      # Tokens in your prompt
print(response.usage.completion_tokens)  # Tokens generated
print(response.usage.total_tokens)       # Total tokens billed

# Performance metrics are in response headers:
# fireworks-prompt-tokens, fireworks-server-time-to-first-token, etc.
Usage information is automatically included in the final chunk for streaming responses (the chunk with finish_reason set). This is a Fireworks extension; the OpenAI SDK doesn’t return usage for streaming by default.
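A streaming request that opts into body-level performance metrics might look like the following sketch. Because perf_metrics_in_response is Fireworks-specific, it is passed through the OpenAI SDK’s extra_body; the exact metric field names are listed in the API reference.
stream = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    extra_body={"perf_metrics_in_response": True},  # Fireworks-specific parameter
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    if chunk.usage:  # Final chunk: usage and the requested metrics arrive in the body
        print(f"\nTotal tokens: {chunk.usage.total_tokens}")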
For all available metrics and details, see the API reference documentation.
If you encounter errors during inference, see Inference Error Codes for common issues and resolutions.

Advanced capabilities

Extend text models with additional features for structured outputs, tool integration, and performance optimization.

Configuration & debugging

Control how the model generates text. Fireworks automatically uses recommended sampling parameters from each model’s HuggingFace generation_config.json when you don’t specify them explicitly, ensuring optimal performance out of the box. We pull temperature, top_k, top_p, min_p, and typical_p from the model’s configuration when not explicitly provided.

Temperature

Adjust randomness (0 = deterministic, higher = more creative):
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=0.7  # Override model default
)

Max tokens

Control the maximum number of tokens in the generated completion:
max_tokens=100  # Generate at most 100 tokens
Important notes:
  • Default value is 2048 tokens if not specified
  • Most models support up to their full context window (e.g., 128K for DeepSeek R1)
  • When the limit is reached, you’ll see "finish_reason": "length" in the response
Set max_tokens appropriately for your use case to avoid truncated responses. Check the model’s context window in the Model Library.
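For example, you can detect a truncated response by checking finish_reason:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Summarize the history of computing"}],
    max_tokens=100,
)

if response.choices[0].finish_reason == "length":
    print("Hit the max_tokens limit; the output may be truncated")
print(response.choices[0].message.content)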

Top-p (nucleus sampling)

Consider only the most probable tokens summing to top_p probability mass:
top_p=0.9  # Consider top 90% probability mass

Top-k

Consider only the k most probable tokens:
top_k=50  # Consider top 50 tokens

Min-p

Exclude tokens below a probability threshold:
min_p=0.05  # Exclude tokens with <5% probability

Typical-p

Use typical sampling to select tokens with probability close to the entropy of the distribution:
typical_p=0.95  # Consider tokens with typical probability

Repetition penalties

Reduce repetitive text with frequency_penalty, presence_penalty, or repetition_penalty:
frequency_penalty=0.5,   # Penalize frequent tokens (OpenAI compatible)
presence_penalty=0.5,    # Penalize any repeated token (OpenAI compatible)
repetition_penalty=1.1   # Exponential penalty from prompt + output
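To combine several of these controls in a single call, one possible sketch (the values are illustrative; parameters outside the OpenAI SDK signature are passed through extra_body):
# Sampling parameters not in the OpenAI SDK signature (top_k, min_p,
# repetition_penalty) go through extra_body.
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=0.7,
    top_p=0.9,
    frequency_penalty=0.5,
    extra_body={"top_k": 50, "min_p": 0.05, "repetition_penalty": 1.1},
)

print(response.choices[0].message.content)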

Sampling options header

The fireworks-sampling-options header contains the actual default sampling parameters used for the model, including values from the model’s HuggingFace generation_config.json:
  • Python
  • JavaScript
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

response = client.chat.completions.with_raw_response.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}]
)

# Access headers from the raw response
sampling_options = response.headers.get('fireworks-sampling-options')
print(sampling_options)  # e.g., '{"temperature": 0.7, "top_p": 0.9}'

completion = response.parse()  # get the parsed response object
print(completion.choices[0].message.content)
See the API reference for detailed parameter descriptions.
Generate multiple completions in one request:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    n=3  # Generate 3 different jokes
)

for choice in response.choices:
    print(choice.message.content)
Inspect token probabilities for debugging or analysis:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    logprobs=True,
    top_logprobs=5  # Show top 5 alternatives per token
)

for content in response.choices[0].logprobs.content:
    print(f"Token: {content.token}, Logprob: {content.logprob}")
Verify how your prompt was formatted.
Echo: Return the prompt along with the generation:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    echo=True
)
Raw output: See raw token IDs and prompt fragments:
Experimental API - may change without notice.
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    raw_output=True
)

print(response.raw_output.prompt_token_ids)  # Token IDs
print(response.raw_output.completion)        # Raw completion
Force generation to continue past the end-of-sequence token (useful for benchmarking):
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    ignore_eos=True,
    max_tokens=100  # Will always generate exactly 100 tokens
)
Output quality may degrade when ignoring EOS. This API is experimental and should not be relied upon for production use cases.
Modify token probabilities to encourage or discourage specific tokens:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    logit_bias={
        123: 10.0,   # Strongly encourage token ID 123
        456: -50.0   # Strongly discourage token ID 456
    }
)
Control perplexity dynamically using the Mirostat algorithm:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    mirostat_target=5.0,  # Target perplexity
    mirostat_lr=0.1       # Learning rate for adjustments
)

Understanding tokens

Language models process text in chunks called tokens. In English, a token can be as short as one character or as long as one word. Different model families use different tokenizers, so the same text may translate to different token counts depending on the model.
Why tokens matter:
  • Models have maximum context lengths measured in tokens
  • Pricing is based on token usage (prompt + completion)
  • Token count affects response time
For Llama models, use this tokenizer tool to estimate token counts. Actual usage is returned in the usage field of every API response.
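If you want a rough offline estimate, a tokenizer library can count tokens locally. The sketch below assumes the Hugging Face transformers package and a Llama tokenizer; both are assumptions for illustration, not part of the Fireworks API:
# Illustrative only: counting tokens locally with a Hugging Face tokenizer.
# The tokenizer ID is an assumption; pick the one matching your target model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
text = "Explain quantum computing in simple terms"
print(len(tokenizer.encode(text)))  # Approximate token count for this text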

OpenAI SDK Migration

Fireworks provides an OpenAI-compatible API, making migration straightforward. However, there are some minor differences to be aware of:

Behavioral differences

stop parameter:
  • Fireworks: Returns text including the stop word
  • OpenAI: Omits the stop word
  • You can easily truncate it client-side if needed (see the sketch below)
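For example, assuming the request was made with stop=["END"], a minimal client-side cleanup might be (the stop string is illustrative):
# Fireworks includes the stop sequence in the returned text; strip it if needed.
stop = "END"  # illustrative stop sequence
text = response.choices[0].message.content
if text.endswith(stop):
    text = text[: -len(stop)]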
max_tokens with context limits:
  • Fireworks: Automatically adjusts max_tokens lower if prompt + max_tokens exceeds the model’s context window
  • OpenAI: Returns an invalid request error
  • Control this behavior with the context_length_exceeded_behavior parameter (a sketch follows this list)
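A minimal sketch of setting this behavior: the parameter is Fireworks-specific, so it goes through the OpenAI SDK’s extra_body, and the value shown here ("error", requesting OpenAI-style rejection instead of automatic adjustment) is an assumption; check the API reference for the accepted values.
# context_length_exceeded_behavior is Fireworks-specific; the value "error" is an
# assumed option that rejects over-length requests instead of adjusting max_tokens.
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=4096,
    extra_body={"context_length_exceeded_behavior": "error"},
)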
Streaming usage stats:
  • Fireworks: Returns usage field in the final chunk (where finish_reason is set) for both streaming and non-streaming
  • OpenAI: Only returns usage for non-streaming responses
Example accessing streaming usage:
for chunk in client.chat.completions.create(stream=True, ...):
    if chunk.usage:  # Available in final chunk
        print(f"Tokens: {chunk.usage.total_tokens}")

Unsupported parameters

The following OpenAI parameters are not yet supported:
  • best_of (use n instead)
  • functions (deprecated - use Tool Calling with the tools parameter instead)
Have a use case requiring one of these? Join our Discord to discuss.

Next steps