New to deployments? Start with our On-Demand Quickstart to deploy and query your first model in minutes, then return here to learn about configuration options.
On-demand deployments give you dedicated GPUs for your models, providing several advantages over serverless:
  • Better performance – Lower latency, higher throughput, and predictable performance unaffected by other users
  • No hard rate limits – Only limited by your deployment’s capacity
  • Cost-effective at scale – Cheaper under high utilization. Unlike serverless models (billed per token), on-demand deployments are billed by GPU-second.
  • Broader model selection – Access models not available on serverless
  • Custom models – Upload your own models (for supported architectures) from Hugging Face or elsewhere
Need higher GPU quotas or want to reserve capacity? Contact us.

Creating & querying deployments

Create a deployment:
# This command returns your DEPLOYMENT_NAME - save it for querying
firectl create deployment accounts/fireworks/models/<MODEL_NAME> --wait
See Deployment shapes below to optimize for speed, throughput, or cost.

Query your deployment

After creating a deployment, query it using this format:
<MODEL_NAME>#<DEPLOYMENT_NAME>
You can find your deployment name anytime with firectl list deployments and firectl get deployment <DEPLOYMENT_ID>.
Example (Fireworks model):
accounts/fireworks/models/mixtral-8x7b#accounts/alice/deployments/12345678
  • Model: accounts/fireworks/models/mixtral-8x7b
  • Deployment: accounts/alice/deployments/12345678
A custom model is queried the same way, with the model path under your own account (for example, accounts/alice/models/<MODEL_NAME>).
You can also use shorthand: fireworks/mixtral-8x7b#alice/12345678

Code examples

Python:
import os
from openai import OpenAI

# Fireworks serves an OpenAI-compatible API, so the OpenAI SDK works
# once it is pointed at the Fireworks base URL.
client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

# Target your dedicated deployment using the <MODEL_NAME>#<DEPLOYMENT_NAME> format.
response = client.chat.completions.create(
    model="accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}]
)

print(response.choices[0].message.content)
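
curl:

The same request over raw HTTP, as a minimal sketch assuming the standard OpenAI-compatible /chat/completions path on the base URL shown above:

# Substitute your own deployment name before running
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}]
  }'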

Deployment shapes

Deployment shapes are the primary way to configure deployments. They’re pre-configured templates that bundle hardware, quantization, and other performance settings, each tuned for speed, throughput, or cost.
  • Fast – Low latency for interactive workloads
  • Throughput – Best cost per token at scale for high-volume workloads
  • Minimal – Lowest cost for testing or light workloads
Usage:
# List available shapes
firectl list deployment-shape-versions --base-model <model-id>

# Create with a shape (shorthand)
firectl create deployment accounts/fireworks/models/deepseek-v3 --deployment-shape throughput

# Create with full shape ID
firectl create deployment accounts/fireworks/models/llama-v3p3-70b-instruct \
  --deployment-shape accounts/fireworks/deploymentShapes/llama-v3p3-70b-instruct-fast

# View shape details
firectl get deployment-shape-version <full-deployment-shape-version-id>
Need even better performance with tailored optimizations? Contact our team.

Managing & configuring deployments

Basic management

# List all deployments
firectl list deployments

# Check deployment status
firectl get deployment <DEPLOYMENT_ID>

# Delete a deployment
firectl delete deployment <DEPLOYMENT_ID>
By default, deployments scale to zero if unused for 1 hour. Deployments with min replicas set to 0 are automatically deleted after 7 days of no traffic.
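To keep a deployment warm (and exempt from auto-deletion), raise its minimum replica count. A sketch, assuming firectl update deployment accepts a --min-replica-count flag (see the Autoscaling guide for the authoritative options):

# Keep at least one replica running so the deployment never scales to zero
firectl update deployment <DEPLOYMENT_ID> --min-replica-count 1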

GPU hardware

Choose GPU type with --accelerator-type:
  • NVIDIA_A100_80GB
  • NVIDIA_H100_80GB
  • NVIDIA_H200_141GB
GPU availability varies by region. See Hardware selection guide→
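For example, to run a deployment on H100s:

firectl create deployment accounts/fireworks/models/<MODEL_NAME> \
  --accelerator-type NVIDIA_H100_80GB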

Autoscaling

Control replica counts, scale timing, and load targets for your deployment. See the Autoscaling guide for configuration options.
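As an illustrative sketch of replica bounds (flag names are assumptions here; the Autoscaling guide is authoritative):

# Let the deployment scale between 1 and 4 replicas based on load
firectl create deployment accounts/fireworks/models/<MODEL_NAME> \
  --min-replica-count 1 \
  --max-replica-count 4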

Multiple GPUs per replica

Use multiple GPUs to improve latency and throughput:
firectl create deployment accounts/fireworks/models/<MODEL_NAME> --accelerator-count 2
More GPUs generally means faster generation, though scaling is sub-linear (2x GPUs ≠ 2x performance).

Advanced

Next steps
