New to deployments? Start with our On-Demand Quickstart to deploy and query your first model in minutes, then return here to learn about configuration options.
On-demand deployments give you dedicated GPUs for your models, providing several advantages over serverless:
  • Better performance – Lower latency, higher throughput, and predictable performance unaffected by other users
  • No hard rate limits – Only limited by your deployment’s capacity
  • Cost-effective at scale – Cheaper under high utilization. Unlike serverless models (billed per token), on-demand deployments are billed by GPU-second.
  • Broader model selection – Access models not available on serverless
  • Custom models – Upload your own models (for supported architectures) from Hugging Face or elsewhere
Need higher GPU quotas or want to reserve capacity? Contact us.

Creating & querying deployments

Create a deployment:
# This command returns your DEPLOYMENT_NAME - save it for querying
firectl create deployment accounts/fireworks/models/<MODEL_NAME> --wait
See Deployment shapes below to optimize for speed, throughput, or cost.

Query your deployment

After creating a deployment, query it using this format:
<MODEL_NAME>#<DEPLOYMENT_NAME>
You can find your deployment name anytime with firectl list deployments and firectl get deployment <DEPLOYMENT_ID>.
Example (Fireworks model):
accounts/fireworks/models/mixtral-8x7b#accounts/alice/deployments/12345678
  • Model: accounts/fireworks/models/mixtral-8x7b
  • Deployment: accounts/alice/deployments/12345678
A custom model is queried the same way, with the model path under your own account (for example, accounts/alice/models/<MODEL_NAME>).
You can also use shorthand: fireworks/mixtral-8x7b#alice/12345678

Code examples

Python:
import os
from openai import OpenAI

# Fireworks serves an OpenAI-compatible API, so the OpenAI SDK works
# once it is pointed at the Fireworks base URL.
client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

# Target your dedicated deployment using the <MODEL_NAME>#<DEPLOYMENT_NAME> format.
response = client.chat.completions.create(
    model="accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}]
)

print(response.choices[0].message.content)
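
curl:

The same request over raw HTTP, as a minimal sketch assuming the standard OpenAI-compatible /chat/completions path on the base URL shown above:

# Substitute your own deployment name before running
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}]
  }'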

Deployment shapes

Deployment shapes are the primary way to configure deployments. They’re pre-configured templates that bundle hardware, quantization, and other performance settings, each tuned for speed, throughput, or cost.
  • Fast – Low latency for interactive workloads
  • Throughput – Best cost per token at scale for high-volume workloads
  • Minimal – Lowest cost for testing or light workloads
Usage:
# List available shapes
firectl list deployment-shape-versions --base-model <model-id>

# Create with a shape (shorthand)
firectl create deployment accounts/fireworks/models/deepseek-v3 --deployment-shape throughput

# Create with full shape ID
firectl create deployment accounts/fireworks/models/llama-v3p3-70b-instruct \
  --deployment-shape accounts/fireworks/deploymentShapes/llama-v3p3-70b-instruct-fast

# View shape details
firectl get deployment-shape-version <full-deployment-shape-version-id>
Need even better performance with tailored optimizations? Contact our team.

Managing & configuring deployments

Basic management

# List all deployments
firectl list deployments

# Check deployment status
firectl get deployment <DEPLOYMENT_ID>

# Delete a deployment
firectl delete deployment <DEPLOYMENT_ID>
By default, deployments scale to zero if unused for 1 hour. Deployments with min replicas set to 0 are automatically deleted after 7 days of no traffic.
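To keep a deployment warm (and exempt from auto-deletion), raise its minimum replica count. A sketch, assuming firectl update deployment accepts a --min-replica-count flag (see the Autoscaling guide for the authoritative options):

# Keep at least one replica running so the deployment never scales to zero
firectl update deployment <DEPLOYMENT_ID> --min-replica-count 1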

GPU hardware

Choose GPU type with --accelerator-type:
  • NVIDIA_A100_80GB
  • NVIDIA_H100_80GB
  • NVIDIA_H200_141GB
GPU availability varies by region. See Hardware selection guide→
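For example, to run a deployment on H100s:

firectl create deployment accounts/fireworks/models/<MODEL_NAME> \
  --accelerator-type NVIDIA_H100_80GB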

Autoscaling

Control replica counts, scale timing, and load targets for your deployment. See the Autoscaling guide for configuration options.
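As an illustrative sketch of replica bounds (flag names are assumptions here; the Autoscaling guide is authoritative):

# Let the deployment scale between 1 and 4 replicas based on load
firectl create deployment accounts/fireworks/models/<MODEL_NAME> \
  --min-replica-count 1 \
  --max-replica-count 4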

Multiple GPUs per replica

Use multiple GPUs to improve latency and throughput:
firectl create deployment accounts/fireworks/models/<MODEL_NAME> --accelerator-count 2
More GPUs generally means faster generation, though scaling is sub-linear (2x GPUs ≠ 2x performance).

Advanced

Next steps
