New to deployments? Start with our On-Demand Quickstart to deploy and query your first model in minutes, then return here to learn about configuration options.
- Better performance – Lower latency, higher throughput, and predictable performance unaffected by other users
- No hard rate limits – Only limited by your deployment’s capacity
- Cost-effective at scale – Cheaper under high utilization. Unlike serverless models (billed per token), on-demand deployments are billed by GPU-second.
- Broader model selection – Access models not available on serverless
- Custom models – Upload your own models (for supported architectures) from Hugging Face or elsewhere
Creating & querying deployments
Create a deployment from a Fireworks model or from a custom model you have uploaded; a sketch of both cases follows. You can find your deployment name anytime with `firectl list deployments` and `firectl get deployment <DEPLOYMENT_ID>`.
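A minimal sketch of both cases (the Fireworks model name matches the example below; the custom model name `accounts/alice/models/my-custom-model` is hypothetical):

```bash
# Deploy a model from the Fireworks model library
firectl create deployment accounts/fireworks/models/mixtral-8x7b

# Deploy a custom model you previously uploaded (hypothetical name)
firectl create deployment accounts/alice/models/my-custom-model
```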
When querying, reference both the model and the deployment by their full names:
- Model: `accounts/fireworks/models/mixtral-8x7b`
- Deployment: `accounts/alice/deployments/12345678`
You can also use shorthand: `fireworks/mixtral-8x7b#alice/12345678`
Code examples
Query your deployment with Python, JavaScript, or curl via the OpenAI-compatible inference API.
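A minimal curl sketch (assumes a `FIREWORKS_API_KEY` environment variable and the model and deployment names from above; the `model#deployment` addressing follows the shorthand shown earlier):

```bash
curl https://api.fireworks.ai/inference/v1/completions \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "accounts/fireworks/models/mixtral-8x7b#accounts/alice/deployments/12345678",
    "prompt": "The sky is",
    "max_tokens": 16
  }'
```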
Deployment shapes
Deployment shapes are the primary way to configure deployments. They’re pre-configured templates optimized for speed, cost, or efficiency, covering hardware, quantization, and other performance factors:
- Fast – Low latency for interactive workloads
- Throughput – Optimized cost per token for high-volume workloads
- Minimal – Lowest cost for testing or light workloads
Need even better performance with tailored optimizations? Contact our team.
Managing & configuring deployments
Basic management
By default, deployments scale to zero if unused for 1 hour. Deployments with min replicas set to 0 are automatically deleted after 7 days of no traffic.
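Typical lifecycle commands (a sketch; `list` and `get` appear earlier on this page, and `delete` is assumed to follow the same command shape):

```bash
# Inspect existing deployments
firectl list deployments
firectl get deployment <DEPLOYMENT_ID>

# Tear down a deployment you no longer need
firectl delete deployment <DEPLOYMENT_ID>
```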
GPU hardware
Choose the GPU type with `--accelerator-type`:
- NVIDIA_A100_80GB
- NVIDIA_H100_80GB
- NVIDIA_H200_141GB
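For example, a sketch reusing the deployment from earlier with the flag named above:

```bash
# Deploy on H100s instead of the default accelerator
firectl create deployment accounts/fireworks/models/mixtral-8x7b \
  --accelerator-type NVIDIA_H100_80GB
```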
Autoscaling
Control replica counts, scale timing, and load targets for your deployment. See the Autoscaling guide for configuration options.
Multiple GPUs per replica
Use multiple GPUs per replica to improve latency and throughput, as in the sketch below.
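A minimal sketch (the `--accelerator-count` flag name is an assumption following the `--accelerator-type` convention above; verify with `firectl create deployment --help`):

```bash
# One replica spread across two H100s
firectl create deployment accounts/fireworks/models/mixtral-8x7b \
  --accelerator-count 2 \
  --accelerator-type NVIDIA_H100_80GB
```
Advanced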
- Speculative decoding - Speed up text generation using draft models or n-gram speculation
- Quantization - Reduce model precision (e.g., FP16 to FP8) to improve speed and reduce costs by 30-50%
- Managing default deployments - Control which deployment handles queries when using just the model name
- Publishing deployments - Make your deployment accessible to other Fireworks users
Next steps
- Autoscaling – Configure autoscaling for optimal cost and performance
- Upload custom models – Deploy your own models from Hugging Face
- Quantization – Reduce costs with model quantization
- Regions – Choose deployment regions for optimal latency
- Reserved capacity – Purchase reserved GPUs for guaranteed capacity
- Fine-tuning – Fine-tune models for your specific use case