
Pricing to seamlessly scale from idea to enterprise

Start building in seconds, self-serve. Contact us for enterprise deployments with faster speeds, lower costs, and higher rate limits.

Serverless Pricing

Pay per token, with high rate limits and postpaid billing. Get started with $1 in free credits!

Text and Vision

| Base model | $ / 1M tokens |
|---|---|
| Less than 4B parameters | $0.10 |
| 4B - 16B parameters | $0.20 |
| More than 16B parameters | $0.90 |
| MoE 0B - 56B parameters (e.g. Mixtral 8x7B) | $0.50 |
| MoE 56.1B - 176B parameters (e.g. DBRX, Mixtral 8x22B) | $1.20 |
| DeepSeek V3 | $0.90 |
| DeepSeek R1 (Fast) | $3.00 input, $8.00 output |
| DeepSeek R1 0528 (Fast) | $3.00 input, $8.00 output |
| GLM-4.5 and DeepSeek R1 (Basic) | $0.55 input, $2.19 output |
| Meta Llama 3.1 405B | $3.00 |
| Meta Llama 4 Maverick (Basic) | $0.22 input, $0.88 output |
| Meta Llama 4 Scout (Basic) | $0.15 input, $0.60 output |
| Qwen3 235B Family and GLM-4.5 Air | $0.22 input, $0.88 output |
| Qwen3 30B and Qwen Coder Flash | $0.15 input, $0.60 output |
| Kimi K2 Instruct | $0.60 input, $2.50 output |
| Qwen3 Coder 480B | $0.45 input, $1.80 output |
| OpenAI gpt-oss-120b | $0.15 input, $0.60 output |
| OpenAI gpt-oss-20b | $0.07 input, $0.30 output |

Discounts for prompt caching are available for enterprise deployments. Contact us to learn more.

Batch inference is priced at 50% of our serverless pricing for both input and output tokens. Learn more here.
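
As a rough illustration of how the per-token rates above combine with the 50% batch discount, here is a minimal Python sketch. The rates are copied from the table; the model keys and token counts are hypothetical examples, not API identifiers.

```python
# Illustrative serverless cost estimate. Rates come from the pricing table
# above; the model keys and token counts are hypothetical, not API names.

PRICES_PER_1M = {
    # model: (input $ / 1M tokens, output $ / 1M tokens)
    "deepseek-r1-fast": (3.00, 8.00),
    "kimi-k2-instruct": (0.60, 2.50),
}

def serverless_cost(model: str, input_tokens: int, output_tokens: int,
                    batch: bool = False) -> float:
    """Dollar cost of one request, or of a batch job at 50% of serverless."""
    in_rate, out_rate = PRICES_PER_1M[model]
    cost = input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate
    return cost * 0.5 if batch else cost

# Example: 200K input + 50K output tokens on DeepSeek R1 (Fast).
print(f"${serverless_cost('deepseek-r1-fast', 200_000, 50_000):.2f}")              # $1.00
print(f"${serverless_cost('deepseek-r1-fast', 200_000, 50_000, batch=True):.2f}")  # $0.50
```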

Speech to Text (STT)

Pay per second of audio input

| Model | $ / audio minute (billed per second) |
|---|---|
| Whisper-v3-large | $0.0015 |
| Whisper-v3-large-turbo | $0.0009 |
| Streaming ASR v1 | $0.0032 |
| Streaming ASR v2 | $0.0035 |
  • Diarization adds a 40% surcharge to pricing
  • Batch API prices are reduced by 40%
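
Concretely, the per-minute rates are billed per second, with the diarization surcharge and batch discount applied as multipliers. A minimal sketch (the model keys here are illustrative, not API identifiers):

```python
# STT billing arithmetic: rates are quoted per audio minute but billed per
# second; surcharge/discount values come from the notes above.

RATE_PER_MINUTE = {"whisper-v3-large": 0.0015, "streaming-asr-v2": 0.0035}

def stt_cost(model: str, audio_seconds: float,
             diarization: bool = False, batch: bool = False) -> float:
    cost = RATE_PER_MINUTE[model] / 60 * audio_seconds  # billed per second
    if diarization:
        cost *= 1.40  # diarization adds a 40% surcharge
    if batch:
        cost *= 0.60  # Batch API prices are reduced by 40%
    return cost

# Example: a 10-minute file on Whisper-v3-large with diarization.
print(f"${stt_cost('whisper-v3-large', 600, diarization=True):.4f}")  # $0.0210
```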

Image Generation

| Image model name | $ / step | Approx $ / image |
|---|---|---|
| All Non-Flux Models (SDXL, Playground, etc.) | $0.00013 per step ($0.0039 per 30-step image) | $0.0002 per step ($0.006 per 30-step image) |
| FLUX.1 [dev] | $0.0005 per step ($0.014 per 28-step image) | N/A on serverless |
| FLUX.1 [schnell] | $0.00035 per step ($0.0014 per 4-step image) | N/A on serverless |
| FLUX.1 Kontext Pro | $0.04 per image | N/A |
| FLUX.1 Kontext Max | $0.08 per image | N/A |

All models besides the Flux Kontext models are charged by the number of inference steps (denoising iterations). The Flux Kontext models are charged a flat rate per generated image.
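
The two billing modes reduce to a couple of lines of arithmetic. A sketch with rates from the table (the model keys are illustrative, not API identifiers):

```python
# Per-step vs. flat-rate image billing, using rates from the table above.

PER_STEP = {"sdxl": 0.00013, "flux-1-dev": 0.0005, "flux-1-schnell": 0.00035}
PER_IMAGE = {"flux-kontext-pro": 0.04, "flux-kontext-max": 0.08}  # flat rate

def image_cost(model: str, steps: int = 30, images: int = 1) -> float:
    if model in PER_IMAGE:
        return PER_IMAGE[model] * images          # Kontext: flat per image
    return PER_STEP[model] * steps * images       # others: per inference step

print(f"${image_cost('flux-1-dev', steps=28):.4f}")        # $0.0140
print(f"${image_cost('flux-kontext-max', images=5):.2f}")  # $0.40
```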

Embeddings

| Base model parameter count | $ / 1M input tokens |
|---|---|
| Up to 150M | $0.008 |
| 150M - 350M | $0.016 |

Fine-Tuning Pricing

Pay per training token. Inference for fine-tuned models costs the same as the base models.

Fine-Tuning

| Base model | $ / 1M training tokens |
|---|---|
| Models up to 16B parameters | $0.50 |
| Models 16.1B - 80B parameters | $3.00 |
| Models 80B - 300B parameters (e.g. Qwen3-235B, gpt-oss-120B) | $6.00 |
| Models >300B parameters (e.g. DeepSeek V3, Kimi K2) | $10.00 |

Fine-tuning with images (VLM supervised fine-tuning) is billed per token as well. See this FAQ on calculating image tokens.

There is no additional cost for storing LoRA fine-tunes up to the quota for an account.
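
For a back-of-envelope estimate, here is a sketch that multiplies dataset tokens by epochs to get training tokens; that formula is an assumption for illustration, since this page does not specify how billed training tokens are counted.

```python
# Rough fine-tuning cost estimate. Rates come from the table above; the
# dataset_tokens * epochs formula is an assumption, not a billing rule.

RATE_PER_1M_TRAINING_TOKENS = {
    "up_to_16b": 0.50,
    "16.1b_to_80b": 3.00,
    "80b_to_300b": 6.00,
    "over_300b": 10.00,
}

def finetune_cost(tier: str, dataset_tokens: int, epochs: int = 1) -> float:
    training_tokens = dataset_tokens * epochs  # assumed token accounting
    return training_tokens / 1e6 * RATE_PER_1M_TRAINING_TOKENS[tier]

# Example: a 50M-token dataset for 3 epochs on a model in the 80B-300B tier.
print(f"${finetune_cost('80b_to_300b', 50_000_000, epochs=3):,.2f}")  # $900.00
```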

On-Demand Pricing

Pay per GPU-second, with no extra charge for start-up time

On-demand deployments

| GPU type | $ / hour (billed per second) |
|---|---|
| A100 80 GB GPU | $2.90 |
| H100 80 GB GPU | $4.00 |
| H200 141 GB GPU | $6.00 |
| B200 180 GB GPU | $9.00 |
| AMD MI300X | $4.99 |
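
Since the hourly rates are billed per second, a deployment's cost is simply the per-second rate times GPU-seconds used. A minimal sketch with the rates above (the GPU keys are illustrative, not API identifiers):

```python
# Per-second GPU billing: hourly rates from the table, no start-up charge.

HOURLY_RATE = {"a100-80gb": 2.90, "h100-80gb": 4.00,
               "h200-141gb": 6.00, "b200-180gb": 9.00, "mi300x": 4.99}

def deployment_cost(gpu: str, seconds: float, num_gpus: int = 1) -> float:
    return HOURLY_RATE[gpu] / 3600 * seconds * num_gpus

# Example: two H100s running for 45 minutes.
print(f"${deployment_cost('h100-80gb', 45 * 60, num_gpus=2):.2f}")  # $6.00
```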

For estimates of per-token prices, see this blog. Results vary by use case, but we often observe improvements such as ~250% higher throughput and ~50% faster speeds on Fireworks compared to open-source inference engines.