Fireworks Inference

The highest performance inference for your specialized intelligence.

Serve frontier open models, or your own post-trained versions of them, on an engine optimized at every layer.

Run a model

Talk to our team

NVIDIA GTC 2026, Jensen Huang

The infrastructure layer for continuously improving specialized intelligence.

Every layer optimized for production

Best performing inference stack

A fully disaggregated inference engine optimized from custom kernels to memory management. Deliver up to 4x higher throughput and industry-leading latency without sacrificing model quality.

Built for any scale

From solo developers to Cursor-scale workloads. Long prompts, agent loops, and multi-turn conversations run fast and reliably, even as traffic, context length, and concurrency grow.

The latest models

Access to hundreds of models and the latest open-source frontier releases as soon as they become available.

Three ways to run

Pick how you want to serve

Run a model

Talk to our team

Serverless

Call a model with one request, pay per token. OpenAI and Anthropic compatible, so you can migrate by swapping a URL.

On-Demand

Dedicated deployments for serving post-trained specialized models with support for multi-region deployments and custom performance optimizations.

Reserved Capacity

Get guaranteed capacity, higher quotas, early access to the newest regions and hardware, and more. Supports increased deployment flexibility with BYOC.

When performance is the product

A disaggregated engine, optimized end to end

Most providers run a general-purpose serving stack and tune at the edges. We built it from the ground up, optimized every layer we control, from GPU memory layout to the runtime, and disaggregated every stage so each one scales on its own.

Read the docs

Disaggregated prefill and decode

Prefill and decode have different hardware profiles. We split them onto separate pools so each scales independently, cutting latency without stranding GPUs.

Optimizations from top to bottom

Inference optimizations including custom kernels, precision and quantization innovations, speculative decoding, and advanced caching strategies.

KV cache and routing for long context

Disaggregated KV caching and prompt-aware routing keep long prompts and multi-turn sessions fast.

Multi-node expert parallelism

Frontier MoE models served across nodes with composable parallelism, so trillion-parameter models run at production speed.

Day zero, every modality

Run the newest open models the day they launch

As an official launch partner for every leading model provider, we ship new models with day-zero support and model-specific optimizations. No waiting months for fast serving.

Browse the model library

Serve the intelligence you build

The model you train on Fireworks serves on the same stack, with the same kernels and quantization. What you train is what you serve, with no handoff and no migration. Inference and training feed each other in one loop.

Explore Training

Leading AI companies serve on Fireworks

"Fireworks' Multi-LoRA capabilities align with Cresta's strategy to deploy custom AI through fine-tuning cutting-edge base models. It helps unleash the potential of AI on private enterprise data."

Tim Shi | Co-Founder at Cresta

“Using Fireworks AI on Foundry, we can run repeatable, high-volume evaluations through a single Azure endpoint, which helps our team move faster from deployment to informed model decisions with more confidence.”

Hanbin Jung | Partnership Lead at Motif

why did Cursor rollout Composer 2 with @FireworksAI_HQ?

"...because it's way more performant than the open source engines and is what we use in production. our rl inference scales elastically and globally because of it. when we have low prod traffic we scale up RL, when we have high prod traffic, we scale down RL."

Federico Cassano | AI Researcher at Cursor

"Vercel’s v0 model is a composite model. The SOTA in this space changes every day, so you don’t want to tie yourself to a single model. Using a fine-tuned reinforcement learning model with Fireworks, we perform substantially better than SOTA."

Malte Ubl | CTO at Vercel

"By partnering with Fireworks to fine-tune models, we reduced latency from about 2 seconds to 350 milliseconds, significantly improving performance and enabling us to launch AI features at scale. That improvement is a game changer for delivering reliable, enterprise-scale AI."

Sarah Sachs | AI Lead at Notion

"Fireworks enabled us to own our AI journey, and unlock better quality in just four weeks."

Kay Zhu | CTO at Genspark

"We've had a really great experience working with Fireworks to host open source models, including SDXL, Llama, and Mistral. After migrating one of our models, we noticed a 3x speedup in response time, which made our app feel much more responsive and boosted our engagement metrics."

Spencer Chan | Product Lead at Quora

"Fireworks has been a fantastic partner in building AI dev tools at Sourcegraph. Their fast, reliable model inference lets us focus on fine-tuning, AI-powered code search, and deep code context, making Cody the best AI coding assistant. They are responsive and ship at an amazing pace."

Beyang Liu | CTO at Sourcegraph

By running Fireworks AI on Azure Foundry, UiPath powers both Autopilot and Delegate with open models that are significantly faster and more cost-efficient for Computer Use, all while matching the quality of Claude's Sonnet 4.6. It's a step-change in how we deliver AI at scale to our customers.

Mircea Neagovici-Negoescu | SVP, Head of AI at UiPath

"Fireworks has been a key partner in helping us train and serve the models behind Cursor at scale. Their platform supports the high-throughput RL workloads and production inference required for Composer, giving us the speed, reliability, and efficiency to keep pushing the frontier of AI coding."

Sualeh Asif | CPO at Cursor

"Fireworks enabled us to own our AI journey, and unlock better quality in just four weeks. This resulted in a better user experience for our customers."

Kay Zhu | CTO at Genspark

"The rLLM team is dedicated to pushing the boundaries of autonomous AI, which means our time is best spent on innovation rather than managing backend clusters. The Fireworks Training SDK lets us focus on our research instead of wrestling with infrastructure. The platform is fast, well-optimized, and just works."

Kyle Montgomery & Sijun Tan | Core Contributors, rLLM at rLLM

"Fireworks' Multi-LoRA capabilities align with Cresta's strategy to deploy custom AI through fine-tuning cutting-edge base models. It helps unleash the potential of AI on private enterprise data."

Tim Shi | Co-Founder at Cresta

“Using Fireworks AI on Foundry, we can run repeatable, high-volume evaluations through a single Azure endpoint, which helps our team move faster from deployment to informed model decisions with more confidence.”

Hanbin Jung | Partnership Lead at Motif

why did Cursor rollout Composer 2 with @FireworksAI_HQ?

"...because it's way more performant than the open source engines and is what we use in production. our rl inference scales elastically and globally because of it. when we have low prod traffic we scale up RL, when we have high prod traffic, we scale down RL."

Federico Cassano | AI Researcher at Cursor

"Vercel’s v0 model is a composite model. The SOTA in this space changes every day, so you don’t want to tie yourself to a single model. Using a fine-tuned reinforcement learning model with Fireworks, we perform substantially better than SOTA."

Malte Ubl | CTO at Vercel

FAQ

Is Fireworks OpenAI and Anthropic compatible?

Yes. Fireworks is compatible with both, making it easy to migrate applications that use OpenAI or Anthropic’s Messages API.

How fast do new models become available?

As an official launch partner for all major model providers, we often provide day zero support for major launches.

What models and modalities are supported?

Qwen, Kimi, DeepSeek, MiniMax, Nemotron, GLM, Llama, and many more across text and vision language models. The full list of 250+ models is in the model library.

How do you handle uptime and security?

Production-grade reliability with autoscaling across regions. SOC 2 Type II, HIPAA-ready, and GDPR-compliant. We never use your inference inputs or outputs for any other purpose. Full security posture is in the docs.

Serve your model on Fireworks

Start serverless in minutes, or talk to our team about on-demand and reserved capacity.

Run a model

Talk to our team

The highest performance inference for your specialized intelligence.

The infrastructure layer for continuously improving specialized intelligence.

Every layer optimized for production

Best performing inference stack

Built for any scale

The latest models

Pick how you want to serve

Serverless

On-Demand

Reserved Capacity

A disaggregated engine, optimized end to end

Disaggregated prefill and decode

Optimizations from top to bottom

KV cache and routing for long context

Multi-node expert parallelism

Run the newest open models the day they launch

Deepseek v3.2

GLM 5.2

Kimi K2.7 Code

Minimax M3

Qwen3.7 Plus

DeepSeek-V4-Pro

DeepSeek-V4-Flash

Kimi K2.6

GLM 5.1

Gemma 4 31B IT NVFP4

Gemma 4 26B A4B IT

Qwen3.6 Plus

MiniMax M2.7

OpenAI gpt-oss-20b

FLUX.1 Kontext Pro

Whisper V3 Large

Deepseek R1 05/28

Kimi K2.5

Deepseek v3.2

GLM 5.2

Kimi K2.7 Code

Minimax M3

Serve the intelligence you build

Leading AI companies serve on Fireworks

FAQ

Serve your model on Fireworks