NOW AVAILABLE IN PREVIEW

RL Rollouts at Frontier Speed & Scale

RL is ~80% inference. Run yours on the Fireworks Training Platform.

___________________________________________

In synchronous RL, rollouts consume 70–80% of wall-clock time. Inference, not the trainer, is the dominant lever on iteration speed and time to production.

Fireworks operates one of the fastest inference platforms in production — the same stack behind Cursor's Composer 2 — and exposes it as a rollout fleet you can plug any trainer into.


"Vercel’s v0 model is a composite model. The SOTA in this space changes every day, so you don’t want to tie yourself to a single model. Using a fine-tuned reinforcement learning model with Fireworks, we perform substantially better than SOTA. In our evaluation, Sonnet 3.5 compiled at 62%, and we got our error-free generation rate well into the 90s."

Malte Ubl | CTO at Vercel

"The rLLM team is dedicated to pushing the boundaries of autonomous AI, which means our time is best spent on innovation rather than managing backend clusters. The Fireworks Training SDK lets us focus on our research instead of wrestling with infrastructure. The platform is fast, well-optimized, and just works."

Kyle Montgomery & Sijun Tan | Core Contributors, rLLM

why did Cursor rollout Composer 2 with @FireworksAI_HQ?

"...because it's way more performant than the open source engines and is what we use in production. our rl inference scales elastically and globally because of it. when we have low prod traffic we scale up RL, when we have high prod traffic, we scale down RL."

Federico Cassano | AI Researcher at Cursor

Production-scale Inference Infra

RL Rollouts - At a Glance

  • Live in production today, powering Cursor's Composer 2
  • Full-parameter rollouts for 1T-parameter models on B200/B300
  • Bring any trainer: open-source or in-house
  • Same kernels in training and serving — checkpoint to live endpoint in seconds

Train where you want. Roll out on Fireworks.

Whether you want a fully managed workflow or a Tinker-compatible setup, Fireworks can power the rollout layer. Bring your trainer; Fireworks handles rollout inference, policy updates, and fleet orchestration.

Why Choose Fireworks for RL Rollouts

How It Works

  1. Train with your preferred stack.
  2. Publish checkpoints to shared object storage.
  3. Signal Fireworks when a new policy is ready.
  4. Fireworks hot-loads the policy across the rollout fleet.
  5. Continue sampling through standard OpenAI-compatible APIs. Multi-turn rollouts get KV cache reuse across turns via session affinity headers. Choose your sync vs. async tradeoff; staleness stays bounded and predictable.
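
As a rough illustration, the five steps can be sketched from the trainer's side as a simple synchronous loop. This is a minimal sketch, not the Fireworks SDK: `train_step`, `publish`, `signal`, and `sample` are hypothetical callables you would supply yourself, and an async setup would overlap sampling with training instead of alternating.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RolloutLoop:
    """Trainer-side view of the five steps. Every callable is a placeholder."""
    train_step: Callable[[int, List[str]], str]  # trains, returns a local checkpoint path
    publish: Callable[[str], str]                # uploads, returns an object-store URI
    signal: Callable[[str], None]                # tells the rollout fleet a policy is ready
    sample: Callable[[int], List[str]]           # draws rollouts through the inference API
    published: List[str] = field(default_factory=list)

    def run(self, num_steps: int, batch_size: int) -> None:
        rollouts = self.sample(batch_size)          # initial rollouts from the starting policy
        for step in range(num_steps):
            ckpt = self.train_step(step, rollouts)  # 1. train with your preferred stack
            uri = self.publish(ckpt)                # 2. publish checkpoint to object storage
            self.signal(uri)                        # 3. signal; 4. the fleet hot-loads it
            rollouts = self.sample(batch_size)      # 5. continue sampling
            self.published.append(uri)
```

Keeping the four operations behind plain callables means the same loop shape works whether the trainer is FSDP, Megatron, or something in-house.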

With RL Rollouts on Fireworks, your rollout fleet runs at production-inference speed from day one.

GET STARTED

Ready to run RL at scale?

Tell us about your trainer, model size, and current rollout latency, and we will size the speedup and help design the right architecture for your workload.

FAQ

Common Questions

Can I use my own RL trainer (e.g., PyTorch FSDP, Megatron, Ray) with Fireworks for rollouts?

Yes. Fireworks' Bring Your Own Trainer (BYOT) integration lets you run your own training loop while offloading large-scale inference to Fireworks. You create a hot-load deployment pointed at your external bucket (S3, MinIO, Nebius), upload checkpoints on your own cadence, signal Fireworks when a new snapshot is ready, and run rollouts via the standard OpenAI-compatible API. Fireworks handles the distributed weight swap, KV cache management, and inference serving. Note that this is currently an early access feature; contact Fireworks to enable it on your account.
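
For illustration, a rollout request against an OpenAI-compatible chat completions endpoint might be built as follows. This is a sketch under assumptions: the base URL and the session affinity header name are not stated on this page and should be confirmed against the linked docs, and the model identifier in the usage note is a placeholder.

```python
import json
import urllib.request

# Assumed values, not given in the text above. Verify both in the docs.
BASE_URL = "https://api.fireworks.ai/inference/v1"
AFFINITY_HEADER = "x-session-affinity"


def build_rollout_request(api_key, model, messages, session_id, max_tokens=512):
    """Build (but do not send) one chat completion request for a rollout turn."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        # Reusing one session ID across the turns of an episode is what lets
        # the server route them together for KV cache reuse.
        AFFINITY_HEADER: session_id,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. In a multi-turn rollout you would keep `session_id` fixed for the episode and append each new turn to `messages`.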

More info: https://docs.fireworks.ai/fine-tuning/rl-rollout-integration

Do I need to upload a full model checkpoint every training step?

No, and you shouldn't. The recommended approach is to upload a full HuggingFace-format checkpoint for the first step and then every 20th or 30th step after that. For all intermediate steps, publish an incremental snapshot using the ARC2 (arc_v2) format, which diffs against the currently loaded snapshot. Incremental snapshots significantly reduce both upload time and load time during training. If an incremental hot-load fails, fall back to a new full snapshot.
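
One way to encode that cadence on the trainer side is a small planner that decides, per step, which kind of snapshot to publish. This is a hypothetical helper, not part of any Fireworks SDK; the class and method names are made up for the example, and steps are assumed to be 0-indexed.

```python
from dataclasses import dataclass


@dataclass
class SnapshotPlanner:
    full_every: int = 20    # the text suggests every 20th or 30th step
    has_base: bool = False  # True once a full snapshot has loaded successfully

    def next_kind(self, step: int) -> str:
        """Return "full" or "incremental" for this 0-indexed training step."""
        if not self.has_base or step % self.full_every == 0:
            return "full"
        return "incremental"

    def record(self, kind: str, ok: bool) -> None:
        """Record the outcome of a hot-load so failures force a full snapshot."""
        if not ok:
            self.has_base = False  # fall back to a new full snapshot next time
        elif kind == "full":
            self.has_base = True
```

Call `next_kind(step)` before publishing and `record(kind, ok)` once the hot-load outcome is known. A failed load of either kind makes the next snapshot a full one.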

More info: https://docs.fireworks.ai/fine-tuning/rl-rollout-delta-checkpoints

What happens to in-flight inference requests when Fireworks swaps to a new checkpoint?

It depends on which transition mode is configured on your deployment. The default for RL is async transition: in-flight requests are paused during the weight swap and then resumed on the same HTTP connection with their KV state intact, so they continue streaming rather than restarting. New requests are queued until the swap completes, which may show up as elevated time-to-first-token. No 4xx/5xx errors are returned for the swap itself, though you can set the x-fireworks-hot-load-drain-timeout request header (default 90 seconds) to receive an HTTP 425 if the swap exceeds that window. The alternative is synchronous transition, where in-flight requests finish on the old weights before the swap begins, and new requests receive HTTP 425 until the swap is done.
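
On the client side, an HTTP 425 during a swap can be handled with a plain retry and backoff. The sketch below assumes a caller-supplied `send` function that performs the request and returns `(status, body)`. The header name comes from the text above, but the value format (`"90"` here) and the backoff schedule are assumptions.

```python
import time
from typing import Callable, Dict, Tuple

DRAIN_HEADER = "x-fireworks-hot-load-drain-timeout"  # header name from the text above


def send_with_swap_retry(
    send: Callable[[Dict[str, str]], Tuple[int, str]],
    drain_timeout: str = "90",  # value format is an assumption; check the docs
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Retry a request while the deployment answers HTTP 425 during a weight swap."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    headers = {DRAIN_HEADER: drain_timeout}
    for attempt in range(max_attempts):
        status, body = send(headers)
        if status != 425:
            return status, body  # success or an unrelated error: hand it back
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff between tries
    return status, body  # still 425 after the final attempt
```

Injecting `sleep` keeps the helper easy to unit test. In production you may also want to cap the delay and add jitter.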

More info: https://docs.fireworks.ai/fine-tuning/rl-rollout-debugging

How do I check what weights my deployment is currently serving, or recover from a stuck state?

Run firectl get ledger <deployment_id> to dump the full snapshot history. Each row shows the snapshot identity, whether it was a full or incremental load, per-replica readiness timestamps, and any load errors. If the deployment itself is unhealthy (e.g., OOM during merge, crash loop), run firectl deployment get <deployment_id> and check status and latestStatus.reason alongside the ledger. If the delta chain is wedged, you can reset the ledger entirely via a DELETE call to the ledger endpoint. This clears server-side history without deleting the deployment. After a reset, your next signal must be a full snapshot since there is nothing to diff against.
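
If you want to script these checks, the two commands can be wrapped in a small helper. This sketch only builds the argument lists using the commands exactly as written above. The function name is hypothetical, and no assumption is made about firectl's output format.

```python
import shlex
from typing import List


def diagnostic_commands(deployment_id: str) -> List[List[str]]:
    """Return argv lists for the ledger dump and the deployment status check."""
    return [
        ["firectl", "get", "ledger", deployment_id],
        ["firectl", "deployment", "get", deployment_id],
    ]


def as_shell(argv: List[str]) -> str:
    """Render one argv list as a safely quoted shell command line."""
    return shlex.join(argv)
```

Each list can be passed directly to `subprocess.run(argv, capture_output=True, text=True)` on a machine where firectl is installed and authenticated.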

More info: https://docs.fireworks.ai/fine-tuning/rl-rollout-debugging

How does prompt cache behave across checkpoint swaps in a multi-turn rollout session?

You control this per snapshot via the reset_prompt_cache field in the hot-load signal request. The current default (all) refills prompt cache broadly after the swap. Setting it to new_session preserves the cache namespace for existing multi-turn session IDs while new sessions refill. Setting it to none preserves prompt cache state entirely across the swap. This field only affects what can be reused after the swap; it does not interrupt the active turn in an in-flight request.
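
The three modes can be summarized in a few lines of code. This is a simplified reading of the description above, not the server's implementation. Only the field name `reset_prompt_cache` and its three values come from the text; the other fields in the signal request are not shown here, and the helper names are invented.

```python
from enum import Enum


class ResetPromptCache(str, Enum):
    ALL = "all"                  # current default: cache refills broadly after the swap
    NEW_SESSION = "new_session"  # existing session IDs keep their cache namespace
    NONE = "none"                # prompt cache state is preserved entirely


def cache_survives_swap(mode: ResetPromptCache, session_existed_before_swap: bool) -> bool:
    """Whether prompt cache built before the swap can still be reused after it."""
    if mode is ResetPromptCache.ALL:
        return False
    if mode is ResetPromptCache.NEW_SESSION:
        return session_existed_before_swap
    return True


def signal_fragment(mode: str) -> dict:
    """Validate a mode string and return the field to merge into the signal request."""
    return {"reset_prompt_cache": ResetPromptCache(mode).value}
```

Validating the value before sending catches typos such as `"new-session"` locally.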

More info: https://docs.fireworks.ai/fine-tuning/rl-rollout-debugging