NOW AVAILABLE IN PREVIEW

RL Rollouts at Frontier Speed & Scale

RL is ~80% inference. Run yours on the Fireworks Training Platform.

___________________________________________

In synchronous RL, rollouts consume 70–80% of wall-clock time. Inference, not the trainer, is the dominant lever on iteration speed and time to production.

Fireworks operates one of the fastest inference platforms in production — the same stack behind Cursor's Composer 2 — and exposes it as a rollout fleet you can plug any trainer into.

Contact us to Get Started

More Details in the Docs

"Vercel’s v0 model is a composite model. The SOTA in this space changes every day, so you don’t want to tie yourself to a single model. Using a fine-tuned reinforcement learning model with Fireworks, we perform substantially better than SOTA. In our evaluation, Sonnet 3.5 compiled at 62%, and we got our error-free generation rate well into the 90s."

Malte Ubl | CTO at Vercel

"The rLLM team is dedicated to pushing the boundaries of autonomous AI, which means our time is best spent on innovation rather than managing backend clusters. The Fireworks Training SDK lets us focus on our research instead of wrestling with infrastructure. The platform is fast, well-optimized, and just works."

Kyle Montgomery & Sijun Tan | Core Contributors at rLLM

why did Cursor rollout Composer 2 with @FireworksAI_HQ?

"...because it's way more performant than the open source engines and is what we use in production. our rl inference scales elastically and globally because of it. when we have low prod traffic we scale up RL, when we have high prod traffic, we scale down RL."

Federico Cassano | AI Researcher at Cursor

Production-scale Inference Infra

RL Rollouts - At a Glance

• Live in production today, powering Cursor's Composer 2
• 1T-parameter full-parameter rollouts on B200/B300
• Bring any trainer: open-source or in-house
• Same kernels in training and serving; checkpoint to live endpoint in seconds

Get started

Train where you want. Roll out on Fireworks.

Whether you want a fully managed workflow or a Tinker-compatible setup, Fireworks can power the rollout layer. Bring your trainer; Fireworks handles rollout inference, policy updates, and fleet orchestration.

Why Choose Fireworks for RL Rollouts

Compress the 80% that dominates your RL loop

Fireworks runs production inference for Cursor's Composer 2 and is benchmarked by Artificial Analysis as the fastest provider on the leading open-weight models. FireAttention, our custom inference engine, brings throughput and latency optimizations, speculative decoding, FP8/MXFP8, and adaptive caching to your rollout fleet. Teams moving rollouts to Fireworks consistently report large end-to-end speedups versus the inference engines bundled with most training stacks.

Train-inference numerical parity (requires training + RL rollouts on Fireworks)

Training and inference run on the same kernels and the same hardware. We publish K3 KL divergence between training and inference checkpoints for every model in the catalog, and FP8 QAT mirrors the inference engine's math exactly. A trained checkpoint becomes a live endpoint in seconds — no format conversion, no stack migration. The numerical edge cases that surface in frontier MoE models aren't hypothetical to us; we debug them in production.
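As a rough illustration of the parity metric, the sketch below computes a K3-style KL estimate from per-token log-probabilities. It assumes "K3" refers to the commonly used k3 estimator, (r - 1) - log r with r = p_train / p_infer evaluated on tokens sampled by the inference engine; the function name and inputs are hypothetical and not part of any Fireworks API.

```python
import math
from typing import Sequence


def k3_kl_estimate(infer_logprobs: Sequence[float],
                   train_logprobs: Sequence[float]) -> float:
    """Mean k3 estimate of KL(inference || trainer) over sampled tokens.

    Both sequences hold log-probabilities of the SAME sampled tokens,
    scored once by the inference engine and once by the trainer.
    """
    if len(infer_logprobs) != len(train_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not infer_logprobs:
        raise ValueError("need at least one token")
    total = 0.0
    for lq, lp in zip(infer_logprobs, train_logprobs):
        log_ratio = lp - lq  # log r, where r = p_train / p_infer
        # k3 = (r - 1) - log r, which is non-negative for every token
        total += math.expm1(log_ratio) - log_ratio
    return total / len(infer_logprobs)
```

A value near zero suggests the two stacks assign nearly identical probabilities to the sampled tokens; larger values point to numerical drift worth investigating.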

Keep policies fresh without stop-the-world reloads

Lossless delta compression cuts policy-update bandwidth by more than 20x. Cross-region policy refresh becomes practical: end-to-end policy distribution across a globally distributed fleet runs in a few minutes, with the GPU-memory weight swap itself taking under a minute. Async RL stays within a bounded, predictable freshness window.
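To make the idea of a lossless delta concrete, here is a toy sketch that XORs two equal-length weight buffers and compresses the result with zlib. This is only an illustration of the general technique, not the format Fireworks uses; the function names are hypothetical.

```python
import zlib


def encode_delta(base: bytes, new: bytes) -> bytes:
    """Encode `new` as a compressed XOR diff against `base` (toy example)."""
    if len(base) != len(new):
        raise ValueError("buffers must have the same length")
    # Unchanged bytes XOR to zero, so a small update compresses very well.
    xored = bytes(a ^ b for a, b in zip(base, new))
    return zlib.compress(xored)


def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Reconstruct the new buffer exactly from `base` and the delta."""
    xored = zlib.decompress(delta)
    if len(xored) != len(base):
        raise ValueError("delta does not match the base buffer length")
    return bytes(a ^ b for a, b in zip(base, xored))
```

Because XOR is its own inverse, applying the delta to the base recovers the new buffer bit for bit, which is what "lossless" means here.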

Scale rollouts across distributed capacity

RL rollouts need throughput across many parallel requests. Fireworks turns capacity across regions into a single rollout fleet. Because training and inference share one platform, you can elastically allocate capacity to training during off-peak inference hours.

No trainer lock-in

Fireworks plugs in as the rollout layer without forcing a trainer migration. Keep what works. Swap when you want. No algorithmic opinions baked in.

Predictable per-GPU-hour pricing

RL workloads are spiky and rollout-heavy. Per-GPU-hour pricing keeps cost predictable across long training runs — no per-token surprises.

How It Works

1. Train with your preferred stack.
2. Publish checkpoints to shared object storage.
3. Signal Fireworks when a new policy is ready.
4. Fireworks hot-loads the policy across the rollout fleet.
5. Continue sampling through standard OpenAI-compatible APIs.

Multi-turn rollouts get KV cache reuse across turns via session affinity headers. Choose your sync vs. async tradeoff; staleness stays bounded and predictable.
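The sampling step can be sketched as a plain OpenAI-style chat completion request that carries a session identifier on every turn. This is a minimal sketch under stated assumptions: the header name below is a placeholder (the actual session affinity header is documented by Fireworks, not here), and the base URL, model name, and helper function are supplied by the caller or invented for illustration.

```python
import json
from typing import Any, Dict, List

# Hypothetical header name; look up the real session affinity header in the docs.
SESSION_HEADER = "x-session-affinity"


def build_rollout_request(base_url: str, api_key: str, model: str,
                          messages: List[Dict[str, str]], session_id: str,
                          max_tokens: int = 512) -> Dict[str, Any]:
    """Build (but do not send) one turn of a multi-turn rollout request."""
    return {
        # Standard OpenAI-compatible chat completions path.
        "url": base_url.rstrip("/") + "/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            # Reusing the same session ID on each turn is what lets the
            # server route the request back to a replica with warm KV cache.
            SESSION_HEADER: session_id,
        },
        "body": json.dumps({
            "model": model,
            "messages": messages,
            "max_tokens": max_tokens,
        }),
    }
```

The returned dictionary can be passed to any HTTP client; keeping the session ID stable for the lifetime of an episode is the important part.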

With RL Rollouts on Fireworks, your rollout fleet runs at production-inference speed from day one.

FROM THE BLOG

Learnings from RL Leaders

3/28/2026

The Fine-Tuning Bottleneck Isn't the Algorithm

Developer Experience | 3/22/2026

Frontier RL Is Cheaper Than You Think

Developer Experience | 3/10/2024

Training-Inference Parity in MoE Models: Where Numerics Drift

GET STARTED

Ready to run RL at scale?

Tell us about your trainer, model size, and current rollout latency, and we will size the speedup and help design the right architecture for your workload.

Get Started Today

FAQ

Common Questions

Can I use my own RL trainer (e.g., PyTorch FSDP, Megatron, Ray) with Fireworks for rollouts?

Yes. Fireworks' Bring Your Own Trainer (BYOT) integration lets you run your own training loop while offloading large-scale inference to Fireworks. You create a hot-load deployment pointed at your external bucket (S3, MinIO, Nebius), upload checkpoints on your own cadence, signal Fireworks when a new snapshot is ready, and run rollouts via the standard OpenAI-compatible API. Fireworks handles the distributed weight swap, KV cache management, and inference serving. Note that this is currently an early access feature; contact Fireworks to enable it on your account.

More info: https://docs.fireworks.ai/fine-tuning/rl-rollout-integration

Do I need to upload a full model checkpoint every training step?

No, and you shouldn't. The recommended approach is to upload a full HuggingFace-format checkpoint for the first step and then every 20th or 30th step after that. For all intermediate steps, publish an incremental snapshot using the ARC2 (arc_v2) format, which diffs against the currently loaded snapshot. Incremental snapshots significantly reduce both upload time and load time during training. If an incremental hot-load fails, fall back to a new full snapshot.

More info: https://docs.fireworks.ai/fine-tuning/rl-rollout-delta-checkpoints
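The cadence described above can be expressed as a small decision helper. This is a sketch only: the function name, the string labels, and the reading of "every 20th step after that" as steps 1, 21, 41, and so on are assumptions; adjust the interval to whatever your run uses.

```python
def choose_snapshot_kind(step: int, full_every: int = 20,
                         last_incremental_failed: bool = False) -> str:
    """Decide whether a training step should publish a full or incremental snapshot.

    step is 1-based. A full snapshot is published on the first step, then
    every `full_every` steps after it, and whenever the previous incremental
    hot-load failed.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if full_every < 1:
        raise ValueError("full_every must be >= 1")
    if last_incremental_failed:
        return "full"  # fall back to a new full snapshot after a failed delta
    if (step - 1) % full_every == 0:
        return "full"  # steps 1, 1 + full_every, 1 + 2 * full_every, ...
    return "incremental"
```

In a training loop, the result would pick between writing a full HuggingFace-format checkpoint and writing an incremental snapshot for that step.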

What happens to in-flight inference requests when Fireworks swaps to a new checkpoint?

It depends on which transition mode is configured on your deployment. The default for RL is async transition: in-flight requests are paused during the weight swap and then resumed on the same HTTP connection with their KV state intact, so they continue streaming rather than restarting. New requests are queued until the swap completes, which may show up as elevated time-to-first-token. No 4xx/5xx errors are returned for the swap itself, though you can set the x-fireworks-hot-load-drain-timeout request header (default 90 seconds) to receive an HTTP 425 if the swap exceeds that window. The alternative is synchronous transition, where in-flight requests finish on the old weights before the swap begins, and new requests receive HTTP 425 until the swap is done.

More info: https://docs.fireworks.ai/fine-tuning/rl-rollout-debugging
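A rollout client may want to treat HTTP 425 as "swap in progress, try again shortly." The sketch below wraps any request function with exponential backoff on 425. The `send` callable, the retry policy, and the assumption that the drain-timeout header takes a value in seconds are illustrative; only the header name and the 425 status come from the description above.

```python
import time
from typing import Callable, Dict, Tuple

DRAIN_TIMEOUT_HEADER = "x-fireworks-hot-load-drain-timeout"


def send_with_425_retry(send: Callable[[Dict[str, str]], Tuple[int, str]],
                        headers: Dict[str, str],
                        max_attempts: int = 5,
                        base_delay_s: float = 1.0,
                        sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call `send(headers)` and retry with exponential backoff while it returns 425.

    `send` is any function that performs the HTTP request and returns
    (status_code, body). The last response is returned if retries run out.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    status, body = 0, ""
    for attempt in range(max_attempts):
        status, body = send(headers)
        if status != 425:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay_s * (2 ** attempt))  # 1s, 2s, 4s, ...
    return status, body
```

For example, passing `{DRAIN_TIMEOUT_HEADER: "30"}` in `headers` would, under the seconds assumption, ask the server to return 425 sooner than the default so the client can back off and retry on its own schedule.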

How do I check what weights my deployment is currently serving, or recover from a stuck state?

Run firectl get ledger <deployment_id> to dump the full snapshot history. Each row shows the snapshot identity, whether it was a full or incremental load, per-replica readiness timestamps, and any load errors. If the deployment itself is unhealthy (e.g., OOM during merge, crash loop), run firectl deployment get <deployment_id> and check status and latestStatus.reason alongside the ledger. If the delta chain is wedged, you can reset the ledger entirely via a DELETE call to the ledger endpoint. This clears server-side history without deleting the deployment. After a reset, your next signal must be a full snapshot since there is nothing to diff against.

More info: https://docs.fireworks.ai/fine-tuning/rl-rollout-debugging

How does prompt cache behave across checkpoint swaps in a multi-turn rollout session?

You control this per snapshot via the reset_prompt_cache field in the hot-load signal request. The current default (all) refills prompt cache broadly after the swap. Setting it to new_session preserves the cache namespace for existing multi-turn session IDs while new sessions refill. Setting it to none preserves prompt cache state entirely across the swap. This field only affects what can be reused after the swap; it does not interrupt the active turn in an in-flight request.

More info: https://docs.fireworks.ai/fine-tuning/rl-rollout-debugging
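A small helper can guard against typos in the cache mode before the signal is sent. This is a sketch under assumptions: `reset_prompt_cache` and its three values come from the answer above, while the `snapshot_id` field and the function name are hypothetical placeholders for whatever the real signal request contains.

```python
from typing import Dict

VALID_RESET_MODES = ("all", "new_session", "none")


def build_hot_load_signal(snapshot_id: str,
                          reset_prompt_cache: str = "all") -> Dict[str, str]:
    """Build a hot-load signal payload with a validated prompt cache mode.

    `snapshot_id` is a hypothetical field name used for illustration only.
    """
    if reset_prompt_cache not in VALID_RESET_MODES:
        raise ValueError(
            f"reset_prompt_cache must be one of {VALID_RESET_MODES}, "
            f"got {reset_prompt_cache!r}"
        )
    return {
        "snapshot_id": snapshot_id,
        "reset_prompt_cache": reset_prompt_cache,
    }
```

For long multi-turn episodes that span a swap, `new_session` is the mode that keeps existing sessions' cache namespace while new sessions refill.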