GLM 5.2 Fast is available! Opus-level intelligence at open-source rates. No contracts, pay per token. Start building.

Keep your trainer. Scale your rollouts on Fireworks.

The only dedicated rollout inference for teams running their own RL. Get elastic capacity, production-grade inference, and sub-minute weight updates.

Frontier Coding Intelligence

Cursor Improves Composer with Fireworks

Cursor uses Fireworks to generate large volumes of realistic coding trajectories for reinforcement learning. Fireworks handles the high-throughput, distributed inference underneath the training loop, so Cursor’s researchers can focus on improving Composer’s data, rewards, and behavior.

Why run rollouts on Fireworks

Your trainer stays. The rollouts run on Fireworks.

Point your rollout fleet at Fireworks through OpenAI-compatible sampling and a weight update API, and we run the inference layer for RL at scale. You own training. We run the rollouts: production-grade serving, sub-minute weight updates, and numerics built for RL.

WHAT YOU GET:

Faster, less expensive runs with optimized inference

Run rollouts with the same industry-leading Fireworks inference stack for maximum throughput.

Delta-compressed updates

Refresh rollout fleets with deltas, not full checkpoints. Typically ~2% of the model, swapped in under a minute.

Async RL, no idle GPUs

Keep trainer and rollout workers running continuously instead of waiting on each other.

Multi-region capacity

Scale across regions instead of standing up one large, co-located cluster.

Share capacity across product lines

Use capacity across product lines like RL rollouts, production inference, or training to maximize utilization.

Focus on RL, not rollout infrastructure.

.Build It YourselfFireworks
CapacityManage massive, co-located clusters for training and rollouts Elastic rollouts capacity across regions
UtilizationGPUs idle during synchronization Async RL keeps rollouts running
Weight UpdatesBuild checkpoint distribution yourself Delta-compressed updates built in
NumericsMaintain alignment yourself RL-specific sampling alignment features to help ensure model updates are verifiable.
TeamDedicated inference platform work Managed rollout infrastructure
Customer validation

Frontier teams run their rollouts on Fireworks

Cursor logo dark

“Fireworks has been an amazing partner getting our Fast Apply and Copilot++ models running performantly. They exceeded other competitors we reviewed on performance. After testing their quantized model quality for our use cases, we have found minimal degradation. Fireworks helps implement task specific speed ups and new architectures, allowing us to achieve bleeding edge performance!”

Sualeh Asif Testimonial
Sualeh Asif | CPO at Cursor
Notion logo dark

"By partnering with Fireworks to fine-tune models, we reduced latency from about 2 seconds to 350 milliseconds, significantly improving performance and enabling us to launch AI features at scale. That improvement is a game changer for delivering reliable, enterprise-scale AI."

Sarah Sachs
Sarah Sachs | AI Lead at Notion
Cursor logo dark

"I mean, the other alternative is we would build one in-house, but we have finite engineers like everybody else. We would prefer to have engineers make training more efficient and more precise rather than spin up an inference effort."

federico cassano
Federico Cassano | AI Researcher at Cursor
Harvey

“On Fireworks, combining open-source worker models with frontier tool use and post-training closes much of the gap to frontier performance on Legal Agent Benchmark, while improving cost efficiency and system controllability.”

Nico Grupen
Niko Grupen | Head of Applied Research at Harvey
Vercel Dark

"Vercel’s v0 model is a composite model. The SOTA in this space changes every day, so you don’t want to tie yourself to a single model. Using a fine-tuned reinforcement learning model with Fireworks, we perform substantially better than SOTA."

Malte Ubl, CTO at Vercel
Malte Ubl | CTO at Vercel
Cursor logo dark

“Fireworks has been an amazing partner getting our Fast Apply and Copilot++ models running performantly. They exceeded other competitors we reviewed on performance. After testing their quantized model quality for our use cases, we have found minimal degradation. Fireworks helps implement task specific speed ups and new architectures, allowing us to achieve bleeding edge performance!”

Sualeh Asif Testimonial
Sualeh Asif | CPO at Cursor
Notion logo dark

"By partnering with Fireworks to fine-tune models, we reduced latency from about 2 seconds to 350 milliseconds, significantly improving performance and enabling us to launch AI features at scale. That improvement is a game changer for delivering reliable, enterprise-scale AI."

Sarah Sachs
Sarah Sachs | AI Lead at Notion
Cursor logo dark

"I mean, the other alternative is we would build one in-house, but we have finite engineers like everybody else. We would prefer to have engineers make training more efficient and more precise rather than spin up an inference effort."

federico cassano
Federico Cassano | AI Researcher at Cursor
Harvey

“On Fireworks, combining open-source worker models with frontier tool use and post-training closes much of the gap to frontier performance on Legal Agent Benchmark, while improving cost efficiency and system controllability.”

Nico Grupen
Niko Grupen | Head of Applied Research at Harvey
Close the loop with Training

Or run the whole loop on Fireworks

Not running your own trainer yet? Fireworks also runs fully managed RL and a Tinker-compatible SDK, so you can train and serve on one stack. Same kernels, same quantization, no handoff. What you train is what you serve.

FAQ

Common Questions

Can I keep my own trainer?

Yes. Keep your trainer where it is, upload checkpoints to shared object storage, and use Fireworks for rollout serving and weight-update orchestration. Bring-your-own-trainer exposes OpenAI-compatible sampling, a weight update API, and status reporting.

How fast are weight updates?

Delta-compressed updates are typically about 2% of the full model. Distribution across regions takes a few minutes, and the in-memory GPU swap stays under a minute because download and decode are pipelined ahead of the swap.

How do you handle staleness?

Async RL tolerates a bounded, predictable off-policy delay. The systems layer keeps the gap small by making weight movement a routine background operation. Your algorithm tolerates the rest.

What about MoE numerics?

The same kernels run across the trainer and rollouts. Divergences are published per model, and router replay aligns expert selection so training and sampling stay numerically consistent.

How do I get started?

Rollouts run on reserved capacity. Start with a small proof of concept to validate the integration, then scale across regions. Talk to our team to size a deployment.

Scale your RL rollouts on Fireworks

Point your trainer at production-grade rollout inference, or run the full loop on one stack.