Frontier RL Is Cheaper Than You Think

The conventional wisdom on RL infrastructure is wrong. Cross-region rollouts, compact deltas, and hot-load updates let teams use distributed capacity effectively instead of waiting for one mega cluster.

Cross-Region RL Weight Update Loop. Policy trainer, shared checkpoints, global rollout fleet: the trainer publishes a full base checkpoint every N steps (N = 25 in this example) and compact deltas in between, with roughly 2% of weights changing per step. Rollout clusters in US Ohio (+43 ms), US Virginia (+58 ms), and EU Frankfurt (+145 ms) reconstruct each checkpoint from shared storage, and each delta is more than 98% smaller than a full 1024 GiB copy.

The conventional wisdom on RL infrastructure is wrong, and it is costing teams that could be competing at the frontier.

The entire mega-cluster narrative rests on a single assumption: that you have to ship 1 TB of weights every time you update your rollout fleet. You do not.

Researchers have spent the last year writing about asynchronous RL and rollout-training disaggregation in systems like AReaL. Teams like Kimi and MiniMax have also published engineering notes on RL parameter updates and asynchronous scheduling. We have been running that pattern in production.

That mega-cluster instinct comes from pretraining, where the main systems problem is keeping one huge synchronous training job saturated. RL is a different problem. The question is not just how to run the trainer. It is also how to keep a large rollout fleet generating data from a fresh enough policy without constantly stalling on full checkpoint transfers.

RL infrastructure in 30 seconds

An RL training run has two jobs:

  • The trainer does forward pass, reward computation, backward pass, and parameter updates.
  • The rollout fleet samples trajectories from the current policy, i.e. runs inference on the latest updated model.

The trainer needs dense, tightly coupled hardware. The rollout fleet needs inference throughput across many parallel requests. Pretraining only has the first job. RL has both, which is why the infrastructure question is different.
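The two-job split can be sketched as a toy control loop. Everything here is an illustrative stand-in (the `Trainer` and `RolloutFleet` classes and their methods are invented for this sketch, not a real API):

```python
import random

class Trainer:
    """Stand-in for the dense training job: consumes rollouts, updates the policy."""
    def __init__(self):
        self.policy_version = 0

    def step(self, batch):
        # forward pass, reward computation, backward pass, parameter update
        self.policy_version += 1
        return sum(reward for _, reward in batch) / len(batch)  # toy "mean reward"

class RolloutFleet:
    """Stand-in for the inference fleet: samples trajectories from the current policy."""
    def __init__(self):
        self.policy_version = 0

    def sample(self, n):
        # in reality: many parallel inference requests against the serving stack
        return [(f"trajectory-{i}", random.random()) for i in range(n)]

    def update_weights(self, version):
        self.policy_version = version  # in reality: fetch and apply a compact delta

trainer, fleet = Trainer(), RolloutFleet()
for _ in range(3):
    batch = fleet.sample(8)                       # rollout job
    trainer.step(batch)                           # training job
    fleet.update_weights(trainer.policy_version)  # keep the fleet fresh
```

The interesting systems work lives entirely in `update_weights`: the rest of this post is about making that handoff cheap.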

The 1 TB problem

A typical frontier checkpoint is around 1 TB. If every policy refresh required shipping that full checkpoint to the rollout fleet, then the natural conclusion would be that RL needs one giant co-located cluster with RDMA-class internal networking. Keep trainer and inference on the same fabric, avoid long-distance transfers, and treat remote capacity as second class.

That is the mega-cluster story. It makes frontier RL look like a market only a handful of companies can enter, because everyone else gets boxed out by infrastructure economics before they even get to compete on algorithms or product execution.

But the premise is wrong. You do not need to move the full 1 TB on every update.

The key insight: exploiting 98% sparsity

Between nearby RL checkpoints, most weights change only a little. That makes it practical to send a compressed delta against the previous checkpoint instead of sending the full 1 TB again.

Last year, we empirically observed that more than 98% of weights in bf16 format remain bit-equivalent between consecutive checkpoints, and the unchanged fraction is even higher at lower precision. Our intuition was that post-training updates are extremely fine-grained and that RL provides a very sparse information signal, just a few bits per rollout. In practice that means RL training uses a fairly small learning rate, and most parameters move only slightly in fp32. Those changes often do not cross the threshold required to alter their 16-bit or lower-precision representation. A recently published paper, Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL, provides a theoretical foundation for the same phenomenon and reports similarly high sparsity, often around 99% in practical RL settings.
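You can see the effect in a few lines. This toy sketch simulates bf16 by truncating float32 to its upper 16 bits (which is exactly what bf16 is), applies a small RL-style update, and counts how many values are still bit-identical; the weight and update distributions are illustrative assumptions, not measurements from a real run:

```python
import random
import struct

def bf16_bits(x: float) -> int:
    """bf16 is the upper 16 bits of the float32 encoding (sign, exponent, 7 mantissa bits)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(100_000)]

# A tiny RL-style update: small learning rate times a unit-scale gradient.
lr = 1e-5
updated = [w + lr * random.gauss(0, 1) for w in weights]

unchanged = sum(bf16_bits(a) == bf16_bits(b) for a, b in zip(weights, updated))
print(f"bit-equivalent in bf16: {unchanged / len(weights):.1%}")
```

With a learning rate this small, the vast majority of entries never cross a bf16 rounding boundary, so they come out bit-identical even though every fp32 value moved.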

In the sample setup behind this post, a full checkpoint is 1024 GiB. The average delta between adjacent checkpoints is 20.3 GiB, or 1.98% of the full model. Over the 50-step window shown below, that cuts cross-region transfer volume by about 94% compared to moving the full model every time.
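The arithmetic behind that 94% figure is worth making explicit, using the numbers from this post's sample setup:

```python
full_gib = 1024.0   # full checkpoint size
delta_gib = 20.3    # average adjacent-checkpoint delta (~1.98% of full)
steps, n = 50, 25   # 50-step window, full checkpoint every N = 25 steps

fulls = steps // n                      # 2 full checkpoints in the window
deltas = steps - fulls                  # 48 delta updates
with_deltas = fulls * full_gib + deltas * delta_gib
naive = steps * full_gib                # ship the full model every step

print(f"delta pipeline: {with_deltas:.1f} GiB vs naive: {naive:.1f} GiB")
print(f"savings: {1 - with_deltas / naive:.1%}")  # -> savings: 94.1%
```

The savings are dominated by the delta size; the periodic full checkpoints are what keep the chain short enough to reconstruct quickly.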

Checkpoint Cadence. This visual shows the intended RL update rhythm: a periodic full checkpoint, then delta-compressed weight updates in between, so the rollout fleet usually receives only about 2% of the full model.

The cadence looks like this: publish a full base checkpoint every N steps, then ship compressed deltas in between. The compression focuses on sending only the changed weights, with checksummed reconstruction so every rollout cluster can rebuild the exact checkpoint losslessly from shared storage.
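A minimal sketch of that idea, operating on raw byte buffers standing in for checkpoint shards (a production codec would additionally bit-pack tensor slices and carry previous-snapshot, shape, and dtype metadata; this shows only the changed-chunks-plus-checksum core):

```python
import hashlib
import zlib

def encode_delta(prev: bytes, curr: bytes, chunk: int = 64) -> dict:
    """Keep only the chunks that changed, plus a checksum of the full target."""
    changed = {
        i: curr[i:i + chunk]
        for i in range(0, len(curr), chunk)
        if curr[i:i + chunk] != prev[i:i + chunk]
    }
    return {
        "chunks": {i: zlib.compress(c) for i, c in changed.items()},
        "checksum": hashlib.sha256(curr).hexdigest(),
    }

def apply_delta(prev: bytes, delta: dict) -> bytes:
    buf = bytearray(prev)
    for i, c in delta["chunks"].items():
        raw = zlib.decompress(c)
        buf[i:i + len(raw)] = raw
    curr = bytes(buf)
    # Verify before swapping weights in: reconstruction must be lossless.
    assert hashlib.sha256(curr).hexdigest() == delta["checksum"]
    return curr

prev = bytes(4096)                                # stand-in "previous checkpoint"
curr = bytearray(prev); curr[100:104] = b"\x01\x02\x03\x04"
delta = encode_delta(prev, bytes(curr))
assert apply_delta(prev, delta) == bytes(curr)    # exact, verified reconstruction
```

The checksum check is what makes the delta chain safe to replicate across regions: a cluster either rebuilds the exact checkpoint or refuses to swap.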

Delta-Compressed Weight Updates. This animation shows the weight-update flow: identify changed weights, encode only those deltas into a compact payload, and reconstruct the next checkpoint exactly on the rollout side.

The point is that delta-compressed weight updates make cross-region synchronization practical over ordinary network links, without requiring trainer and rollout inference to share one RDMA fabric.

Async RL and pipelining

Asynchronous RL (also often called Pipeline RL) explicitly trades being a little off-policy for much better compute efficiency. Idle samplers are expensive, and a little policy staleness is often acceptable if it keeps training and rollout generation overlapped.

That tradeoff only works if policy updates move quickly enough. Delta-compressed weight updates make it practical by keeping the handoff small: distributing a new checkpoint across globally distributed rollout clusters takes only a few minutes end to end, which bounds the overall off-policy delay. The actual weight swap in GPU memory can stay well under a minute because most of the work, especially download and decompression, is pipelined ahead of the swap itself.

Trainer implementation is pipelined too. Every training step can upload updated weights to shared object storage, with each rank caching its previous upload and transmitting only the diff against the new weights. Upload is sharded across training GPUs, download is sharded across inference replicas, and compression, transfer, and signaling are pipelined in the background so training is never blocked.
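A per-rank sketch of that pattern, with a plain dict standing in for object storage and a background thread standing in for the upload pipeline (all names are illustrative):

```python
import queue
import threading

class RankUploader:
    """Per-rank sketch: cache the last upload and ship only the diff, off the training thread."""
    def __init__(self, store: dict):
        self.store = store      # stand-in for shared object storage
        self.prev = None        # this rank's previously uploaded shard
        self.q = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def publish(self, step: int, shard: bytes):
        self.q.put((step, shard))  # training thread returns immediately

    def _worker(self):
        while True:
            step, shard = self.q.get()
            if self.prev is None:
                self.store[f"step-{step}-full"] = shard        # first upload is full
            else:
                diff = [(i, b) for i, (a, b) in enumerate(zip(self.prev, shard)) if a != b]
                self.store[f"step-{step}-delta"] = diff        # then only changed bytes
            self.prev = shard
            self.q.task_done()

store = {}
up = RankUploader(store)
up.publish(1, bytes([0] * 1000))
up.publish(2, bytes([0] * 999 + [7]))  # one byte changed between steps
up.q.join()
assert len(store["step-2-delta"]) == 1  # only the changed byte travels
```

The key property is that `publish` never blocks the training step; diffing and transfer happen behind it.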

Policy Freshness Timeline. The real payoff is policy freshness: asynchronous RL tolerates a few minutes of off-policy delay, while the actual GPU-memory weight swap stays under a minute when download and decode are pipelined ahead of time.

The practical payoff is straightforward: less time waiting on checkpoint movement during synchronization, and more throughput generating rollouts on fresher weights.

A note on staleness

Running trainer and rollout fleet asynchronously means the fleet is always serving a policy that is a few steps behind the trainer. That gap is called staleness, and it is a real tradeoff worth naming directly.

Our systems layer does not eliminate staleness. The algorithm has to tolerate some off-policy data. What the systems layer can do is keep the gap bounded and predictable. Delta-compressed updates do that by shrinking policy movement into a routine background operation rather than a stop-the-world event.
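One simple way a systems layer can enforce that bound is a staleness gate on incoming rollouts; this sketch (with an illustrative `MAX_LAG` threshold, not a value from any real deployment) shows the shape of the check:

```python
MAX_LAG = 4  # tolerate rollouts at most 4 policy versions behind the trainer

def usable(rollout_version: int, trainer_version: int) -> bool:
    """Accept off-policy data only while the version gap stays bounded."""
    return 0 <= trainer_version - rollout_version <= MAX_LAG

assert usable(41, 43)      # a little stale: fine for async RL
assert not usable(38, 43)  # too stale: wait for a weight update instead
```

Because delta updates land in minutes, rollouts rarely trip this gate in practice; the gate just turns "staleness" from an unbounded risk into an explicit, tunable parameter of the run.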

Multi-region rollout capacity

This is where the systems point becomes strategic. Most teams do not have one perfectly contiguous giant cluster sitting idle for rollouts. They have capacity scattered across regions, clouds, and availability zones, and assembling one giant sampler fleet in one place is often impractical even if the aggregate GPU count exists.

Once weight updates are small, that fragmented capacity becomes usable for RL instead of stranded. Each rollout cluster can independently download and reconstruct weights from the same shared delta chain, without any direct connection back to the training cluster.

This is not just a hypothetical systems design. Fireworks used this architecture to support Cursor's Composer 2 training run. Federico Cassano wrote that the Composer 2 RL run was "distributed across 3 (sometimes 4) different clusters around the world." (Federico Cassano on X)

That makes the rollout fleet elastic. You can add clusters, remove clusters, or rebalance across regions as capacity and cost change, while keeping all of them pointed at the same stream of policy updates. In Fireworks Virtual Cloud, those clusters are abstracted behind a single control plane so they appear as one uniform capacity pool rather than a set of separate regional deployments that the user has to manage individually.

When this approach matters less

This architecture is not the right answer in every RL setup.

  • If the model is small enough that trainer and rollout inference comfortably fit on one node or one compact cluster, bandwidth is not the main bottleneck and the simpler setup usually wins.
  • If checkpoints are emitted so frequently that the delta pipeline cannot distribute and apply one update before the next one matters, then staleness becomes the limiting factor and tighter co-location can make more sense.
  • If the rollout stack does not cleanly separate inference from training, then treating rollout workers like a standard inference fleet is a poor fit and the disaggregated design gets less attractive.

For frontier-scale models where those constraints do not hold, this setup is the practical way to turn fragmented capacity into usable RL throughput.

The Fireworks training platform

Fireworks supports three ways to run RL:

  • Fully managed RL: provide the dataset and evaluators, and Fireworks runs the trainer, rollout inference, checkpointing, and weight-update workflow.
  • Tinker-compatible SDK: customize the training flow while Fireworks hosts the GPU-heavy work for both training and inference.
  • Bring your own trainer: keep the trainer where it is, upload checkpoints to shared object storage, and use Fireworks for rollout serving and weight-update orchestration.

In the bring-your-own-trainer setup, the interface is still a normal inference deployment, with a few RL-specific additions:

  • OpenAI-compatible sampling endpoints for rollouts
  • a weight update API to signal that a new checkpoint is available
  • status reporting for update progress
  • sampling features for training-sampling numerical alignment, for example returning selected experts in MoE layers to implement router replay
  • fine-grained control of prompt caching behavior during weight updates
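In the bring-your-own-trainer setup, the trainer-side control flow around those endpoints is simple to sketch. Everything below is hypothetical: `client`, `notify_update`, and `update_status` are illustrative names for the weight-update and status-reporting calls, not the actual Fireworks API.

```python
import time

def publish_and_wait(client, checkpoint_uri: str, timeout_s: float = 300.0) -> str:
    """Signal that a new checkpoint is available, then poll until the rollout
    deployment reports the swap is complete. Hypothetical client interface."""
    update_id = client.notify_update(checkpoint_uri)   # weight update API
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if client.update_status(update_id) == "READY":  # status reporting
            return update_id
        time.sleep(1.0)
    raise TimeoutError(f"weight update {update_id} not ready in {timeout_s}s")

class StubClient:
    """In-memory stand-in so the sketch runs end to end."""
    def notify_update(self, uri: str) -> str:
        return "upd-1"
    def update_status(self, update_id: str) -> str:
        return "READY"

assert publish_and_wait(StubClient(), "s3://ckpts/step-0042") == "upd-1"
```

The trainer keeps training while this runs in the background; only the signal and the status poll touch the rollout side.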

The common requirement is the same in every deployment model: make model updates small, verifiable, and routine.

Giant co-located clusters can be the right tool for synchronous pretraining. That does not mean RL rollouts have to live on one giant co-located cluster. The important question is whether your system can keep a distributed sampler fleet updated cheaply and reliably enough to use whatever capacity is available.

If you want to explore that setup, start with the Fireworks Training SDK introduction, which covers the Tinker-compatible control loop, checkpointing, weight updates, and training/sampling workflow.

If you want to talk through a rollout architecture or pair a Tinker-style trainer with Fireworks-hosted rollouts, email [email protected] or reach out on Discord.