There is a growing assumption in AI that training a frontier model is a king's game: only the few players who can afford one giant cluster get to compete.
That framing has some force in pretraining. Systems like Colossus are built to keep a huge synchronous training job saturated.
But it does not transfer cleanly to RL. In RL, the competitive question is not just "who owns the biggest cluster?" It is "who can keep a rollout fleet on the freshest policy without wasting time and capacity on full reloads?"
If policy refresh can be done through periodic full snapshots, compact deltas, and hot-load updates across regions, then teams can use distributed capacity much more effectively. That lowers the infrastructure bar and makes frontier training less monopolistic than the giant-supercluster story suggests.
If RL is treated as a giant-cluster problem, then frontier training starts to look like a market only a few companies can enter.
That is the real concern behind the hype around mega clusters: a handful of players control the only viable path to frontier models, and everyone else gets boxed out by infrastructure economics before they get to compete on algorithm quality or product execution.
Cross-region RL changes that equation. If rollout fleets can stay warm, policy updates can be hot-loaded, and distributed inference capacity can be used efficiently, then more teams can compete seriously. Fireworks pushes in that direction by making cross-region RL practical with stable inference APIs, faster policy refresh, admission continuity, and prompt-cache efficiency while training and rollout inference run asynchronously.
In RL, the trainer and the rollout fleet play different roles: the trainer runs synchronous optimization and periodically emits checkpoints, while the rollout fleet serves inference traffic to collect on-policy data.
If every policy refresh is a full checkpoint copy, then RL starts to inherit the same giant-cluster economics.
But that is the wrong unit of movement.
Cross-region RL does not require shipping optimizer state, activations, or a full set of model weights on every step. The rollout fleet only needs the next inference-ready snapshot of the policy.
Once that handoff becomes cheap and reliable, the entire systems picture changes.
On the Fireworks platform, the trainer and the inference fleet are intentionally decoupled.
The trainer runs locally where it makes sense to do synchronous training, and periodically emits inference-ready checkpoints. The rollout side keeps serving traffic through the regular inference API. The bridge between them is the hot-load system, which distributes each new snapshot to the running deployments.
That is a very different problem from "stretch synchronous model-parallel training across continents." It is a snapshot distribution problem, not an all-reduce problem.
This is the architecture Fireworks used to support Cursor's Composer 2 training run. Federico Cassano wrote that the Composer 2 RL run was "distributed across 3 (sometimes 4) different clusters around the world." That is the point: this is not a hypothetical systems design. It is how Fireworks helped Cursor train a frontier-level model. (Federico Cassano on X)
In practice, the cadence can look like this: publish one full snapshot every N steps, then publish compact incremental snapshots in between. In the example below, N = 25, and the deltas stay in roughly the 3-5% range of the full weights. That transfer pattern is radically different from "ship the whole model every time."
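The arithmetic behind that claim is easy to check. In the sketch below, the full-snapshot size is a hypothetical assumption (roughly a 70B-parameter model in bf16); only N = 25 and the ~3-5% delta range come from the cadence described above.

```python
import math

FULL_GB = 140.0        # assumed full-snapshot size (hypothetical)
DELTA_FRACTION = 0.04  # deltas at ~4% of the full weights
N = 25                 # one full snapshot every N publishes

def transfer_gb(num_publishes: int) -> float:
    """Total gigabytes moved across regions for a run of policy updates."""
    fulls = math.ceil(num_publishes / N)   # periodic full snapshots
    deltas = num_publishes - fulls         # everything else is a delta
    return fulls * FULL_GB + deltas * FULL_GB * DELTA_FRACTION

naive = 100 * FULL_GB        # ship full weights on every publish
cadenced = transfer_gb(100)  # 4 fulls + 96 compact deltas
```

Under these assumptions, 100 policy updates move on the order of 1 TB instead of 14 TB, roughly an order of magnitude less traffic for the same number of refreshes.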
arc_v2 is Fireworks' checkpoint delta format for hot-loaded inference snapshots.
At a high level, it stores a compressed delta against the previous snapshot, plus the metadata needed to reconstruct and verify the next one. Across nearby checkpoints, most tensors change only a little, so the delta stays small and compresses well.
Under the hood, the flow is roughly:
base checkpoint: full HF safetensors
delta checkpoint: delta(previous_snapshot, current_snapshot) -> zstd -> safetensors metadata with checksums
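The flow above can be sketched in a few lines, under stated assumptions: tensors are treated as raw byte buffers, the delta is a bytewise XOR against the previous snapshot (lossless, and mostly zeros when nearby checkpoints barely differ, so it compresses well), and stdlib zlib stands in for zstd to keep the example dependency-free. This illustrates the shape of the handoff, not arc_v2's actual encoding.

```python
import hashlib
import zlib

def make_delta(prev: bytes, curr: bytes) -> dict:
    """Package curr as a compressed delta against prev, plus a checksum."""
    assert len(prev) == len(curr), "snapshots must have identical layout"
    xor = bytes(p ^ c for p, c in zip(prev, curr))
    return {
        "payload": zlib.compress(xor),
        # Checksum of the *reconstructed* snapshot, so the receiver
        # verifies the result of applying the delta, not just the transfer.
        "sha256": hashlib.sha256(curr).hexdigest(),
    }

def apply_delta(prev: bytes, pkg: dict) -> bytes:
    """Reconstruct the next snapshot; refuse to load on any mismatch."""
    xor = zlib.decompress(pkg["payload"])
    curr = bytes(p ^ x for p, x in zip(prev, xor))
    if hashlib.sha256(curr).hexdigest() != pkg["sha256"]:
        raise ValueError("checksum mismatch: refusing to hot-load snapshot")
    return curr
```

The checksum on the reconstructed bytes, rather than on the wire payload, is what makes the reconstruction deterministic and verifiable end to end.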
These design choices are operational, not just conceptual: the trainer writes a specific delta format, and the inference side reconstructs and verifies it deterministically.
Recent prior art points in the same general direction. Miahi and Belilovsky study weight-update sparsity in distributed RL and propose PULSE, a lossless sparse-encoding scheme that transmits only changed parameter indices and values. Their mechanism is different from arc_v2 — sparse parameter patches versus compressed snapshot deltas for hot-loaded inference checkpoints — but the shared conclusion is important: distributed RL does not have to mean shipping a full checkpoint on every update. (Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL)
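The sparse-update idea can be illustrated in miniature (this is not PULSE's actual wire format, just the concept the paper describes): transmit only the indices and values of parameters that changed, and patch them into the previous snapshot on the receiver.

```python
def sparse_encode(prev: list, curr: list) -> list:
    """Return (index, new_value) pairs for parameters that changed."""
    return [(i, c) for i, (p, c) in enumerate(zip(prev, curr)) if p != c]

def sparse_apply(prev: list, patch: list) -> list:
    """Reconstruct the new snapshot by patching changed entries."""
    out = list(prev)
    for i, v in patch:
        out[i] = v
    return out
```

When only a small fraction of parameters move between updates, the patch is far smaller than the full parameter list, which is the same economic observation arc_v2 exploits with compressed snapshot deltas.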
On the inference side, Fireworks does the reverse operation: fetch the arc_v2 payload, reconstruct the full snapshot against the previous one, and verify it before swapping. Because the deployment is built for hot loading, you are not doing a cold restart of the serving stack for every policy update. You are swapping weights on a running inference system that already knows where to fetch the snapshot, how to monitor readiness, and how to keep the rollout API stable.
The important point is architectural: if the sampler fleet can stay warm while policy updates arrive as compact deltas, then the fleet does not need to live next to the trainer.
Compression alone is not enough. A usable cross-region RL system also needs orchestration: publishing snapshots, signaling hot loads, confirming readiness, and keeping the rollout API stable throughout.
That workflow matters. On the Fireworks platform, the trainer uploads a snapshot, signals hot load, waits for the deployment to report readiness, and then continues rollouts through the same inference APIs teams already use for production generation.
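That four-step loop can be sketched as a small control function. None of these client methods are real Fireworks SDK calls; they are hypothetical names standing in for the steps the workflow describes: upload, signal, wait for readiness, resume rollouts.

```python
import time

def publish_and_resume(client, snapshot_path: str, deployment: str,
                       timeout_s: float = 300.0) -> None:
    """Push one policy update through a hot-load cycle (illustrative only)."""
    snapshot_id = client.upload_snapshot(snapshot_path)   # 1. upload
    client.request_hot_load(deployment, snapshot_id)      # 2. signal hot load
    deadline = time.monotonic() + timeout_s
    while client.deployment_status(deployment) != "READY":  # 3. wait
        if time.monotonic() > deadline:
            raise TimeoutError("deployment never reported readiness")
        time.sleep(1.0)
    client.resume_rollouts(deployment)                    # 4. continue rollouts
```

The readiness gate is the important part: rollouts only continue once the deployment confirms it is serving the new policy, so the loop never samples from a half-loaded snapshot.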
Another practical benefit is that Fireworks keeps training and rollout inference running asynchronously. The trainer can publish updates while the rollout side keeps admitting requests, preserving prompt-cache efficiency and maximizing time spent both serving inference and collecting on-policy data.
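One way to picture that continuity is a policy object whose version can be swapped under a lock while request handlers keep reading it. This is an illustrative sketch of the pattern, not the Fireworks serving internals, where the swap happens inside the inference engine.

```python
import threading

class HotSwappablePolicy:
    """Serve requests continuously while policy updates land in the background."""

    def __init__(self, version: int):
        self._lock = threading.Lock()
        self._version = version

    def generate(self, prompt: str) -> str:
        # Requests always see a consistent policy version; there is no
        # restart and no queue drain, so caches stay warm across updates.
        with self._lock:
            v = self._version
        return f"[policy v{v}] completion for: {prompt}"

    def hot_load(self, version: int) -> None:
        # Called once a new snapshot has been reconstructed and verified.
        with self._lock:
            self._version = version
```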
That matters because doing inference is hard. Doing inference for RL is harder. If a platform already gives you hot loads, admission continuity, cache reuse, and stable rollout serving, that is exactly the plumbing you want to offload.
The result is a much simpler RL loop than the traditional "bespoke training cluster plus custom sampler stack plus hand-managed weight propagation" setup.
That also creates a clean integration point for teams using a Tinker-style trainer stack: keep the trainer where it is, publish inference-ready snapshots, and let Fireworks handle rollout serving, hot-load updates, and cross-region orchestration.
The one-supercluster view of RL usually comes from importing a pretraining mental model into a different workload. Each of the standard objections gets much weaker once you treat RL as a policy-handoff problem instead of a full-cluster-copy problem.
Take the bandwidth objection: that policy updates cannot reach the sampler fleet fast enough. That is true if "reaching the sampler fleet" means copying a full model checkpoint every time.
It is much less true if the fleet receives a compressed delta, reconstructs it locally, and hot-loads it without a full service restart. You are no longer paying full-checkpoint transfer costs on every step.
This is exactly the assumption arc_v2 attacks.
The same goes for the objection that RL needs one contiguous cluster. Most organizations do not have one perfectly contiguous giant cluster sitting idle for rollouts. They have capacity scattered across regions, with different supply, cost, and availability profiles. If policy refresh becomes lightweight enough, that global inference capacity becomes usable for RL.
Instead of waiting for one monolithic cluster, you can treat distributed rollout deployments as part of the same training system.
What about fragility? A cross-region loop is fragile if every update is a bespoke, error-prone dance.
It becomes far less fragile when the platform gives you a typed snapshot format, checksums, readiness signals, and a stable rollout API. Make the update path explicit, typed, and verifiable, and "distributed" stops meaning "chaotic."
Put differently, the wrong mental model is:
one giant trainer cluster
+ one giant co-located sampler cluster
+ full model copy on every update
= the only viable RL architecture
The better one is:
trainer publishes inference-ready snapshots
-> base snapshot occasionally
-> compact delta snapshots most of the time
-> hot-load rollout deployments across regions
-> sample through a standard inference API
That architecture does not eliminate all latency. It just eliminates the wrong latency. The trainer still does synchronous training where it needs to. But policy distribution to the rollout fleet becomes an incremental snapshot problem, which is far more tractable.
Giant co-located clusters can be the right tool for synchronous pretraining. That does not imply that RL rollouts must live on one giant co-located cluster.
If the trainer publishes inference-ready snapshots, most updates are compact deltas, and the inference fleet can hot-load those updates while staying online, then cross-region RL is practical.
The key question is not whether regions exist. It is whether your platform can make policy updates small, verifiable, and routine.
That is the design point behind the Fireworks Training SDK: long-context training, full-parameter tuning, and cross-region RL rollouts on top of the same inference platform teams already use in production.
If you want to explore that setup, start with the Fireworks Training SDK introduction, which covers the Tinker-compatible control loop, checkpointing, weight sync, and training/sampling workflow.
If you want to talk through a rollout architecture or pair a Tinker-style trainer with Fireworks-hosted rollouts, email [email protected] or reach out on Discord.