Why Building Mega Clusters Is Wrong


Frontier training is increasingly treated like a king's game, where only the few teams with one enormous cluster get to compete. Fireworks makes RL look different: cross-region rollouts, periodic full snapshots, compact deltas, and hot-load swaps let teams use distributed capacity effectively instead of waiting for one giant supercluster.

Cross-Region RL Hot-Load Loop: the policy trainer runs forward/backward and optimizer steps, publishes a full base snapshot every N steps (example: N = 25), and hands off compact deltas (typically 3-5% of the full weights) in between. Rollout fleets in US Ohio, US Virginia, and EU Frankfurt hot-load each update over 43-145 ms inter-region links, with a full fleet refresh ready in roughly 50 s, for roughly 90-95% less cross-region traffic than shipping full weights every step.


Why RL does not need a mega cluster

There is a growing assumption in AI that training a frontier model is a king's game: only the few players who can afford one giant cluster get to compete.

That framing has some force in pretraining. Systems like Colossus are built to keep a huge synchronous training job saturated.

But it does not transfer cleanly to RL. In RL, the competitive question is not just "who owns the biggest cluster?" It is "who can keep a rollout fleet on the freshest policy without wasting time and capacity on full reloads?"

If policy refresh can be done through periodic full snapshots, compact deltas, and hot-load updates across regions, then teams can use distributed capacity much more effectively. That lowers the infrastructure bar and makes frontier training less monopolistic than the giant-supercluster story suggests.

Why this matters for the market structure of AI

If RL is treated as a giant-cluster problem, then frontier training starts to look like a market only a few companies can enter.

That is the real concern behind the hype around mega clusters: a handful of players control the only viable path to frontier models, and everyone else gets boxed out by infrastructure economics before they get to compete on algorithm quality or product execution.

Cross-region RL changes that equation. If rollout fleets can stay warm, policy updates can be hot-loaded, and distributed inference capacity can be used efficiently, then more teams can compete seriously. Fireworks pushes in that direction by making cross-region RL practical with stable inference APIs, faster policy refresh, admission continuity, and prompt-cache efficiency while training and rollout inference run asynchronously.

In RL, the trainer and the rollout fleet play different roles:

  • the trainer needs to do synchronous training where it makes sense
  • the rollout fleet only needs the latest inference-ready policy
  • reloading full weights every update is expensive
  • keeping the rollout fleet warm matters more than forcing everything into one location

If every policy refresh is a full checkpoint copy, then RL starts to inherit the same giant-cluster economics.

But that is the wrong unit of movement.

Cross-region RL does not require shipping optimizer state, activations, or a full set of model weights on every step. The rollout fleet only needs the next inference-ready snapshot of the policy.

Once that handoff becomes cheap and reliable, the entire systems picture changes.

What matters is how the policy reaches the rollout fleet

On the Fireworks platform, the trainer and the inference fleet are intentionally decoupled.

The trainer runs locally where it makes sense to do synchronous training. It periodically emits inference-ready checkpoints. The rollout side keeps serving traffic through the regular inference API. The bridge between them is the hot-load system:

  1. Publish a full Hugging Face SafeTensors base snapshot.
  2. Publish incremental snapshots as compact deltas against the previous snapshot.
  3. Signal Fireworks that a new snapshot is ready.
  4. Hot-load the new weights onto the running rollout deployment.
  5. Resume generating rollouts against the fresh policy.
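As a sketch, the five-step loop above might look like the following. The `Store` class and every method name here are illustrative stand-ins for shared object storage plus the hot-load signal, not the actual Fireworks SDK API:

```python
FULL_EVERY = 25  # publish a full base snapshot every N steps (example: N = 25)

class Store:
    """In-memory stand-in for shared object storage plus the hot-load signal."""
    def __init__(self):
        self.log = []
    def publish_full(self, snap_id):
        self.log.append(("full", snap_id))      # step 1: full SafeTensors base
    def publish_delta(self, snap_id):
        self.log.append(("delta", snap_id))     # step 2: compact delta
    def signal_ready(self, snap_id):
        self.log.append(("ready", snap_id))     # step 3: tell the fleet

def publish_snapshot(step, store, have_base=True):
    """Steps 1-3 above; steps 4-5 happen on the rollout deployment."""
    snap_id = f"policy-model-step-{step:04d}"
    if step % FULL_EVERY == 0 or not have_base:
        store.publish_full(snap_id)
    else:
        store.publish_delta(snap_id)
    store.signal_ready(snap_id)
    return snap_id
```

Most steps take the delta branch; only every Nth step (or a forced base refresh) pays the full-snapshot cost.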

That is a very different problem from "stretch synchronous model-parallel training across continents." It is a snapshot distribution problem, not an all-reduce problem.

This is the architecture Fireworks used to support Cursor's Composer 2 training run. Federico Cassano wrote that the Composer 2 RL run was "distributed across 3 (sometimes 4) different clusters around the world." That is the point: this is not a hypothetical systems design. It is how Fireworks helped Cursor train a frontier-level model. (Federico Cassano on X)

In practice, the cadence can look like this: publish one full snapshot every N steps, then publish compact incremental snapshots in between. In the example below, N = 25, and the deltas stay in roughly the 3-5% range of the full weights. That transfer pattern is radically different from "ship the whole model every time."
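The arithmetic behind that traffic claim is easy to check. With N = 25 and a 4% delta (the mid-range of the 3-5% figure), the average per-step transfer works out to about 8% of the full weights:

```python
N = 25              # full snapshot every N steps
delta_frac = 0.04   # typical delta, mid-range of the 3-5% figure

# One full refresh plus (N - 1) compact deltas, averaged over the window.
avg_per_step = (1.0 + (N - 1) * delta_frac) / N
savings = 1.0 - avg_per_step
print(f"avg transfer/step: {avg_per_step:.1%}, savings vs full-every-step: {savings:.1%}")
```

Plugging in 3% and 5% deltas bounds the savings at roughly 93% and 91%, which is where the "roughly 90-95% less traffic" range comes from.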

Checkpoint Cadence: this visual shows the intended RL update rhythm over a 50-step sample window (seeded sample data for this post, with N = 25). A periodic full refresh ships 100% of the weights, and the compact deltas in between ship roughly 3-5% of full, so the rollout fleet usually sees only a small fraction of the full weights: roughly 90-95% less cross-region traffic than shipping a full refresh every step.

What arc_v2 actually does

arc_v2 is Fireworks' checkpoint delta format for hot-loaded inference snapshots.

At a high level, it stores a compressed delta against the previous snapshot, plus the metadata needed to reconstruct and verify the next one. Across nearby checkpoints, most tensors change only a little, so the delta stays small and compresses well.

Under the hood, the flow is roughly:

base checkpoint: full HF safetensors
delta checkpoint: delta(previous_snapshot, current_snapshot) -> zstd -> safetensors metadata with checksums
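A minimal sketch of that delta step, assuming tensors arrive as raw byte buffers. This is not the actual arc_v2 implementation: zlib stands in for zstd to keep the example stdlib-only, and the real format also carries SafeTensors metadata (shape, dtype, prev_snapshot_id) alongside the checksums:

```python
import hashlib
import zlib

def encode_delta(prev: dict[str, bytes], curr: dict[str, bytes]) -> dict:
    """Per-tensor delta -> compress -> checksum metadata (illustrative)."""
    payload, checksums = {}, {}
    for name, curr_bytes in curr.items():
        # XOR against the previous snapshot: unchanged bytes become zeros,
        # which is what lets the compressor collapse them.
        diff = bytes(a ^ b for a, b in zip(prev[name], curr_bytes))
        payload[name] = zlib.compress(diff)
        # Checksum of the *reconstructed* tensor, so the load side can verify.
        checksums[name] = hashlib.sha256(curr_bytes).hexdigest()
    return {"payload": payload, "tensor_checksums": checksums}
```

The key property is visible even in this toy version: when adjacent checkpoints agree on most bytes, the diff is mostly zeros and compresses to a small fraction of the tensor.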
arc_v2 Delta Compression: this animation shows the `arc_v2` flow in three stages. First, compute a per-tensor delta: adjacent checkpoints differ in a few chunks, not everywhere, so unchanged chunks collapse to zero and only sparse tensor deltas survive into the payload. Second, package the changed tensor chunks, compress them with zstd, and attach reconstruction metadata (prev_snapshot_id, per-tensor checksum, shape and dtype). Third, hot-load reconstruction: fetch the previous snapshot, decode, apply the delta, verify the checksum, and swap the weights in place.

A few design choices make this operational:

  • The base snapshot is standard Hugging Face SafeTensors plus the config and tokenizer artifacts needed for inference.
  • The delta snapshot stores per-tensor checksum metadata so the load side can verify reconstruction.
  • The trainer can periodically force a fresh base save, which resets the delta chain and improves recoverability.
  • There is an end-to-end round-trip test in the repo that verifies trainer-produced deltas reconstruct correctly on the inference hot-load path, including a 2-GPU FSDP case.

The point is operational, not just conceptual: the trainer writes a specific delta format, and the inference side reconstructs and verifies it deterministically.

Recent prior art points in the same general direction. Miahi and Belilovsky study weight-update sparsity in distributed RL and propose PULSE, a lossless sparse-encoding scheme that transmits only changed parameter indices and values. Their mechanism is different from arc_v2 — sparse parameter patches versus compressed snapshot deltas for hot-loaded inference checkpoints — but the shared conclusion is important: distributed RL does not have to mean shipping a full checkpoint on every update. (Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL)

How inference turns the delta back into a model

On the inference side, Fireworks does the reverse operation:

  1. Read the previous snapshot.
  2. Decompress the arc_v2 payload.
  3. Apply the delta to reconstruct the current tensor bytes.
  4. Verify the checksum.
  5. Hot-load the reconstructed weights into the running deployment.
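The reconstruction side can be sketched the same way. This assumes a hypothetical delta layout of the form `{"payload": ..., "tensor_checksums": ...}` and uses zlib as a stand-in for zstd; it mirrors steps 2-5 above, not the actual Fireworks code:

```python
import hashlib
import zlib

def apply_delta(prev: dict[str, bytes], delta: dict) -> dict[str, bytes]:
    """Decompress, apply, verify, and return the reconstructed tensors."""
    curr = {}
    for name, compressed in delta["payload"].items():
        diff = zlib.decompress(compressed)                        # step 2
        rebuilt = bytes(a ^ b for a, b in zip(prev[name], diff))  # step 3
        if hashlib.sha256(rebuilt).hexdigest() != delta["tensor_checksums"][name]:
            raise ValueError(f"checksum mismatch for tensor {name!r}")  # step 4
        curr[name] = rebuilt
    return curr  # step 5: these bytes are what gets hot-loaded
```

Because the checksum is computed over the reconstructed bytes, any corruption in transfer or in the delta chain fails loudly before the weights ever reach the serving stack.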

Because the deployment is built for hot loading, you are not doing a cold restart of the serving stack for every policy update. You are swapping weights on a running inference system that already knows where to fetch the snapshot, how to monitor readiness, and how to keep the rollout API stable.

The important point is architectural: if the sampler fleet can stay warm while policy updates arrive as compact deltas, then the fleet does not need to live next to the trainer.

Compression is only half the story

Compression alone is not enough. A usable cross-region RL system also needs orchestration:

  • shared object storage for trainer snapshots
  • a hot-load API that signals when a snapshot is ready
  • readiness reporting across replicas
  • OpenAI-compatible inference endpoints for rollouts
  • session-affinity headers so long trajectories keep their KV-cache efficiency
  • operational fallback via periodic full snapshots

That workflow matters. On the Fireworks platform, the trainer uploads a snapshot, signals hot load, waits for the deployment to report readiness, and then continues rollouts through the same inference APIs teams already use for production generation.
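The wait-for-readiness step in that workflow might be sketched like this; `get_replica_status` is an assumed helper, not a real Fireworks SDK call:

```python
import time

def wait_until_ready(replicas, get_replica_status, timeout_s=300.0, poll_s=2.0):
    """Poll every replica until all report the new snapshot as active."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if all(get_replica_status(r) == "READY" for r in replicas):
            return True   # safe to resume rollouts on the fresh policy
        time.sleep(poll_s)
    return False          # caller can fall back, e.g. to the last full snapshot
```

The trainer only continues generating rollouts once every replica reports the new snapshot, which is what keeps "fresh policy" a fleet-wide invariant rather than a per-replica accident.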

Another practical benefit is that Fireworks keeps training and rollout inference running asynchronously. The trainer can publish updates while the rollout side keeps admitting requests, preserving prompt-cache efficiency and maximizing time spent both serving inference and collecting on-policy data.

That matters because doing inference is hard. Doing inference for RL is harder. If a platform already gives you hot loads, admission continuity, cache reuse, and stable rollout serving, that is exactly the plumbing you want to offload.

The result is a much simpler RL loop than the traditional "bespoke training cluster plus custom sampler stack plus hand-managed weight propagation" setup.

That also creates a clean integration point for teams using a Tinker-style trainer stack: keep the trainer where it is, publish inference-ready snapshots, and let Fireworks handle rollout serving, hot-load updates, and cross-region orchestration.

Why this breaks the supercluster assumption

The one-supercluster view of RL usually comes from importing a pretraining mental model into a different workload. Each of the standard objections gets much weaker once you treat RL as a policy-handoff problem instead of a full-cluster-copy problem.

1. "The newest policy can't reach the sampler fleet fast enough"

That is true if "reaching the sampler fleet" means copying a full model checkpoint every time.

It is much less true if the fleet receives a compressed delta, reconstructs it locally, and hot-loads it without a full service restart. You are no longer paying full-checkpoint transfer costs on every step.

2. "Global inference capacity is useless because policy refresh is too expensive"

This is exactly the assumption arc_v2 attacks.

Most organizations do not have one perfectly contiguous giant cluster sitting idle for rollouts. They have capacity scattered across regions, with different supply, cost, and availability profiles. If policy refresh becomes lightweight enough, that global inference capacity becomes usable for RL.

Instead of waiting for one monolithic cluster, you can treat distributed rollout deployments as part of the same training system.

Policy Freshness Timeline: hot-load deltas keep the rollout fleet warm as policy versions advance (v41 → v42 → v43) with no restart gap, while full-copy restarts pause serving and insert visible gaps. Smaller updates matter because the fleet keeps sampling under a fresher policy, and that is what makes distributed rollout capacity usable for RL.

3. "Cross-region RL is too operationally fragile"

It is fragile if every update is a bespoke, error-prone dance.

It becomes far less fragile when the platform gives you:

  • a formal snapshot identity
  • explicit previous-snapshot lineage
  • checksum verification
  • readiness polling per replica
  • optional periodic base checkpoints for recovery

Make the update path explicit, typed, and verifiable, and "distributed" stops meaning "chaotic."

Put differently, the wrong mental model is:

one giant trainer cluster
  + one giant co-located sampler cluster
  + full model copy on every update
  = the only viable RL architecture

The better one is:

trainer publishes inference-ready snapshots
  -> base snapshot occasionally
  -> compact delta snapshots most of the time
  -> hot-load rollout deployments across regions
  -> sample through a standard inference API

That architecture does not eliminate all latency. It just eliminates the wrong latency. The trainer still does synchronous training where it needs to. But policy distribution to the rollout fleet becomes an incremental snapshot problem, which is far more tractable.

Fireworks' view

Giant co-located clusters can be the right tool for synchronous pretraining. That does not imply that RL rollouts must live on one giant co-located cluster.

If the trainer publishes inference-ready snapshots, most updates are compact deltas, and the inference fleet can hot-load those updates while staying online, then cross-region RL is practical.

The key question is not whether regions exist. It is whether your platform can make policy updates small, verifiable, and routine.

That is the design point behind Fireworks Training SDK: long-context training, full-parameter tuning, and cross-region RL rollouts on top of the same inference platform teams already use in production.

If you want to explore that setup, start with the Fireworks Training SDK introduction, which covers the Tinker-compatible control loop, checkpointing, weight sync, and training/sampling workflow.

If you want to talk through a rollout architecture or pair a Tinker-style trainer with Fireworks-hosted rollouts, email [email protected] or reach out on Discord.