Frontier-lab training infrastructure, now as a service

The hard part of reinforcement learning on a frontier model has never been the algorithm. It's the infrastructure that keeps training and inference numerically identical: zero KL divergence (KLD) between them, end to end. At Fireworks, we have long invested in solving this problem, and we are now offering the result as a managed service, starting with GLM 5.2.

For years, the club of teams getting reinforcement learning to actually work on giant models was limited to the top frontier labs. Not because they had better algorithms, but because they had the unglamorous infrastructure underneath: training and serving stacks engineered to produce the same numbers, run after run, request after request.

That infrastructure is built on concepts people outside those labs have rarely had to think about, such as batch invariance and zero KLD across training and serving: making the rollout engine and the trainer produce the same numbers end to end. It sounds like plumbing, but it's the difference between an RL run that hill-climbs successfully and one that quietly falls apart.

The toolkit that used to be locked inside a handful of labs is now available as a managed service on Fireworks for GLM 5.2, the OSS model sitting at the very top of independent community leaderboards. We are offering frontier specialized intelligence as a service: not a frontier you rent prompts from, but one you can post-train and reinforce on the same kind of infrastructure the top labs built in-house.

The features that used to be frontier-lab-only

Reinforcement learning on an LLM is a loop: the model generates responses, those responses are scored, and the trainer nudges the weights. The whole thing rests on the assumption that the probability the trainer thinks the model gave each token matches the probability the serving engine actually used to generate it. When that holds, learning signal flows. When it doesn't, you're optimizing against noise.
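
One way to make that assumption measurable is to compare, token by token, the log-probability the rollout engine reported with the one the trainer recomputes. The sketch below is illustrative only: the function name and the sample values are hypothetical, and it uses the common sampled-token estimate of KL rather than any metric Fireworks is confirmed to report.

```python
import math

def train_inference_kl(rollout_logprobs, trainer_logprobs):
    """Estimate KL(rollout || trainer) from the sampled tokens' log-probs.

    Both arguments hold per-token log-probabilities of the tokens the
    rollout engine actually sampled. Averaging the differences gives a
    simple Monte Carlo estimate of the divergence.
    """
    if len(rollout_logprobs) != len(trainer_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not rollout_logprobs:
        raise ValueError("need at least one token")
    diffs = [r - t for r, t in zip(rollout_logprobs, trainer_logprobs)]
    return sum(diffs) / len(diffs)

# Hypothetical log-probs for a 4-token response.
rollout = [math.log(0.50), math.log(0.25), math.log(0.80), math.log(0.10)]
aligned = list(rollout)                    # trainer agrees exactly
drifted = [lp - 0.01 for lp in rollout]    # trainer is slightly off on every token

kl_aligned = train_inference_kl(rollout, aligned)   # exactly 0.0
kl_drifted = train_inference_kl(rollout, drifted)   # roughly 0.01
```

When the two stacks agree bit for bit, every difference is exactly zero and so is the estimate. Any nonzero value means the trainer is scoring a slightly different policy than the one that generated the data.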

Holding that assumption on a modern frontier model is genuinely hard, and Fireworks gives you the tools the big labs built to do it:

•Batch invariance for large MoEs - a request returns the same result no matter what other traffic happens to share its batch. Without it, an "on-policy" run is quietly off-policy: the rollout was generated by a subtly different model than the one you're updating, just because the server was busy. The building blocks are increasingly public: open-source engines ship batch-invariant kernels for smaller, dense models (vLLM, SGLang), and DeepSeek's DeepGEMM, the kernel library behind DeepSeek-V4, provides batch-invariant grouped-GEMM kernels for MoEs. DeepGEMM replaces cuBLAS end to end and drops the split-K trick precisely because split-K breaks invariance. But a pile of batch-invariant kernels is not a batch-invariant system. True end-to-end invariance means every reduction - attention, the MoE router, the expert GEMMs, and the multi-rank all-reduce stack - stays consistent together, under real production load. Delivering that whole-system guarantee for a frontier MoE like GLM 5.2 as a managed service is, to our knowledge, a first in the industry.
•Zero-KLD train/serve alignment - for a model like GLM 5.2, the usual patches don't reach far enough. The popular Mixture-of-Experts fix, router replay (replaying the serving engine's expert choices inside the trainer), handles which experts fire, but it can't touch the other place these models diverge: which tokens the sparse-attention indexer selects. Those selections aren't tractable to replay, so there's no halfway house here. The generation engine, the prompt-reading path, and the trainer have to share one numerical definition, so what you train is exactly what you served: zero KLD, end to end.
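
To make the batch-invariance property concrete, here is a toy sketch in plain Python. Everything in it is hypothetical: the two "kernels" are stand-ins that only mimic how a real GPU kernel might change its reduction split when the batch grows, and the input values are chosen so that the rounding difference is visible.

```python
def ordered_sum(values):
    """Add values strictly left to right (avoids any compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total

def chunked_sum(values, chunk):
    """Sum `values` in chunks of size `chunk`, then add the partial sums."""
    partials = [ordered_sum(values[i:i + chunk])
                for i in range(0, len(values), chunk)]
    return ordered_sum(partials)

def batch_dependent_kernel(batch):
    """Toy kernel whose reduction split depends on how busy the batch is."""
    chunk = 4 if len(batch) == 1 else 2
    return [chunked_sum(row, chunk) for row in batch]

def batch_invariant_kernel(batch):
    """Toy kernel that always uses the same split, whatever the batch size."""
    return [chunked_sum(row, 2) for row in batch]

def is_batch_invariant(kernel, request, other_traffic):
    """True if `request` gets the same result alone and when co-batched."""
    alone = kernel([request])[0]
    shared = kernel([request] + other_traffic)[0]
    return alone == shared

request = [1e16, 1.0, -1e16, 1.0]     # values picked to expose rounding
traffic = [[1.0, 2.0, 3.0, 4.0]]      # unrelated request sharing the batch

print(is_batch_invariant(batch_dependent_kernel, request, traffic))  # False
print(is_batch_invariant(batch_invariant_kernel, request, traffic))  # True
```

The request's own data never changes between the two calls. Only the split of the reduction does, and that is enough to change the answer.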

Figure: Zero-KLD train/serve loop. The same request flows through serving prefill and generation, then through the trainer path. The two streams converge when served logits and trainer logprobs match at KLD = 0.

These are exactly the pieces most platforms don't have, and their absence is why so many RL efforts stall.

Validations

Here is the same RL task, the GLM countdown reasoning task, run two ways. Both use the same algorithm and data. The only difference is the numerics underneath.

Figure: Validation runs, same RL task with different numerics. The same GLM countdown reasoning task behaves differently when the trainer and rollout engine disagree: reward collapses and clipping throws away learning signal. With zero-KLD numerics, the loop stays on-policy.

Without the Fireworks numerics stack, the trainer and the rollout engine disagree (train-inference KL around 0.013), and the run leans hard on the industry's usual crutch: importance sampling and clipping, which discard about 45% of every batch's tokens just to keep the run upright. It still isn't enough. Around step 20 the reward collapses, falling from around 0.9 to under 0.2 as the policy chases a target that no longer matches what it generated.

With the Fireworks stack, the trainer and serving engine run at zero KLD, end to end - bit-for-bit identical - with zero tokens clipped, and reward stays healthy across the entire run. Same task, same algorithm. The only thing that changed was making the numbers agree.

Run	Train-inference KL	Clipped tokens	Reward behavior
Without the Fireworks numerics stack	Around 0.013 and drifting	About 45%	Collapses around step 20, from around 0.9 to under 0.2
With the Fireworks stack	0	0%	Stays healthy across the full 25-step run

That's the trap with the importance-sampling-and-clipping approach: it's a tax, not a fix. Every clipped token is learning signal thrown away, and past a point no amount of clipping saves a run whose numbers don't line up.
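
As a rough illustration of where the clipped tokens come from, the sketch below counts tokens whose importance ratio falls outside a fixed band. It is a simplification under stated assumptions: the function name, the sample log-probs, and the `eps=0.2` band are hypothetical, and real PPO-style objectives also take the sign of the advantage into account. The source does not say which clipping rule the validation runs used.

```python
import math

def clipped_fraction(rollout_logprobs, trainer_logprobs, eps=0.2):
    """Fraction of tokens whose ratio pi_trainer / pi_rollout leaves [1-eps, 1+eps]."""
    if len(rollout_logprobs) != len(trainer_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    clipped = 0
    for r, t in zip(rollout_logprobs, trainer_logprobs):
        ratio = math.exp(t - r)   # importance ratio for this token
        if ratio < 1.0 - eps or ratio > 1.0 + eps:
            clipped += 1
    return clipped / len(rollout_logprobs)

# Hypothetical per-token log-probs for one short response.
rollout = [-1.0, -2.0, -0.5, -3.0]
mismatched_trainer = [-1.0, -2.5, -0.5, -2.0]   # disagrees on two tokens
aligned_trainer = list(rollout)                  # bit-identical to the rollout

frac_mismatched = clipped_fraction(rollout, mismatched_trainer)  # 0.5
frac_aligned = clipped_fraction(rollout, aligned_trainer)        # 0.0
```

With identical log-probs every ratio is exactly 1, so nothing is clipped and no signal is discarded.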

How we did it, and why it's hard

It comes down to a property of floating-point math that trips up almost everyone: addition isn't associative. (a + b) + c doesn't always equal a + (b + c) down at the bit level, so the order in which a GPU adds numbers up changes the answer, usually in the last few digits, occasionally by enough to flip a token.
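
The effect is easy to reproduce with ordinary double-precision floats. This small sketch uses plain Python rather than GPU code; the helper name `ordered_sum` is only illustrative.

```python
# The same three numbers, grouped two ways.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False
print(left, right)     # 0.6000000000000001 0.6

def ordered_sum(values):
    """Add values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three numbers, accumulated in two different orders.
forward = ordered_sum([1e16, 1.0, -1e16])    # the 1.0 is absorbed by rounding
reordered = ordered_sum([1e16, -1e16, 1.0])  # the 1.0 survives
print(forward, reordered)                    # 0.0 1.0
```

The first pair differs only in the last digit, while the second pair differs completely, which is the "occasionally by enough to flip a token" case.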

A frontier MoE changes that order constantly, for reasons that have nothing to do with your request:

•Latent attention (MLA), the compression trick that makes GLM cheap at long context, splits its reduction into chunks across the GPU. The chunk boundaries shift with whatever sequence lengths happen to share the batch, so the same query can accumulate its attention in a different order from one moment to the next.
•The sparse indexer that decides which past tokens each query even looks at can hand back the same set in a different order. Because the attention sum follows that order, the result drifts.
•Each expert's matmul runs a different kernel and tiling depending on how many tokens that expert drew this step, which depends on everyone else's tokens, not just yours.
•The router can land on a near-tie between two experts; a rounding-error-sized wobble flips which one fires, and the token's entire computation changes with it.
•Across GPUs, the all-reduce that stitches partial sums back together switches algorithms by message size, which again rides on load.

Stack those together and "temperature 0" on a busy server is quietly nondeterministic: the same prompt, co-batched with different traffic, comes back subtly and sometimes meaningfully different. That's the gap that silently turns an on-policy RL run off-policy.

Getting to numerics you can trust meant pinning every one of those decisions so it depends only on your request: a fixed reduction order in attention no matter the batch, one settled kernel choice for the expert matmuls regardless of token counts, a deterministic tie-break in the router, and a single fixed cross-GPU reduction path, all without giving up so much speed that the trainer becomes unusable for everyday SFT/DPO.
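
Two of those pinned decisions can be sketched in a few lines. This is a toy illustration, not the Fireworks implementation: the function names are hypothetical, and the tie-break rule (lowest expert index wins) and the canonical order (ascending token index) are assumptions chosen for the example.

```python
def top_k_experts(scores, k):
    """Pick the k highest-scoring experts; exact ties go to the lower index."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return ranked[:k]

def canonical_sum(indices, values):
    """Add values in ascending index order, whatever order they arrive in."""
    total = 0.0
    for _, v in sorted(zip(indices, values)):
        total += v
    return total

def arrival_order_sum(values):
    """Add values in the order they arrive (order-sensitive baseline)."""
    total = 0.0
    for v in values:
        total += v
    return total

# Router: experts 1 and 2 tie exactly, and the choice is still well defined.
print(top_k_experts([0.3, 0.5, 0.5, 0.1], 2))   # [1, 2]

# Indexer: the same selected set handed back in two different orders.
first = ([0, 1, 2], [1e16, 1.0, -1e16])
second = ([2, 0, 1], [-1e16, 1e16, 1.0])

print(arrival_order_sum(first[1]), arrival_order_sum(second[1]))  # 0.0 1.0
print(canonical_sum(*first), canonical_sum(*second))              # 0.0 0.0
```

Note that an index tie-break only settles exact ties. Near-ties stay stable only if the scores feeding the router are themselves computed identically, which is why the fixes have to hold together across the whole stack.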

We then aligned the trainer's own forward pass to that same serving definition. On the validated GLM 5.2 LoRA path, trainer and serving reach zero KLD: a train-inference generation KL of exactly 0, bit-for-bit identical. The served model also returns the same output at temperature 0 regardless of concurrency.

Zero KLD, end to end

Tiny train-inference numerical disagreements aren't benign noise: they quietly turn on-policy RL off-policy and can independently cause a run to collapse. For related background on inference nondeterminism, see TML; for the RL failure mode, see this diagnosis. The clean answer isn't to correct the gap with importance sampling, which only piles on variance; it's to erase it. When training and inference are made bitwise consistent so the KLD is exactly 0, RL trains in fewer steps and reaches higher reward (vLLM x TorchTitan).

That's what Fireworks delivers on GLM 5.2: batch-invariant serving with zero-KLD train-inference alignment. The rollout engine returns bit-for-bit identical logits no matter the batch size, the concurrent load, or how many GPUs it's sharded across, and the trainer is held to that same bit-exact standard, so the full train-rollout loop runs at zero KLD. This is the guarantee frontier labs build in-house, and it's genuinely hard to hold across different engines, kernels, and parallelism layouts. Determinism across tensor-parallel sizes alone was an open research problem as recently as this year (TBIK, ICML 2026).
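
"Bit-for-bit identical" is a stricter test than numerical equality or closeness. A minimal sketch of such a check follows, assuming the logits have already been collected as Python floats. The helper names are hypothetical, and real logits would normally be lower-precision tensors compared with a framework's own utilities.

```python
import struct

def float_bits(x):
    """Return the raw 64-bit pattern of a Python float as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bitwise_equal(a, b):
    """True only if both sequences hold bit-identical floats."""
    return len(a) == len(b) and all(
        float_bits(x) == float_bits(y) for x, y in zip(a, b)
    )

logits_run_1 = [0.6, -1.25, 3.0]
logits_run_2 = [0.6, -1.25, 3.0]
logits_run_3 = [0.6000000000000001, -1.25, 3.0]   # off by one unit in the last place

print(bitwise_equal(logits_run_1, logits_run_2))  # True
print(bitwise_equal(logits_run_1, logits_run_3))  # False
print(0.0 == -0.0, bitwise_equal([0.0], [-0.0]))  # True False
```

A tolerance-based comparison would accept the third run. A bitwise comparison does not, and that is the standard the zero-KLD claim implies.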

The catch everyone else hits is speed: open-source deterministic modes typically run 35-60% slower (SGLang). Fireworks pays virtually none of that tax. The GLM trainer holds around 3,500 tokens/sec per node, on par with the OSS TileLang implementation, and layers the zero-KLD numerics on top rather than trading speed for them. One stack gives the SFT/DPO majority full throughput and gives RL teams zero-KLD numerics: frontier-lab infrastructure, delivered as a managed service.

What you get on Fireworks today

•A frontier model, RL-ready. GLM 5.2 is live for fine-tuning through the Fireworks Training API, with the full numerical foundation carried forward from GLM 5.1. The validated training shape is public today.
•The methods that matter. SFT, DPO, and RL through the Training API; SFT and DPO on managed training. On-policy RL where the trainer and serving engine genuinely agree so your signal is real learning, not drift you clipped away.
•Reproducible, auditable inference. Temperature-0 requests return the same answer regardless of server load so you get trustworthy evals, meaningful regression tests, and the reproducibility enterprise compliance demands.
•Fast where it counts. A trainer tuned for SFT/DPO throughput, rollouts that generate about 1.8x faster on GLM 5.2 than GLM 5.1 (around 5,000 tokens/sec per node, per promotion-CI), and zero-KLD train-serve numerics, so RL is genuinely on-policy.
•Managed, self-service, and co-located. Run the full loop on managed infrastructure with trainer and deployment co-located for fast weight sync, or drive longer runs yourself through the API.

Why it matters: frontier specialized intelligence as a service

The frontier is open now. Anyone can download a state-of-the-art model. What's still scarce is the ability to do reinforcement learning on one of these models correctly: a loop whose numbers line up well enough to converge, results you can reproduce, and a trainer fast enough to iterate on. That used to live only inside the biggest labs.

That's the work behind GLM 5.1 and GLM 5.2 on Fireworks: bitwise zero-KLD numerics for RL and raw speed for SFT/DPO, in one stack. It's exactly the infrastructure that used to require a frontier lab's in-house systems team, now available as a managed service with Fireworks. Frontier weights are the raw material. Numerics you can trust - batch invariance and zero-KLD train/serve consistency - are what turn them into your model.

In the spirit of the legendary NFL coach Vince Lombardi:

"Practice does not make perfect. Only perfect practice makes perfect."

That's the essence of frontier specialized intelligence as a service: a perfect RL training system that empowers AI natives and modern enterprises to own their AI future.

To get started, please reach out to our training team today, or dive straight in with our managed training docs.
