
The hard part of reinforcement learning on a frontier model has never been the algorithm. It's the infrastructure that keeps training and inference numerically identical: zero KL divergence (KLD), end to end. At Fireworks, we have invested in that infrastructure for a long time, and we are now offering it as a managed service, starting with GLM 5.2.
For years, the club of teams getting reinforcement learning to actually work on giant models was limited to the top frontier labs. Not because they had better algorithms, but because they had the unglamorous infrastructure underneath: training and serving stacks engineered to produce the same numbers, run after run, request after request.
That infrastructure is built on concepts that people outside those labs have never had to think about, such as batch invariance and zero KLD across training and serving: making the rollout engine and the trainer produce the same numbers end to end. It sounds like plumbing, but it's the difference between an RL run that hill-climbs successfully and one that quietly falls apart.
The toolkit that used to be locked inside a handful of labs is now available as a managed service on Fireworks, starting with GLM 5.2, the OSS model sitting at the very top of independent community leaderboards. We are offering frontier specialized intelligence as a service: not a frontier you rent prompts from, but one you can post-train and reinforce on the same kind of infrastructure the top labs built in-house.
Reinforcement learning on an LLM is a loop: the model generates responses, those responses are scored, and the trainer nudges the weights. The whole thing rests on the assumption that the probability the trainer thinks the model gave each token matches the probability the serving engine actually used to generate it. When that holds, learning signal flows. When it doesn't, you're optimizing against noise.
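One way to make that assumption measurable is to compare, token by token, the log-probability the rollout engine reported with the log-probability the trainer recomputes for the same sampled token. The sketch below is a minimal illustration in plain Python, not Fireworks code: the function name `train_inference_kl` and the sample values are hypothetical, and it assumes both sides expose one log-probability per sampled token.

```python
def train_inference_kl(rollout_logprobs, trainer_logprobs):
    """Monte Carlo estimate of KL(rollout || trainer) over sampled tokens.

    Each list holds log p(token) for the same sampled tokens: one as reported
    by the rollout (serving) engine, one as recomputed by the trainer.
    """
    if len(rollout_logprobs) != len(trainer_logprobs):
        raise ValueError("need one log-probability per token from each side")
    if not rollout_logprobs:
        raise ValueError("need at least one token")
    # Tokens were sampled from the rollout policy, so the mean of
    # log p_rollout - log p_trainer estimates KL(rollout || trainer).
    diffs = [r - t for r, t in zip(rollout_logprobs, trainer_logprobs)]
    return sum(diffs) / len(diffs)


# Hypothetical values: the trainer agrees on every token in the first case
# and disagrees on one token in the second.
matched = [-0.25, -1.5, -0.125]
drifted = [-0.25, -1.75, -0.125]

kl_matched = train_inference_kl(matched, matched)
kl_drifted = train_inference_kl(matched, drifted)
print(kl_matched, kl_drifted)
```

When the two sides agree bit for bit, every difference is exactly 0.0 and so is the estimate. On a small sample this estimator can come out slightly negative, which is why it is usually averaged over many tokens.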
Holding that assumption on a modern frontier model is genuinely hard, and Fireworks gives you the tools the big labs built to do it:
These are exactly the pieces most platforms don't have, and their absence is why so many RL efforts stall.
Here is the same RL task, the GLM countdown reasoning task, run two ways. Both use the same algorithm and data. The only difference is the numerics underneath.
Without the Fireworks numerics stack, the trainer and the rollout engine disagree (train-inference KL around 0.013), and the run leans hard on the industry's usual crutch: importance sampling and clipping, which discard about 45% of every batch's tokens just to keep the run upright. It still isn't enough. Around step 20 the reward collapses, falling from around 0.9 to under 0.2 as the policy chases a target that no longer matches what it generated.
With the Fireworks stack, the trainer and serving engine run at zero KLD end to end, bit-for-bit identical, with zero tokens clipped, and reward stays healthy across the entire run. Same task, same algorithm. The only thing that changed was making the numbers agree.
| Run | Train-inference KL | Clipped tokens | Reward behavior |
|---|---|---|---|
| Without the Fireworks numerics stack | Around 0.013 and drifting | About 45% | Collapses around step 20, from around 0.9 to under 0.2 |
| With the Fireworks stack | 0 | 0% | Stays healthy across the full 25-step run |
That's the trap with the importance-sampling-and-clipping approach: it's a tax, not a fix. Every clipped token is learning signal thrown away, and past a point no amount of clipping saves a run whose numbers don't line up.
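To make the "tax" concrete, here is a rough sketch of how a clipped-token fraction can be counted. It is an illustration under stated assumptions, not the scheme used in the runs above: the source does not say which clipping rule was applied, so the symmetric band `[1 - eps, 1 + eps]`, the default `eps=0.2`, and the name `clipped_fraction` are all hypothetical.

```python
import math


def clipped_fraction(rollout_logprobs, trainer_logprobs, eps=0.2):
    """Fraction of tokens whose importance ratio leaves [1 - eps, 1 + eps].

    The ratio is p_trainer / p_rollout, computed from per-token
    log-probabilities. eps and the symmetric band are assumptions.
    """
    if len(rollout_logprobs) != len(trainer_logprobs) or not rollout_logprobs:
        raise ValueError("need matching, non-empty log-probability lists")
    ratios = [math.exp(t - r) for r, t in zip(rollout_logprobs, trainer_logprobs)]
    clipped = sum(1 for ratio in ratios if ratio < 1.0 - eps or ratio > 1.0 + eps)
    return clipped / len(ratios)


# Hypothetical per-token log-probabilities for four sampled tokens.
rollout = [-1.0, -1.0, -1.0, -1.0]
trainer = [-1.0, -1.5, -0.5, -1.05]

frac_aligned = clipped_fraction(rollout, rollout)  # every ratio is exactly 1.0
frac_drifted = clipped_fraction(rollout, trainer)  # two of four ratios leave the band
print(frac_aligned, frac_drifted)
```

With identical log-probabilities every ratio is exactly 1.0, so nothing is clipped. Production PPO-style objectives usually clip conditionally on the sign of the advantage, so this count is a simplification of what a real trainer reports.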
It comes down to a property of floating-point math that trips up almost everyone: addition isn't associative. (a + b) + c doesn't equal a + (b + c) down at the bit level, so the order in which a GPU adds numbers up changes the answer, usually in the last few digits, occasionally by enough to flip a token.
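The effect is easy to reproduce with ordinary double-precision floats in Python. The values below are just a convenient illustration; GPUs hit the same behavior in lower-precision formats.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # add a and b first
right = a + (b + c)  # add b and c first

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

The two results differ only in the last digit, which is the kind of discrepancy that stays invisible until two near-tied logits land on opposite sides of an argmax.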
A frontier MoE changes that order constantly, for reasons that have nothing to do with your request:
Stack those together and "temperature 0" on a busy server is quietly nondeterministic: the same prompt, co-batched with different traffic, comes back subtly and sometimes meaningfully different. That's the gap that silently turns an on-policy RL run off-policy.
Getting to numerics you can trust meant pinning every one of those decisions so it depends only on your request: a fixed reduction order in attention no matter the batch, one settled kernel choice for the expert matmuls regardless of token counts, a deterministic tie-break in the router, and a single fixed cross-GPU reduction path, all without giving up so much speed that the trainer becomes unusable for everyday SFT/DPO.
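A toy model of the "pin the reduction order" idea: the sketch below sums one row of numbers in chunks, the way a split reduction might, and compares a chunk size chosen from the batch size with a chunk size that is fixed. Everything here is hypothetical and deliberately simplified. `heuristic_chunk` stands in for whatever a real kernel scheduler does, and the input values are picked to make the rounding difference obvious.

```python
def chunked_sum(values, chunk):
    """Sum each chunk left to right, then add the partial sums left to right."""
    partials = []
    for start in range(0, len(values), chunk):
        total = 0.0
        for v in values[start:start + chunk]:
            total += v
        partials.append(total)
    result = 0.0
    for p in partials:
        result += p
    return result


def heuristic_chunk(batch_size):
    # Hypothetical scheduler: a lone request gets one big chunk,
    # a busy batch splits each row into smaller chunks.
    return 4 if batch_size == 1 else 2


def reduce_row(row, batch_size, pinned_chunk=None):
    """Reduce one row; the order depends on batch_size unless a chunk is pinned."""
    chunk = pinned_chunk if pinned_chunk is not None else heuristic_chunk(batch_size)
    return chunked_sum(row, chunk)


row = [1e16, 1.0, -1e16, 1.0]

# Same row, different co-batched traffic: the answer changes.
alone = reduce_row(row, batch_size=1)
busy = reduce_row(row, batch_size=8)

# Same row with the chunk size pinned: the answer depends only on the row.
pinned_alone = reduce_row(row, batch_size=1, pinned_chunk=2)
pinned_busy = reduce_row(row, batch_size=8, pinned_chunk=2)

print(alone, busy)                # 1.0 0.0
print(pinned_alone, pinned_busy)  # 0.0 0.0
```

The pinned result is not more "correct" than the other one. The point is only that it is the same result every time, which is what batch invariance asks for.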
We then aligned the trainer's own forward pass to that same serving definition. On the validated GLM 5.2 LoRA path, trainer and serving reach zero KLD: a train-inference generation KL of exactly 0, bit-for-bit identical. The served model also returns the same output at temperature 0 regardless of concurrency.
Tiny train-inference numerical disagreements aren't benign noise: they quietly turn on-policy RL off-policy and can independently cause a run to collapse. For related background on inference nondeterminism, see TML; for the RL failure mode, see this diagnosis. The clean answer isn't to correct the gap with importance sampling, which only piles on variance; it's to erase it. When training and inference are made bitwise consistent so the KLD is exactly 0, RL trains in fewer steps and reaches higher reward (vLLM x TorchTitan).
That's what Fireworks delivers on GLM 5.2: batch-invariant serving with zero-KLD train-inference alignment. The rollout engine returns bit-for-bit identical logits no matter the batch size, the concurrent load, or how many GPUs it's sharded across, and the trainer is held to that same bit-exact standard, so the full train-rollout loop runs at zero KLD. This is the guarantee frontier labs build in-house, and it's genuinely hard to hold across different engines, kernels, and parallelism layouts. Determinism across tensor-parallel sizes alone was an open research problem as recently as this year (TBIK, ICML 2026).
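"Bit-for-bit identical" is a stricter test than numerical equality or a tolerance check. As a rough sketch of the comparison, assuming logits have already been pulled into plain Python floats (real logits would be lower-precision tensors, and the helper names here are hypothetical), the raw IEEE 754 bit patterns can be compared directly:

```python
import struct


def float_bits(x):
    """Return the raw IEEE 754 binary64 bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bitwise_identical(logits_a, logits_b):
    """True only if both sequences have the same length and identical bits."""
    if len(logits_a) != len(logits_b):
        return False
    return all(float_bits(a) == float_bits(b) for a, b in zip(logits_a, logits_b))


# Hypothetical logits from two runs of the same request.
reference = [0.6, -1.25, 0.0]
same_run = [0.6, -1.25, 0.0]
last_bit_off = [0.6000000000000001, -1.25, 0.0]
signed_zero = [0.6, -1.25, -0.0]

same_ok = bitwise_identical(reference, same_run)
drift_ok = bitwise_identical(reference, last_bit_off)
zero_ok = bitwise_identical(reference, signed_zero)
print(same_ok, drift_ok, zero_ok)  # True False False
```

The last case is the instructive one: `0.0 == -0.0` is true in Python, yet the bit patterns differ, so a bitwise check rejects what an equality check would accept.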
The catch everyone else hits is speed: open-source deterministic modes typically run 35-60% slower (SGLang). Fireworks pays virtually none of that tax. The GLM trainer holds around 3,500 tokens/sec per node, on par with the OSS TileLang implementation, and layers the zero-KLD numerics on top rather than trading speed for them. One stack gives the SFT/DPO majority full throughput and gives RL teams zero-KLD numerics: frontier-lab infrastructure, delivered as a managed service.
The frontier is open now. Anyone can download a state-of-the-art model. What's still scarce is the ability to do reinforcement learning on one of these models correctly: a loop whose numbers line up well enough to converge, results you can reproduce, and a trainer fast enough to iterate on. That used to live only inside the biggest labs.
That's the work behind GLM 5.1 and GLM 5.2 on Fireworks: bitwise zero-KLD numerics for RL and raw speed for SFT/DPO, in one stack. It's exactly the infrastructure that used to require a frontier lab's in-house systems team, now available as a managed service with Fireworks. Frontier weights are the raw material. Numerics you can trust - batch invariance and zero-KLD train/serve consistency - are what turn them into your model.
In the spirit of the legendary NFL coach Vince Lombardi:
"Practice does not make perfect. Only perfect practice makes perfect."
That's the essence of frontier specialized intelligence as a service: a perfect RL training system that empowers AI natives and modern enterprises to own their AI future.
To get started, please reach out to our training team today, or dive straight in with our managed training docs.