Notes on DeepSeek-V4's training system

Notes on DeepSeek-V4: hybrid attention, anticipatory routing, reasoning modes, generative reward models, on-policy distillation, and what they imply for programmable training infrastructure.

DeepSeek-V4 is interesting less for any single benchmark number than for the shape of the system around it. The paper shows architecture, routing, reward modeling, reasoning modes, distillation, and agent execution all becoming part of the training loop.

The useful takeaway for training infrastructure is obvious: fixed recipes are not enough. Researchers increasingly need programmable loops, while the platform handles distributed execution, inference integration, checkpointing, and scaling underneath. Supporting that flexibility is the core design principle behind the Fireworks Training API.

Six ingredients in DeepSeek-V4's training recipe

1. Long context is becoming a memory hierarchy

DeepSeek-V4 alternates between Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses KV entries and then does sparse top-k selection. HCA compresses more aggressively, but keeps dense attention over the compressed memory.

The point is not just "longer context." It is model/runtime co-design: attention pattern, KV layout, precision, sparse selection, and inference kernels all have to line up. Training has the same problem. If serving uses custom kernels and compressed caches, evaluation during training needs to be close enough to serving that we are not optimizing against the wrong system.

Training platform design note: This is where training shapes, checkpoint promotion, and weight sync into deployments become relevant. Architecture-specific work is not just a loss function; the platform has to launch the right trainer, save usable checkpoints, and evaluate the same model/runtime combination that will serve users.

Figure: one query, two cache contracts. Both layer types keep recent tokens exact; they differ in how older history is compressed and read. CSA keeps more old blocks at moderate compression (m tokens per block), and an indexer picks the top-k blocks each query actually reads, so each query sees a top-k subset of the old blocks plus the uncompressed recent tail. HCA keeps fewer, heavily compressed blocks (m' >> m tokens per block), small enough to read densely, so each query sees all compact blocks plus the same exact recent tail. Training evals have to reproduce this cache contract, not just load the same weights.
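
To make the contract concrete, here is a minimal sketch of the two read paths. The block layout and the mean-pooled indexer score are toy assumptions for illustration, not DeepSeek's kernels:

```python
import numpy as np

def csa_read(query, old_blocks, recent_tail, k):
    # CSA: score each moderately compressed block against the query
    # (a mean-pooled dot product stands in for the indexer), keep the
    # top-k, and attend to those plus the exact recent tail.
    scores = np.array([block.mean(axis=0) @ query for block in old_blocks])
    selected = [old_blocks[i] for i in np.argsort(scores)[-k:]]
    return np.concatenate(selected + [recent_tail])

def hca_read(query, compact_blocks, recent_tail):
    # HCA: the heavily compressed old cache is small enough that every
    # query reads all compact blocks densely, plus the exact recent tail.
    return np.concatenate(list(compact_blocks) + [recent_tail])
```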

2. MoE routing is a stability problem

The most interesting pretraining trick is Anticipatory Routing. DeepSeek reports that loss spikes were tied to MoE outliers and routing. Their fix decouples features from routes: at step t, features are computed with current weights, but routing indices come from older weights, theta_{t - delta}.

To avoid running the model twice, they prefetch a future batch, compute its routing decisions early with the older router, cache those routes, and reuse them later. They report about 20% overhead while this mode is active, and only turn it on after a spike detector triggers rollback.

This is not a clean new objective. It is a conditional runtime intervention: detect instability, roll back, change routing behavior, cache side-channel data, then return to normal training.

Training platform design note: Fireworks has adjacent primitives in its rollout/training stack: rollout sampling can return per-token logprobs, MoE rollout paths can carry routing metadata such as routing_matrices, and training datums can carry model inputs plus side-channel fields. That is not DeepSeek's full historical-router system, but it points in the same direction: routing decisions sometimes need to become data in the training loop.

Figure: fresh features, old routes. In a prefetch phase, the older router theta(t-delta) looks ahead at batch B_t and decides which expert each token should go to; those routes are cached, keyed by batch id, until step t. At step t, the same tokens are trained with fresh activations from the current weights theta(t), and MoE dispatch sends those current activations to the cached experts. Only the routes are stale; the forward/backward pass still trains the current model.
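
A minimal sketch of the cache mechanics, using hypothetical names (old_router, model.backbone, model.moe_dispatch); the spike detector and rollback that gate this mode in the paper are omitted:

```python
import torch

route_cache = {}  # batch_id -> cached expert ids, kept until the batch trains

def prefetch_routes(old_router, batch_id, tokens):
    # Look ahead: run only the older router theta(t - delta) on a future
    # batch and cache its decisions, so the model never runs twice per step.
    with torch.no_grad():
        router_logits = old_router(tokens)  # [num_tokens, num_experts]
        route_cache[batch_id] = router_logits.argmax(dim=-1)

def train_step(model, batch_id, tokens):
    # Features come from the current weights theta(t); only the routes are
    # stale. The forward/backward pass still trains the current model.
    expert_ids = route_cache.pop(batch_id)
    features = model.backbone(tokens)
    return model.moe_dispatch(features, expert_ids=expert_ids)
```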

3. Reasoning effort is trained behavior

DeepSeek-V4 exposes three modes from the same weights: Non-think, Think High, and Think Max. These are trained with different RL configurations, length penalties, context windows, and response formats. Think Max also gets an explicit system instruction pushing exhaustive reasoning.

This makes "reasoning effort" less mysterious. It is not just a runtime flag; it is a behavior contract backed by data, reward design, formatting, and evaluation.

Training platform design note: A programmable loop can treat modes as training conditions: vary prompt format, response template, sampling budget, reward shaping, loss weights, and eval criteria per phase. Rollout inference plus checkpoint hotloading then lets you test whether the modes actually diverge behaviorally.

Figure: one model, three trained behaviors. The figure walks the same question, "What is 17 × 23?", through each mode. Non-think is asked to answer directly, rewarded for being short and correct, and graded on the final answer ("391."). Think High is asked to reason step by step, rewarded for clear, correct steps, and graded on steps and answer. Think Max is asked to reason exhaustively, rewarded for depth and rigor, and graded on the whole reasoning trace. Reasoning effort is not a runtime flag; it is behavior the model was trained, separately, to produce.
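
As a sketch of "modes as training conditions," each mode could be a phase config like the one below. The field names and values are invented for illustration, not DeepSeek's recipe:

```python
# Hypothetical per-mode training conditions for a programmable loop.
MODES = {
    "non_think": dict(
        system_prompt=None,
        response_template="{answer}",
        max_response_tokens=512,
        length_penalty=0.5,            # reward short, correct answers
        grader="final_answer_match",
    ),
    "think_high": dict(
        system_prompt="Reason step by step before answering.",
        response_template="<think>{reasoning}</think>{answer}",
        max_response_tokens=8192,
        length_penalty=0.1,
        grader="steps_and_answer",
    ),
    "think_max": dict(
        system_prompt="Reason exhaustively; explore alternatives first.",
        response_template="<think>{reasoning}</think>{answer}",
        max_response_tokens=65536,
        length_penalty=0.0,            # depth and rigor, not brevity
        grader="full_trace_rubric",
    ),
}
```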

4. Reward modeling becomes generative

For hard-to-verify tasks, DeepSeek says it discards conventional scalar reward models and uses a Generative Reward Model (GRM). The actor model itself acts as the judge, and RL optimizes that evaluative behavior alongside generation.

This matters because many valuable tasks are not easy to score with exact-match checks: writing, design, tool use, research synthesis, subjective quality, and long-horizon agent behavior. In those cases, evaluation looks more like deliberation than scalar prediction.

There are obvious risks: self-judging models can learn blind spots or reward hacking. But the direction is important. Reward is moving from a separately trained scalar reward model toward model-generated evaluative behavior optimized with RL.

Training platform design note: The relevant abstraction is a sample -> judge -> train loop. Generate candidates, call a verifier/tool/evaluator/model-as-judge, turn feedback into reward or custom loss, then update the policy. The platform should not need to know in advance whether feedback came from a unit test, rubric, human label, or generative judge.

Figure: from a separate score to model judgment. The old way grades a response with a separate scalar reward model, a small extra model trained to output one number with no reasoning attached; fine for math, where answers are right or wrong, but brittle for writing, design, or tool use. The new way has the actor model itself judge trajectories against task-specific rubrics, with generation and evaluation optimized together by RL and no separate scalar reward model. The platform boundary is the same either way: sample trajectories, evaluate them, turn the signal into an update.
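
A minimal shape for that loop; policy, judge, and the group-relative baseline are illustrative stand-ins, not a specific platform API:

```python
def sample_judge_train_step(policy, judge, prompts, n_samples=8):
    # Generate candidates, score them, turn feedback into advantages,
    # and update the policy. The judge could be a unit test, a rubric,
    # a human label, or the actor itself acting as a generative reward
    # model; this loop does not need to know which.
    for prompt in prompts:
        candidates = [policy.sample(prompt) for _ in range(n_samples)]
        rewards = [judge.score(prompt, c) for c in candidates]
        baseline = sum(rewards) / len(rewards)   # group-relative baseline
        advantages = [r - baseline for r in rewards]
        policy.update(prompt, candidates, advantages)
```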

5. Specialists first, then on-policy distillation

DeepSeek does not train the final model with one mixed RL stage. It first trains domain specialists. For each domain, it starts with supervised fine-tuning on high-quality domain data, then runs GRPO with domain-specific prompts and reward signals.

Those specialists are then merged through On-Policy Distillation (OPD). The student samples its own rollouts, and more than ten teacher models provide target distributions on those trajectories. This avoids directly merging weights or hoping one composite reward can produce every behavior at once.

The expensive part is full-vocabulary logit distillation. Instead of matching only sampled tokens, DeepSeek matches the full teacher distribution. To make that feasible, it caches teacher last-layer hidden states and reconstructs logits on the fly with the relevant teacher head.
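
A sketch of that caching trick, with assumed names. The payoff is that last-layer hidden states are d_model wide, far smaller than [seq_len, vocab] logits, so they are cheap to keep around:

```python
import torch

hidden_cache = {}  # (teacher_id, rollout_id) -> [seq_len, d_model]

def cache_teacher_hidden(teacher, teacher_id, rollout_id, tokens):
    # Run the teacher backbone once per rollout; keep only hidden states.
    with torch.no_grad():
        hidden_cache[(teacher_id, rollout_id)] = teacher.backbone(tokens)

def teacher_full_logits(teacher_head, teacher_id, rollout_id):
    # Reconstruct full-vocabulary logits on the fly with the matching head.
    h = hidden_cache[(teacher_id, rollout_id)]
    return h @ teacher_head.weight.T  # [seq_len, vocab]
```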

Training platform design note: The concrete loop is rollout sampling, teacher/reference forward passes, custom loss computation, optimizer steps, and checkpoint/weight sync back into deployment for evaluation. DeepSeek-style full-vocab OPD would need extra primitives for full logits or hidden-state streaming, but the outer loop is exactly the kind of thing a programmable training API should express.

Figure: many experts review the student's own draft. Trained specialists (a math expert, a code expert, an agent expert) each predict what should come next at every position of an answer the student itself wrote, and the student updates toward the blend of their full next-token distributions. Corrections land on the student's actual mistakes, not on a teacher's perfect example.
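
The objective itself can be sketched as a full-vocabulary KL between the student and its teachers on the student's own rollout. How the ten-plus teachers are combined is not specified above, so the uniform mixture below is an assumption:

```python
import torch.nn.functional as F

def opd_loss(student_logits, teacher_logits_list):
    # student_logits: [seq_len, vocab] on the student's own rollout.
    # teacher_logits_list: per-teacher [seq_len, vocab] on that same
    # trajectory, reconstructed from cached hidden states as above.
    teacher_probs = sum(F.softmax(t, dim=-1) for t in teacher_logits_list)
    teacher_probs = teacher_probs / len(teacher_logits_list)  # uniform mix
    log_p_student = F.log_softmax(student_logits, dim=-1)
    # Match the full teacher distribution, not just the sampled tokens:
    # KL(teacher mixture || student), averaged over positions.
    return F.kl_div(log_p_student, teacher_probs, reduction="batchmean")
```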

6. Agentic training pulls runtime into the loop

DeepSeek-V4 also changes how reasoning state is handled in tool-use conversations. For agentic workflows, it preserves reasoning traces across user/tool boundaries instead of flushing them after every user message. For ordinary chat, it still discards prior reasoning to keep context concise.

They also introduce Quick Instruction: special tokens appended to the main context for auxiliary decisions like search/no-search, query generation, domain classification, and URL reading. Instead of calling a separate controller model and paying redundant prefill, these decisions reuse the main model's KV cache.

Finally, agentic training needs sandboxes. DeepSeek describes a preemptible rollout service with token-granular write-ahead logs, and DeepSeek Elastic Compute (DSec) for sandbox execution. DSec keeps ordered trajectory logs of commands and results, which makes tool execution part of the training/evaluation record.
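
A toy version of a durable, ordered trajectory log; the record schema is invented, logging per command/result rather than per token for brevity, and DSec's actual write-ahead format is not public:

```python
import json
import time

class TrajectoryLog:
    # Append-only log: every command and result is written in order before
    # the rollout proceeds, so a preempted rollout can be replayed from it.
    def __init__(self, path):
        self.f = open(path, "a")

    def append(self, kind, payload):
        record = {"ts": time.time(), "kind": kind, "payload": payload}
        self.f.write(json.dumps(record) + "\n")
        self.f.flush()

log = TrajectoryLog("rollout_0.wal")
log.append("command", {"tool": "search", "args": {"q": "eiffel tower height"}})
log.append("result", {"text": "330 m"})
```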

Training platform design note: For agentic RL, rollout execution is part of the learning system. Sampling, tool/evaluator integration, durable trajectory logs, checkpointing, and replayable execution environments matter as much as the optimizer.

Figure: ask the same model instead of a new one. Side decisions like "should I search?" do not need a separate controller. The old way sends the whole conversation to a second model, which re-reads everything the main model already processed before answering; the new way appends one special token such as <search?> and lets the main model continue from its existing KV cache, so the conversation is read once. Agentic RL needs more than optimizer code: quick decisions, tool execution, durable trajectory logs, replayable sandboxes, and reward evaluation all sit in the loop.
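
At decode time, the pattern might look like the sketch below; the generate interface and the <search?> token are assumptions for illustration:

```python
def quick_decision(model, kv_cache, decision_token="<search?>"):
    # The conversation is already encoded in kv_cache. Instead of prefilling
    # a separate controller with the same tokens, append one special token
    # and let the main model answer the side question from its own cache.
    answer, kv_cache = model.generate(
        input_tokens=[decision_token],  # one token appended, nothing re-read
        past_key_values=kv_cache,       # reuse the main model's KV cache
        max_new_tokens=4,               # e.g. "yes" / "no"
    )
    return answer, kv_cache
```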

What this adds up to

The through-line across these six pieces of DeepSeek-V4 is the same: important training decisions now reach into runtime, evaluation, and infrastructure. Architecture co-designs with serving, routing becomes a stability problem with side-channel state, and reasoning effort, reward, distillation, and tool use all turn into programmable loops over rollouts, judges, checkpoints, and execution logs.

That is the bet behind the Fireworks Training API: expose the loop, and handle the system underneath. Full-vocabulary OPD, historical-router caches, and DSec-style sandboxed rollout services would need more primitives than most platforms expose today, but the direction is hard to argue with: fixed training pipelines are giving way to programmable training systems.