Notes on DeepSeek-V4's training system

Notes on DeepSeek-V4: hybrid attention, anticipatory routing, reasoning modes, generative reward models, on-policy distillation, and what they imply for programmable training infrastructure.

DeepSeek-V4 is interesting less for any single benchmark number than for the shape of the system around it. The paper shows architecture, routing, reward modeling, reasoning modes, distillation, and agent execution all becoming part of the training loop.

The useful takeaway for training infrastructure is obvious: fixed recipes are not enough. Researchers increasingly need programmable loops, while the platform handles distributed execution, inference integration, checkpointing, and scaling underneath. Supporting that flexibility is the core design principle behind the Fireworks Training API.

Six ingredients in DeepSeek-V4's training recipe

1. Long context is becoming a memory hierarchy

DeepSeek-V4 alternates between Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses KV entries and then does sparse top-k selection. HCA compresses more aggressively, but keeps dense attention over the compressed memory.

The point is not just "longer context." It is model/runtime co-design: attention pattern, KV layout, precision, sparse selection, and inference kernels all have to line up. Training has the same problem. If serving uses custom kernels and compressed caches, evaluation during training needs to be close enough to serving that we are not optimizing against the wrong system.

Training platform design note: This is where training shapes, checkpoint promotion, and weight sync into deployments become relevant. Architecture-specific work is not just a loss function; the platform has to launch the right trainer, save usable checkpoints, and evaluate the same model/runtime combination that will serve users.

Figure: one query, two cache contracts. Both layer types keep recent tokens exact; they differ in how older history is compressed and read. CSA keeps more old blocks at moderate compression (m tokens per block), and an indexer picks the top-k blocks each query actually reads, so each query sees a top-k subset of the old blocks plus the uncompressed recent tail. HCA keeps fewer, heavily compressed blocks (m' >> m tokens per block), small enough to read densely, so each query sees all compact blocks plus the same exact recent tail. Training evals have to reproduce this cache contract, not just load the same weights.
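
To make the contract concrete, here is a minimal sketch of the two read paths. The block layout and the mean-pooled indexer score are toy assumptions for illustration, not DeepSeek's kernels:

```python
import numpy as np

def csa_read(query, old_blocks, recent_tail, k):
    # CSA: score each moderately compressed block against the query
    # (a mean-pooled dot product stands in for the indexer), keep the
    # top-k, and attend to those plus the exact recent tail.
    scores = np.array([block.mean(axis=0) @ query for block in old_blocks])
    selected = [old_blocks[i] for i in np.argsort(scores)[-k:]]
    return np.concatenate(selected + [recent_tail])

def hca_read(query, compact_blocks, recent_tail):
    # HCA: the heavily compressed old cache is small enough that every
    # query reads all compact blocks densely, plus the exact recent tail.
    return np.concatenate(list(compact_blocks) + [recent_tail])
```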

2. MoE routing is a stability problem

The most interesting pretraining trick is Anticipatory Routing. DeepSeek reports that loss spikes were tied to MoE outliers and routing. Their fix decouples features from routes: at step t, features are computed with current weights, but routing indices come from older weights, theta_{t - delta}.

To avoid running the model twice, they prefetch a future batch, compute its routing decisions early with the older router, cache those routes, and reuse them later. They report about 20% overhead while this mode is active, and only turn it on after a spike detector triggers rollback.

This is not a clean new objective. It is a conditional runtime intervention: detect instability, roll back, change routing behavior, cache side-channel data, then return to normal training.

Training platform design note: Fireworks has adjacent primitives in its rollout/training stack: rollout sampling can return per-token logprobs, MoE rollout paths can carry routing metadata such as routing_matrices, and training datums can carry model inputs plus side-channel fields. That is not DeepSeek's full historical-router system, but it points in the same direction: routing decisions sometimes need to become data in the training loop.

Figure: fresh features, old routes. In a prefetch phase, the older router theta(t-delta) looks ahead at batch B_t and decides which expert each token should go to; those routes are cached, keyed by batch id, until step t. At step t, the same tokens are trained with fresh activations from the current weights theta(t), and MoE dispatch sends those current activations to the cached experts. Only the routes are stale; the forward/backward pass still trains the current model.
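
A minimal sketch of the cache mechanics, using hypothetical names (old_router, model.backbone, model.moe_dispatch); the spike detector and rollback that gate this mode in the paper are omitted:

```python
import torch

route_cache = {}  # batch_id -> cached expert ids, kept until the batch trains

def prefetch_routes(old_router, batch_id, tokens):
    # Look ahead: run only the older router theta(t - delta) on a future
    # batch and cache its decisions, so the model never runs twice per step.
    with torch.no_grad():
        router_logits = old_router(tokens)  # [num_tokens, num_experts]
        route_cache[batch_id] = router_logits.argmax(dim=-1)

def train_step(model, batch_id, tokens):
    # Features come from the current weights theta(t); only the routes are
    # stale. The forward/backward pass still trains the current model.
    expert_ids = route_cache.pop(batch_id)
    features = model.backbone(tokens)
    return model.moe_dispatch(features, expert_ids=expert_ids)
```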

3. Reasoning effort is trained behavior

DeepSeek-V4 exposes three modes from the same weights: Non-think, Think High, and Think Max. These are trained with different RL configurations, length penalties, context windows, and response formats. Think Max also gets an explicit system instruction pushing exhaustive reasoning.

This makes "reasoning effort" less mysterious. It is not just a runtime flag; it is a behavior contract backed by data, reward design, formatting, and evaluation.

Training platform design note: A programmable loop can treat modes as training conditions: vary prompt format, response template, sampling budget, reward shaping, loss weights, and eval criteria per phase. Rollout inference plus checkpoint hotloading then lets you test whether the modes actually diverge behaviorally.

Figure: one model, three trained behaviors. The figure walks the same question, "What is 17 × 23?", through each mode. Non-think is asked to answer directly, rewarded for being short and correct, and graded on the final answer ("391."). Think High is asked to reason step by step, rewarded for clear, correct steps, and graded on steps and answer. Think Max is asked to reason exhaustively, rewarded for depth and rigor, and graded on the whole reasoning trace. Reasoning effort is not a runtime flag; it is behavior the model was trained, separately, to produce.
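
As a sketch of "modes as training conditions," each mode could be a phase config like the one below. The field names and values are invented for illustration, not DeepSeek's recipe:

```python
# Hypothetical per-mode training conditions for a programmable loop.
MODES = {
    "non_think": dict(
        system_prompt=None,
        response_template="{answer}",
        max_response_tokens=512,
        length_penalty=0.5,            # reward short, correct answers
        grader="final_answer_match",
    ),
    "think_high": dict(
        system_prompt="Reason step by step before answering.",
        response_template="<think>{reasoning}</think>{answer}",
        max_response_tokens=8192,
        length_penalty=0.1,
        grader="steps_and_answer",
    ),
    "think_max": dict(
        system_prompt="Reason exhaustively; explore alternatives first.",
        response_template="<think>{reasoning}</think>{answer}",
        max_response_tokens=65536,
        length_penalty=0.0,            # depth and rigor, not brevity
        grader="full_trace_rubric",
    ),
}
```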

4. Reward modeling becomes generative

For hard-to-verify tasks, DeepSeek says it discards conventional scalar reward models and uses a Generative Reward Model (GRM). The actor model itself acts as the judge, and RL optimizes that evaluative behavior alongside generation.

This matters because many valuable tasks are not easy to score with exact-match checks: writing, design, tool use, research synthesis, subjective quality, and long-horizon agent behavior. In those cases, evaluation looks more like deliberation than scalar prediction.

There are obvious risks: self-judging models can learn blind spots or reward hacking. But the direction is important. Reward is moving from a separately trained scalar reward model toward model-generated evaluative behavior optimized with RL.

Training platform design note: The relevant abstraction is a sample -> judge -> train loop. Generate candidates, call a verifier/tool/evaluator/model-as-judge, turn feedback into reward or custom loss, then update the policy. The platform should not need to know in advance whether feedback came from a unit test, rubric, human label, or generative judge.

Figure: from a separate score to model judgment. The old way grades a response with a separate scalar reward model, a small extra model trained to output one number with no reasoning attached; fine for math, where answers are right or wrong, but brittle for writing, design, or tool use. The new way has the actor model itself judge trajectories against task-specific rubrics, with generation and evaluation optimized together by RL and no separate scalar reward model. The platform boundary is the same either way: sample trajectories, evaluate them, turn the signal into an update.
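
A minimal shape for that loop; policy, judge, and the group-relative baseline are illustrative stand-ins, not a specific platform API:

```python
def sample_judge_train_step(policy, judge, prompts, n_samples=8):
    # Generate candidates, score them, turn feedback into advantages,
    # and update the policy. The judge could be a unit test, a rubric,
    # a human label, or the actor itself acting as a generative reward
    # model; this loop does not need to know which.
    for prompt in prompts:
        candidates = [policy.sample(prompt) for _ in range(n_samples)]
        rewards = [judge.score(prompt, c) for c in candidates]
        baseline = sum(rewards) / len(rewards)   # group-relative baseline
        advantages = [r - baseline for r in rewards]
        policy.update(prompt, candidates, advantages)
```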

5. Specialists first, then on-policy distillation

DeepSeek does not train the final model with one mixed RL stage. It first trains domain specialists. For each domain, it starts with supervised fine-tuning on high-quality domain data, then runs GRPO with domain-specific prompts and reward signals.

Those specialists are then merged through On-Policy Distillation (OPD). The student samples its own rollouts, and more than ten teacher models provide target distributions on those trajectories. This avoids directly merging weights or hoping one composite reward can produce every behavior at once.

The expensive part is full-vocabulary logit distillation. Instead of matching only sampled tokens, DeepSeek matches the full teacher distribution. To make that feasible, it caches teacher last-layer hidden states and reconstructs logits on the fly with the relevant teacher head.
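
A sketch of that caching trick, with assumed names. The payoff is that last-layer hidden states are d_model wide, far smaller than [seq_len, vocab] logits, so they are cheap to keep around:

```python
import torch

hidden_cache = {}  # (teacher_id, rollout_id) -> [seq_len, d_model]

def cache_teacher_hidden(teacher, teacher_id, rollout_id, tokens):
    # Run the teacher backbone once per rollout; keep only hidden states.
    with torch.no_grad():
        hidden_cache[(teacher_id, rollout_id)] = teacher.backbone(tokens)

def teacher_full_logits(teacher_head, teacher_id, rollout_id):
    # Reconstruct full-vocabulary logits on the fly with the matching head.
    h = hidden_cache[(teacher_id, rollout_id)]
    return h @ teacher_head.weight.T  # [seq_len, vocab]
```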

Training platform design note: The concrete loop is rollout sampling, teacher/reference forward passes, custom loss computation, optimizer steps, and checkpoint/weight sync back into deployment for evaluation. DeepSeek-style full-vocab OPD would need extra primitives for full logits or hidden-state streaming, but the outer loop is exactly the kind of thing a programmable training API should express.

Figure: many experts review the student's own draft. Trained specialists (a math expert, a code expert, an agent expert) each predict what should come next at every position of an answer the student itself wrote, and the student updates toward the blend of their full next-token distributions. Corrections land on the student's actual mistakes, not on a teacher's perfect example.
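
The objective itself can be sketched as a full-vocabulary KL between the student and its teachers on the student's own rollout. How the ten-plus teachers are combined is not specified above, so the uniform mixture below is an assumption:

```python
import torch.nn.functional as F

def opd_loss(student_logits, teacher_logits_list):
    # student_logits: [seq_len, vocab] on the student's own rollout.
    # teacher_logits_list: per-teacher [seq_len, vocab] on that same
    # trajectory, reconstructed from cached hidden states as above.
    teacher_probs = sum(F.softmax(t, dim=-1) for t in teacher_logits_list)
    teacher_probs = teacher_probs / len(teacher_logits_list)  # uniform mix
    log_p_student = F.log_softmax(student_logits, dim=-1)
    # Match the full teacher distribution, not just the sampled tokens:
    # KL(teacher mixture || student), averaged over positions.
    return F.kl_div(log_p_student, teacher_probs, reduction="batchmean")
```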

6. Agentic training pulls runtime into the loop

DeepSeek-V4 also changes how reasoning state is handled in tool-use conversations. For agentic workflows, it preserves reasoning traces across user/tool boundaries instead of flushing them after every user message. For ordinary chat, it still discards prior reasoning to keep context concise.

They also introduce Quick Instruction: special tokens appended to the main context for auxiliary decisions like search/no-search, query generation, domain classification, and URL reading. Instead of calling a separate controller model and paying redundant prefill, these decisions reuse the main model's KV cache.

Finally, agentic training needs sandboxes. DeepSeek describes a preemptible rollout service with token-granular write-ahead logs, and DeepSeek Elastic Compute (DSec) for sandbox execution. DSec keeps ordered trajectory logs of commands and results, which makes tool execution part of the training/evaluation record.
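
A toy version of a durable, ordered trajectory log; the record schema is invented, logging per command/result rather than per token for brevity, and DSec's actual write-ahead format is not public:

```python
import json
import time

class TrajectoryLog:
    # Append-only log: every command and result is written in order before
    # the rollout proceeds, so a preempted rollout can be replayed from it.
    def __init__(self, path):
        self.f = open(path, "a")

    def append(self, kind, payload):
        record = {"ts": time.time(), "kind": kind, "payload": payload}
        self.f.write(json.dumps(record) + "\n")
        self.f.flush()

log = TrajectoryLog("rollout_0.wal")
log.append("command", {"tool": "search", "args": {"q": "eiffel tower height"}})
log.append("result", {"text": "330 m"})
```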

Training platform design note: For agentic RL, rollout execution is part of the learning system. Sampling, tool/evaluator integration, durable trajectory logs, checkpointing, and replayable execution environments matter as much as the optimizer.

Figure: ask the same model instead of a new one. Side decisions like "should I search?" do not need a separate controller. The old way sends the whole conversation to a second model, which re-reads everything the main model already processed before answering; the new way appends one special token such as <search?> and lets the main model continue from its existing KV cache, so the conversation is read once. Agentic RL needs more than optimizer code: quick decisions, tool execution, durable trajectory logs, replayable sandboxes, and reward evaluation all sit in the loop.
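
At decode time, the pattern might look like the sketch below; the generate interface and the <search?> token are assumptions for illustration:

```python
def quick_decision(model, kv_cache, decision_token="<search?>"):
    # The conversation is already encoded in kv_cache. Instead of prefilling
    # a separate controller with the same tokens, append one special token
    # and let the main model answer the side question from its own cache.
    answer, kv_cache = model.generate(
        input_tokens=[decision_token],  # one token appended, nothing re-read
        past_key_values=kv_cache,       # reuse the main model's KV cache
        max_new_tokens=4,               # e.g. "yes" / "no"
    )
    return answer, kv_cache
```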

What this adds up to

The through-line across these six pieces of DeepSeek-V4 is the same: important training decisions now reach into runtime, evaluation, and infrastructure. Architecture co-designs with serving, routing becomes a stability problem with side-channel state, and reasoning effort, reward, distillation, and tool use all turn into programmable loops over rollouts, judges, checkpoints, and execution logs.

That is the bet behind the Fireworks Training API: expose the loop, and handle the system underneath. Full-vocabulary OPD, historical-router caches, and DSec-style sandboxed rollout services would need more primitives than most platforms expose today, but the direction is hard to argue with: fixed training pipelines are giving way to programmable training systems.