How Fireworks scales frontier model training and offers the broadest set of fine-tunable MoE models available on any platform.
Training trillion-parameter Mixture-of-Experts (MoE) models has historically been bottlenecked by memory walls and complex cluster orchestration. Earlier this month, Cursor released Composer 2 — a frontier coding model that tops CursorBench at 61.3, SWE-bench Multilingual at 73.7, and Terminal-Bench at 61.7. Fireworks powers the Reinforcement Learning (RL) inference infrastructure behind it, proving that these bottlenecks can be overcome at scale.
We have written about delta-compressed weight sync and multi-region rollout fleets, and about why numerical parity between training and inference is especially hard for MoE models. Those posts cover the inference half of the RL loop — rollouts, weight transfer, and numerical alignment.
This post covers the last missing piece: the trainer itself. Our Training SDK provides the model catalog, parallelism stack, precision kernels, and memory optimizations that make it possible to fine-tune trillion-parameter MoE models on current hardware.
Our Training Shapes catalog supports both LoRA and full-parameter training across models in the Fireworks catalog. Customers pick a shape ID and call resolve_training_profile() — the Training SDK and API backend handle GPU layout, parallelism, and deployment bring-up automatically. Teams that want to start with managed fine-tuning and graduate to custom training loops can do so on the same platform.
| Model | Architecture | Max context (tokens) | Hardware |
|---|---|---|---|
| Qwen3.5 397B-A17B | MoE | 262K | 32x B200 |
| Qwen3.5 35B-A3B | MoE | 262K | 8x B200 |
| Qwen3 235B | MoE | 128K | 16x B200 |
| Qwen3 32B | Dense | 65K | 8x B200 |
| Qwen3 30B-A3B | MoE | 128K | 8x B200 |
| Qwen3 8B | Dense | 128K | 4x B200 |
| Qwen3 VL 8B | Dense (Vision-Language) | 65K | 4x H200 |
| Kimi K2.5 | MoE | 256K | 64x B200 |
| Kimi K2.5 LoRA | MoE | 256K | 8x B300 |
| Llama 3.3 70B | Dense | 128K | 8x B200 |
| MiniMax M2.5 | MoE | 192K | 16x B200 |
| Nemotron 3 Super 120B | Hybrid Mamba-MoE | 128K | 16x B200 |
| Nemotron Nano 3 30B-A3B | Hybrid Mamba-MoE | 262K | 8x B200 |
Both policy trainer and forward-only reference shapes are available for every model, supporting full RL workflows with separate policy and reference deployments. This is, to our knowledge, the broadest set of fine-tunable frontier MoE models available on any training platform.
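As a toy illustration of the catalog flow, here is a self-contained stand-in for the shape lookup. The shape IDs, dataclass fields, and error handling are invented for illustration; the real resolve_training_profile() also plans parallelism and deployment bring-up rather than just returning a hardware row:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingShape:
    architecture: str
    max_context: int   # tokens
    gpus: int
    gpu_type: str

# A few entries mirroring the table above (IDs are hypothetical)
CATALOG = {
    "qwen3-8b":       TrainingShape("Dense", 131_072, 4, "B200"),
    "qwen3-235b":     TrainingShape("MoE",   131_072, 16, "B200"),
    "kimi-k2p5-lora": TrainingShape("MoE",   262_144, 8, "B300"),
}

def resolve_training_profile(shape_id: str) -> TrainingShape:
    """Toy resolver: map a shape ID to its hardware layout."""
    try:
        return CATALOG[shape_id]
    except KeyError:
        raise ValueError(f"unknown training shape: {shape_id}") from None

profile = resolve_training_profile("qwen3-8b")
```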
The two training modes present very different engineering challenges. LoRA freezes most of the model and updates a small set of low-rank adapters — the question is whether the full model even fits on a single node. Full-parameter training updates every weight — the question is how to distribute a trillion parameters, their gradients, their optimizer states, and their activations across a GPU cluster while keeping utilization high. We built the engine to handle both.
LoRA fine-tuning of a 1T MoE model sounds like it should be easy — only a fraction of parameters are trainable. But the frozen base model still has to live in GPU memory. Kimi K2.5 has 384 MoE experts; in bfloat16, those experts alone consume the majority of an 8-GPU node's memory before a single gradient is computed.
Low-precision expert quantization makes it fit. We store frozen expert weights in a reduced-precision packed format, cutting expert memory by roughly 4x. The experts are dequantized to bf16 on the fly during the forward pass; because they are frozen, there is no loss of gradient precision. For Kimi K2.5, this is the difference between needing multiple nodes and fitting on a single 8-GPU node.
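The mechanics can be illustrated with a minimal 4-bit packed quantizer in NumPy. This is a simplification: the block size, scale dtype, and symmetric-absmax scheme here are illustrative choices, not the engine's actual packed format, and full precision stands in for bf16. With these choices the storage saving works out to roughly 3.6x versus bf16:

```python
import numpy as np

def pack_4bit(w, block=64):
    """Symmetric 4-bit absmax quantization with per-block scales."""
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0   # signed range [-7, 7]
    scale = np.where(scale == 0.0, 1.0, scale).astype(np.float32)
    q = (np.clip(np.round(flat / scale), -7, 7) + 7).astype(np.uint8)  # [0, 14]
    packed = (q[:, 0::2] << 4) | q[:, 1::2]                 # two values per byte
    return packed, scale

def unpack_4bit(packed, scale, shape):
    """Dequantize on the fly during the forward pass (bf16 in the real engine)."""
    hi = (packed >> 4).astype(np.float32) - 7.0
    lo = (packed & 0x0F).astype(np.float32) - 7.0
    block = 2 * packed.shape[1]
    q = np.empty((packed.shape[0], block), dtype=np.float32)
    q[:, 0::2], q[:, 1::2] = hi, lo
    return (q * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)   # one frozen expert weight
packed, scale = pack_4bit(w)
w_hat = unpack_4bit(packed, scale, w.shape)

bf16_bytes = w.size * 2                       # baseline: frozen experts in bf16
packed_bytes = packed.nbytes + scale.nbytes   # nibbles + per-block scales
```

Because the weights are frozen, the quantization error affects only the forward activations, never a gradient, which is why this trade is cheap for LoRA.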
Optimizer state offloading reclaims more headroom. Moving optimizer states between CPU and GPU frees significant GPU memory: on a Qwen3-30B MoE model (128 experts, 8 H200 GPUs), it reduces peak GPU memory by over 40% with no loss in throughput, and training results are bit-identical to the non-offloaded baseline.
Multi-session LoRA lets multiple clients independently load and hot-swap different LoRA adapters on the same shared frozen base model at runtime. Base-only handles are available for efficient reference model logprob computation in RL workflows. We validate zero state leakage across rapid adapter switches with verified cross-GPU parity.
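The invariants being validated here are easy to state in a toy model: the shared base weight is never mutated, an adapter is just a low-rank delta applied at call time, and a base-only call after any number of swaps must be bit-identical to one before. A minimal NumPy sketch (dimensions and adapter scales are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 64, 8
W = rng.standard_normal((d_model, d_model)).astype(np.float32)  # shared frozen base

def make_adapter(seed):
    g = np.random.default_rng(seed)
    A = (g.standard_normal((rank, d_model)) * 0.05).astype(np.float32)
    B = (g.standard_normal((d_model, rank)) * 0.05).astype(np.float32)
    return A, B

def forward(x, adapter=None):
    """adapter=None plays the role of the base-only handle for reference logprobs."""
    y = x @ W.T
    if adapter is not None:
        A, B = adapter
        y = y + (x @ A.T) @ B.T   # low-rank delta; W itself is never modified
    return y

x = rng.standard_normal((4, d_model)).astype(np.float32)
ref = forward(x)                            # base-only output before any swaps
session_a, session_b = make_adapter(1), make_adapter(2)

out_a = forward(x, session_a)               # session 1's adapter
out_b = forward(x, session_b)               # hot-swap: just pass a different adapter
after = forward(x)                          # base output must be unchanged (no leakage)
```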
Full-parameter training updates every weight in the model — which means every weight also needs a gradient and a full set of optimizer states. For MoE architectures, expert counts that don't divide evenly into GPU counts create load-balancing challenges, and expert dispatch adds an all-to-all communication at every MoE layer. Dense models avoid the routing complexity but still hit memory walls at large scale. Making full-parameter training work across the range of architectures in our catalog — from 8B dense models on a single node to 1T MoE models on multi-node clusters — required solving problems across compute, memory, communication, and scheduling simultaneously.
No single parallelism strategy covers every model shape. Our engine composes four dimensions from a single configuration — FSDP, Pipeline Parallelism, Context Parallelism, and Expert Parallelism — each addressing a different bottleneck. The Training SDK selects the right combination for each model: a dense 8B model may need only FSDP, while a 1T MoE at 256K context uses all four.
We break down the workload across these dimensions:
- FSDP shards parameters, gradients, and optimizer states across ranks, attacking per-GPU memory.
- Pipeline Parallelism splits the layer stack into stages so a model too large for one node can span several.
- Context Parallelism partitions the sequence dimension, bounding activation memory at long context lengths.
- Expert Parallelism distributes MoE experts across GPUs, with all-to-all dispatch between them.
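One way to picture the composition is as a mixed-radix layout: the world size factors into the four dimension sizes, and every global rank maps to a unique coordinate in each. The sizes below are hypothetical, and this sketch ignores everything the real engine does with the mapping (process groups, collectives, placement):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layout:
    fsdp: int  # sharded data-parallel degree
    pp: int    # pipeline stages
    cp: int    # context (sequence) shards
    ep: int    # expert-parallel groups

def coords(rank, l):
    """Map a global rank to (fsdp, pp, cp, ep) coordinates, mixed-radix."""
    ep, rank = rank % l.ep, rank // l.ep
    cp, rank = rank % l.cp, rank // l.cp
    pp, rank = rank % l.pp, rank // l.pp
    fsdp = rank % l.fsdp
    return fsdp, pp, cp, ep

# Hypothetical 32-GPU MoE job: 2-way FSDP x 2-stage PP x 1 CP x 8 EP
layout = Layout(fsdp=2, pp=2, cp=1, ep=8)
world = layout.fsdp * layout.pp * layout.cp * layout.ep
mapping = {r: coords(r, layout) for r in range(world)}
```

A dense 8B model would set everything but fsdp to 1; a 1T MoE at long context would use non-trivial values in all four slots.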
Full-parameter training is compute- and memory-intensive, so numerical precision matters. For MoE expert computation, we use MXFP8 native grouped GEMMs that leverage Blackwell's block-scaled tensor core matrix multiplications — the hardware dequantizes during the systolic-array multiply, not in a separate kernel.
On DeepSeek V3-class expert shapes (32 experts per rank, 7168 hidden dimension, 2048 intermediate), this delivers a significant speedup over BF16 across both forward and backward passes. Across all tested configurations including Qwen3-235B shapes, the speedup is consistent while maintaining end-to-end numerical fidelity: symmetric KL divergence stays below 0.0063 for every configuration, well within our 0.01 acceptance threshold.
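The block-scaled idea can be simulated in software, with the caveat that on Blackwell the dequantization happens inside the tensor-core multiply, not in a separate kernel as here. This sketch crudely approximates e4m3 rounding with a 3-bit mantissa and uses a shared per-32-element block scale; it is an error-scale demo, not the production kernel:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite e4m3 value

def fp8_e4m3_round(x):
    """Crude e4m3 simulation: round the mantissa to 3 explicit bits.
    Ignores subnormal and saturation details; fine for an error-scale demo."""
    m, e = np.frexp(x)                        # x = m * 2**e, |m| in [0.5, 1)
    return np.ldexp(np.round(m * 16.0) / 16.0, e)

def mx_quantize(x, block=32):
    """Per-block shared scale (the 'MX' part), then fp8-style rounding."""
    flat = x.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / E4M3_MAX
    scale = np.where(scale == 0.0, 1.0, scale)
    return (fp8_e4m3_round(flat / scale) * scale).reshape(x.shape)

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 512)).astype(np.float32)
b = rng.standard_normal((512, 256)).astype(np.float32)

exact = a @ b
approx = mx_quantize(a) @ mx_quantize(b)      # GEMM on block-scaled operands
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
```

The relative error lands in the low single-digit percent range for random operands, which is the regime where the end-to-end KL checks described above become the meaningful acceptance test.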
For attention, we integrate FA4 (CuTeDSL) kernels designed for Blackwell's SM100 architecture, handling the native Multi-head Latent Attention (MLA) shapes used by DeepSeek V3 and Kimi K2.5 — QK dimension 192, V dimension 128 — in both forward and backward passes without padding or reshaping. We collaborated with the community on the Flash Attention 4 backward kernel for these specific dimensions.
We also support FP8 Quantization-Aware Training (QAT), where fake-quantization operations in training exactly mirror the inference engine's math. Models trained with QAT deploy at reduced precision with matching behavior — no post-training quantization surprises.
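The core QAT property is that the training-time fake-quant op and the deployed quantized weights evaluate to exactly the same numbers. The sketch below uses a simple symmetric int8 per-tensor scheme purely for illustration (the real stack uses FP8, and gradients flow through a straight-through estimator during training):

```python
import numpy as np

def quantize(w, scale):
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8)

def fake_quant(w, scale):
    """Training-time op: quantize then dequantize inside the forward pass."""
    return quantize(w, scale).astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)
x = rng.standard_normal((8, 128)).astype(np.float32)
scale = np.float32(np.abs(w).max() / 127.0)

train_out = x @ fake_quant(w, scale).T                  # what training saw
w_int8 = quantize(w, scale)                             # what gets deployed
infer_out = x @ (w_int8.astype(np.float32) * scale).T   # inference-engine math
```

Because fake_quant and the deployed dequantization are the same expression, the two outputs match bit for bit, which is the "no post-training quantization surprises" guarantee in miniature.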
Custom loss functions in the Training SDK use forward_backward_custom, which executes two model forward passes: one to extract per-token log-probabilities, and a second forward-backward pass that propagates gradients through a cross-entropy surrogate. This generality lets you implement any RL objective in Python — but it doubles the forward-pass cost.
For production RL algorithms — GRPO, DRO, DAPO, GSPO, CISPO, and standard SFT cross-entropy — we fuse the loss computation into the forward pass itself, eliminating the extra round trip.
```python
from fireworks.training.sdk import FiretitanServiceClient

service = FiretitanServiceClient(base_url=endpoint, api_key=api_key)
policy = service.create_training_client(base_model=base_model)

# Custom path — arbitrary loss, 2 forward passes + 1 backward
def my_loss(data, logprobs_list):
    loss = custom_objective(logprobs_list, advantages)
    return loss, {"custom_metric": loss.item()}

result = policy.forward_backward_custom(data, my_loss)

# Built-in fused path — 1 forward + 1 backward, loss computed on-device
result = policy.forward_backward(
    data, "ppo", {"clip_low_threshold": 0.8, "clip_high_threshold": 1.2}
)
```
On a Qwen3.5-35B MoE model running on 8 H200 GPUs:
| Algorithm | Speedup (fused vs. two-pass) |
|---|---|
| GRPO | ~2x |
| DRO | ~1.7x |
| DAPO | ~1.4x |
| SFT | ~1.3x |
All fused losses are numerically identical to the two-pass reference at step one and remain within the noise floor of MoE routing non-determinism at subsequent steps.
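For reference, the kind of clipped-ratio surrogate these fused kernels compute looks like the generic PPO/GRPO-style objective below, written with the clip thresholds from the example above. This is a NumPy sketch of the math, not Fireworks' kernel, and the "two-pass" path here only mimics extracting logprobs first and feeding them to a Python loss function:

```python
import numpy as np

def clipped_ratio_loss(logp_new, logp_old, adv, lo=0.8, hi=1.2):
    """Per-token clipped-ratio surrogate, averaged over the batch."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, lo, hi) * adv
    return -np.minimum(unclipped, clipped).mean()

rng = np.random.default_rng(0)
logp_old = rng.normal(-2.0, 0.5, size=(4, 16))
logp_new = logp_old + rng.normal(0.0, 0.2, size=(4, 16))
adv = rng.normal(size=(4, 1))          # one advantage per sequence

# Two-pass path: extract logprobs, then evaluate the loss in Python.
extracted = logp_new.copy()
two_pass = clipped_ratio_loss(extracted, logp_old, adv)

# Fused path: the same math evaluated inside the forward pass.
fused = clipped_ratio_loss(logp_new, logp_old, adv)
```

Since both paths evaluate the identical expression on identical inputs, the results match exactly; in the real system the parity check additionally has to survive MoE routing non-determinism across steps.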
Our Training API uses an HTTP-based interface where data items are sent to the trainer online. Standard pipeline parallelism implementations assume batch-oriented data loading — a mismatch with RL workloads where rollout data arrives asynchronously. We redesigned our pipeline schedule to begin execution as data arrives, eliminating the batch-accumulation bottleneck.
The result is up to an order-of-magnitude improvement in first-result latency for RL workloads, depending on model size and traffic pattern — the benefit is largest when input QPS is low relative to batch size, which is common in RL rollout settings. Loss parity is exact: the streaming schedule produces the same gradients as the accumulated batch.
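The gradient-parity claim rests on gradients being additive across micro-batches: summing per-micro-batch gradients as items arrive yields the same result as waiting for the full batch. A toy sketch with a linear model and summed squared error (nothing Fireworks-specific about the model or the 4-item micro-batch size):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))   # rollout items arriving over time
y = rng.standard_normal(32)
w = rng.standard_normal(8)

def grad(xb, yb, w):
    """Gradient of the summed squared error for one micro-batch."""
    return xb.T @ (xb @ w - yb)

# Batch-accumulation schedule: wait for all 32 items, then one big pass.
batch_grad = grad(X, y, w)

# Streaming schedule: start as soon as each 4-item micro-batch arrives.
stream_grad = np.zeros_like(w)
for i in range(0, 32, 4):
    stream_grad += grad(X[i:i+4], y[i:i+4], w)
```

The two gradients agree up to floating-point summation order, so the streaming schedule changes only when compute happens, not what is computed.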
Each model family required deep distributed-training engineering to bring up — and the numerical parity pitfalls we cataloged for MoE serving apply equally to training. Qwen3.5-35B alone required solving 9 distinct gradient correctness bugs across shared experts, router gates, GQA, and DeltaNet layers. Every model ships with SFT memorization validation confirming end-to-end numerical correctness.
The training shapes available today extend to 256K tokens of context. We are actively pushing that frontier.
We have validated training of trillion-parameter MoE models at over one million tokens of context on GB200 GPU clusters. To our knowledge, no other published system has demonstrated MoE training at this combination of model scale and context length. The closest comparisons:
| System | Total Params | Max Train Context | Architecture |
|---|---|---|---|
| DeepSeek V3 | 671B | 128K | MoE |
| Llama 3.1 | 405B | 128K | Dense |
| Qwen3-235B | 235B | 262K | MoE |
| Nemotron 3 Super | 120B | 1M | Hybrid Mamba-MoE |
While Nemotron 3 Super reaches 1M context, it does so at 120B total parameters — 8.5x smaller and built on a fundamentally different Mamba-Transformer hybrid architecture. DeepSeek V3 is closest in model scale but trains at 128K context, an order of magnitude shorter.
Getting a trillion-parameter MoE to train at 1M context required systematic bottleneck analysis across attention scheduling, collective communication, and memory management. Through a progression of optimizations, we achieved multi-fold throughput improvements over our initial baseline. This capability is on the path to becoming a production training shape.
FP8 attention is projected to deliver substantial further throughput gains at long context lengths, where attention dominates compute time. Combined with our MXFP8 expert kernels and QAT support, this continues to push the Pareto frontier of training speed versus numerical fidelity.
We are validating additional GPU topologies, characterizing the interconnect trade-offs for different training workloads to ensure customers get optimal performance regardless of cluster configuration.
Our training stack is validated internally on additional MoE and hybrid architectures beyond those listed above. As these models become commercially relevant, they will be added to the training shapes catalog. Alternative optimizers for expert-parallel MoE training are also in development.
Fireworks supports managed fine-tuning (SFT, DPO, and RFT through the API), the Training SDK for custom training loops, and bring-your-own-trainer setups for teams that want Fireworks-hosted rollout inference.
Ready to scale your MoE training? Start with the Training Shapes catalog or the Training SDK quickstart.
To talk through a training architecture or pair your trainer with Fireworks-hosted RL rollouts, email [email protected], reach out on Discord, or register your interest here.