How Fireworks scales frontier model training and offers the broadest set of fine-tunable MoE models available on any platform.
Training trillion-parameter Mixture-of-Experts (MoE) models has historically been bottlenecked by memory walls and complex cluster orchestration. Earlier this month, Cursor released Composer 2 — a frontier coding model that tops CursorBench at 61.3, SWE-bench Multilingual at 73.7, and Terminal-Bench at 61.7. Fireworks powers the Reinforcement Learning (RL) inference infrastructure behind it, proving that these bottlenecks can be overcome at scale.
We have written about delta-compressed weight sync and multi-region rollout fleets, and about why numerical parity between training and inference is especially hard for MoE models. Those posts cover the inference half of the RL loop — rollouts, weight transfer, and numerical alignment.
This post covers the last missing piece: the trainer itself. Our Training SDK provides the model catalog, parallelism stack, precision kernels, and memory optimizations that make it possible to fine-tune trillion-parameter MoE models on current hardware.
Our Training Shapes catalog supports both LoRA and full-parameter training across models in the Fireworks catalog. Customers pick a shape ID and call resolve_training_profile() — the Training SDK and API backend handle GPU layout, parallelism, and deployment bring-up automatically. Teams that want to start with managed fine-tuning and graduate to custom training loops can do so on the same platform.
| Model | Architecture | Max context (tokens) | Hardware |
|---|---|---|---|
| Qwen3.5 397B-A17B | MoE | 262K | 32x B200 |
| Qwen3.5 35B-A3B | MoE | 262K | 8x B200 |
| Qwen3 235B | MoE | 128K | 16x B200 |
| Qwen3 32B | Dense | 65K | 8x B200 |
| Qwen3 30B-A3B | MoE | 128K | 8x B200 |
| Qwen3 8B | Dense | 128K | 4x B200 |
| Qwen3 VL 8B | Dense (Vision-Language) | 65K | 4x H200 |
| Kimi K2.5 | MoE | 256K | 64x B200 |
| Kimi K2.5 LoRA | MoE | 256K | 8x B300 |
| Llama 3.3 70B | Dense | 128K | 8x B200 |
| MiniMax M2.5 | MoE | 192K | 16x B200 |
| Nemotron 3 Super 120B | Hybrid Mamba-MoE | 128K | 16x B200 |
| Nemotron Nano 3 30B-A3B | Hybrid Mamba-MoE | 262K | 8x B200 |
Both policy trainer and forward-only reference shapes are available for every model, supporting full RL workflows with separate policy and reference deployments. This is, to our knowledge, the broadest set of fine-tunable frontier MoE models available on any training platform.
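As a toy illustration of the catalog flow, here is a self-contained stand-in for the shape lookup. The shape IDs, dataclass fields, and error handling are invented for illustration; the real resolve_training_profile() also plans parallelism and deployment bring-up rather than just returning a hardware row:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingShape:
    architecture: str
    max_context: int   # tokens
    gpus: int
    gpu_type: str

# A few entries mirroring the table above (IDs are hypothetical)
CATALOG = {
    "qwen3-8b":       TrainingShape("Dense", 131_072, 4, "B200"),
    "qwen3-235b":     TrainingShape("MoE",   131_072, 16, "B200"),
    "kimi-k2p5-lora": TrainingShape("MoE",   262_144, 8, "B300"),
}

def resolve_training_profile(shape_id: str) -> TrainingShape:
    """Toy resolver: map a shape ID to its hardware layout."""
    try:
        return CATALOG[shape_id]
    except KeyError:
        raise ValueError(f"unknown training shape: {shape_id}") from None

profile = resolve_training_profile("qwen3-8b")
```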
The two training modes present very different engineering challenges. LoRA freezes most of the model and updates a small set of low-rank adapters — the question is whether the full model even fits on a single node. Full-parameter training updates every weight — the question is how to distribute a trillion parameters, their gradients, their optimizer states, and their activations across a GPU cluster while keeping utilization high. We built the engine to handle both.
LoRA fine-tuning of a 1T MoE model sounds like it should be easy — only a fraction of parameters are trainable. But the frozen base model still has to live in GPU memory. Kimi K2.5 has 384 MoE experts; in bfloat16, those experts alone consume the majority of an 8-GPU node's memory before a single gradient is computed.
Low-precision expert quantization makes it fit. We store frozen expert weights in a reduced-precision packed format, cutting expert memory by roughly 4x. The experts are dequantized to bf16 on the fly during the forward pass; because they are frozen, there is no loss of gradient precision. For Kimi K2.5, this is the difference between needing multiple nodes and fitting on a single 8-GPU node.
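The mechanics can be illustrated with a minimal 4-bit packed quantizer in NumPy. This is a simplification: the block size, scale dtype, and symmetric-absmax scheme here are illustrative choices, not the engine's actual packed format, and full precision stands in for bf16. With these choices the storage saving works out to roughly 3.6x versus bf16:

```python
import numpy as np

def pack_4bit(w, block=64):
    """Symmetric 4-bit absmax quantization with per-block scales."""
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0   # signed range [-7, 7]
    scale = np.where(scale == 0.0, 1.0, scale).astype(np.float32)
    q = (np.clip(np.round(flat / scale), -7, 7) + 7).astype(np.uint8)  # [0, 14]
    packed = (q[:, 0::2] << 4) | q[:, 1::2]                 # two values per byte
    return packed, scale

def unpack_4bit(packed, scale, shape):
    """Dequantize on the fly during the forward pass (bf16 in the real engine)."""
    hi = (packed >> 4).astype(np.float32) - 7.0
    lo = (packed & 0x0F).astype(np.float32) - 7.0
    block = 2 * packed.shape[1]
    q = np.empty((packed.shape[0], block), dtype=np.float32)
    q[:, 0::2], q[:, 1::2] = hi, lo
    return (q * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)   # one frozen expert weight
packed, scale = pack_4bit(w)
w_hat = unpack_4bit(packed, scale, w.shape)

bf16_bytes = w.size * 2                       # baseline: frozen experts in bf16
packed_bytes = packed.nbytes + scale.nbytes   # nibbles + per-block scales
```

Because the weights are frozen, the quantization error affects only the forward activations, never a gradient, which is why this trade is cheap for LoRA.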
Optimizer state offloading reclaims more headroom. Moving optimizer states between CPU and GPU frees significant GPU memory: on a Qwen3-30B MoE model (128 experts, 8 H200 GPUs), it reduces peak GPU memory by over 40% with no loss in throughput, and training results are bit-identical to the non-offloaded baseline.
Multi-session LoRA lets multiple clients independently load and hot-swap different LoRA adapters on the same shared frozen base model at runtime. Base-only handles are available for efficient reference model logprob computation in RL workflows. We validate zero state leakage across rapid adapter switches with verified cross-GPU parity.
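The invariants being validated here are easy to state in a toy model: the shared base weight is never mutated, an adapter is just a low-rank delta applied at call time, and a base-only call after any number of swaps must be bit-identical to one before. A minimal NumPy sketch (dimensions and adapter scales are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 64, 8
W = rng.standard_normal((d_model, d_model)).astype(np.float32)  # shared frozen base

def make_adapter(seed):
    g = np.random.default_rng(seed)
    A = (g.standard_normal((rank, d_model)) * 0.05).astype(np.float32)
    B = (g.standard_normal((d_model, rank)) * 0.05).astype(np.float32)
    return A, B

def forward(x, adapter=None):
    """adapter=None plays the role of the base-only handle for reference logprobs."""
    y = x @ W.T
    if adapter is not None:
        A, B = adapter
        y = y + (x @ A.T) @ B.T   # low-rank delta; W itself is never modified
    return y

x = rng.standard_normal((4, d_model)).astype(np.float32)
ref = forward(x)                            # base-only output before any swaps
session_a, session_b = make_adapter(1), make_adapter(2)

out_a = forward(x, session_a)               # session 1's adapter
out_b = forward(x, session_b)               # hot-swap: just pass a different adapter
after = forward(x)                          # base output must be unchanged (no leakage)
```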
Full-parameter training updates every weight in the model — which means every weight also needs a gradient and a full set of optimizer states. For MoE architectures, expert counts that don't divide evenly into GPU counts create load-balancing challenges, and expert dispatch adds an all-to-all communication at every MoE layer. Dense models avoid the routing complexity but still hit memory walls at large scale. Making full-parameter training work across the range of architectures in our catalog — from 8B dense models on a single node to 1T MoE models on multi-node clusters — required solving problems across compute, memory, communication, and scheduling simultaneously.
No single parallelism strategy covers every model shape. Our engine composes four dimensions from a single configuration — FSDP, Pipeline Parallelism, Context Parallelism, and Expert Parallelism — each addressing a different bottleneck. The Training SDK selects the right combination for each model: a dense 8B model may need only FSDP, while a 1T MoE at 256K context uses all four.
We break down the workload across these dimensions:
- FSDP shards parameters, gradients, and optimizer states across ranks, attacking per-GPU memory.
- Pipeline Parallelism splits the layer stack into stages so a model too large for one node can span several.
- Context Parallelism partitions the sequence dimension, bounding activation memory at long context lengths.
- Expert Parallelism distributes MoE experts across GPUs, with all-to-all dispatch between them.
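One way to picture the composition is as a mixed-radix layout: the world size factors into the four dimension sizes, and every global rank maps to a unique coordinate in each. The sizes below are hypothetical, and this sketch ignores everything the real engine does with the mapping (process groups, collectives, placement):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layout:
    fsdp: int  # sharded data-parallel degree
    pp: int    # pipeline stages
    cp: int    # context (sequence) shards
    ep: int    # expert-parallel groups

def coords(rank, l):
    """Map a global rank to (fsdp, pp, cp, ep) coordinates, mixed-radix."""
    ep, rank = rank % l.ep, rank // l.ep
    cp, rank = rank % l.cp, rank // l.cp
    pp, rank = rank % l.pp, rank // l.pp
    fsdp = rank % l.fsdp
    return fsdp, pp, cp, ep

# Hypothetical 32-GPU MoE job: 2-way FSDP x 2-stage PP x 1 CP x 8 EP
layout = Layout(fsdp=2, pp=2, cp=1, ep=8)
world = layout.fsdp * layout.pp * layout.cp * layout.ep
mapping = {r: coords(r, layout) for r in range(world)}
```

A dense 8B model would set everything but fsdp to 1; a 1T MoE at long context would use non-trivial values in all four slots.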
Full-parameter training is compute- and memory-intensive, so numerical precision matters. For MoE expert computation, we use MXFP8 native grouped GEMMs that leverage Blackwell's block-scaled tensor core matrix multiplications — the hardware dequantizes during the systolic-array multiply, not in a separate kernel.
On DeepSeek V3-class expert shapes (32 experts per rank, 7168 hidden dimension, 2048 intermediate), this delivers a significant speedup over BF16 across both forward and backward passes. Across all tested configurations including Qwen3-235B shapes, the speedup is consistent while maintaining end-to-end numerical fidelity: symmetric KL divergence stays below 0.0063 for every configuration, well within our 0.01 acceptance threshold.
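The block-scaled idea can be simulated in software, with the caveat that on Blackwell the dequantization happens inside the tensor-core multiply, not in a separate kernel as here. This sketch crudely approximates e4m3 rounding with a 3-bit mantissa and uses a shared per-32-element block scale; it is an error-scale demo, not the production kernel:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite e4m3 value

def fp8_e4m3_round(x):
    """Crude e4m3 simulation: round the mantissa to 3 explicit bits.
    Ignores subnormal and saturation details; fine for an error-scale demo."""
    m, e = np.frexp(x)                        # x = m * 2**e, |m| in [0.5, 1)
    return np.ldexp(np.round(m * 16.0) / 16.0, e)

def mx_quantize(x, block=32):
    """Per-block shared scale (the 'MX' part), then fp8-style rounding."""
    flat = x.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / E4M3_MAX
    scale = np.where(scale == 0.0, 1.0, scale)
    return (fp8_e4m3_round(flat / scale) * scale).reshape(x.shape)

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 512)).astype(np.float32)
b = rng.standard_normal((512, 256)).astype(np.float32)

exact = a @ b
approx = mx_quantize(a) @ mx_quantize(b)      # GEMM on block-scaled operands
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
```

The relative error lands in the low single-digit percent range for random operands, which is the regime where the end-to-end KL checks described above become the meaningful acceptance test.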
For attention, we integrate FA4 (CuTeDSL) kernels designed for Blackwell's SM100 architecture, handling the native Multi-head Latent Attention (MLA) shapes used by DeepSeek V3 and Kimi K2.5 — QK dimension 192, V dimension 128 — in both forward and backward passes without padding or reshaping. We collaborated with the community on the Flash Attention 4 backward kernel for these specific dimensions.
We also support FP8 Quantization-Aware Training (QAT), where fake-quantization operations in training exactly mirror the inference engine's math. Models trained with QAT deploy at reduced precision with matching behavior — no post-training quantization surprises.
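The core QAT property is that the training-time fake-quant op and the deployed quantized weights evaluate to exactly the same numbers. The sketch below uses a simple symmetric int8 per-tensor scheme purely for illustration (the real stack uses FP8, and gradients flow through a straight-through estimator during training):

```python
import numpy as np

def quantize(w, scale):
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8)

def fake_quant(w, scale):
    """Training-time op: quantize then dequantize inside the forward pass."""
    return quantize(w, scale).astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)
x = rng.standard_normal((8, 128)).astype(np.float32)
scale = np.float32(np.abs(w).max() / 127.0)

train_out = x @ fake_quant(w, scale).T                  # what training saw
w_int8 = quantize(w, scale)                             # what gets deployed
infer_out = x @ (w_int8.astype(np.float32) * scale).T   # inference-engine math
```

Because fake_quant and the deployed dequantization are the same expression, the two outputs match bit for bit, which is the "no post-training quantization surprises" guarantee in miniature.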
Custom loss functions in the Training SDK use forward_backward_custom, which executes two model forward passes: one to extract per-token log-probabilities, and a second forward-backward pass that propagates gradients through a cross-entropy surrogate. This generality lets you implement any RL objective in Python — but it doubles the forward-pass cost.
For production RL algorithms — GRPO, DRO, DAPO, GSPO, CISPO, and standard SFT cross-entropy — we fuse the loss computation into the forward pass itself, eliminating the extra round trip.
```python
from fireworks.training.sdk import FiretitanServiceClient

service = FiretitanServiceClient(base_url=endpoint, api_key=api_key)
policy = service.create_training_client(base_model=base_model)

# Custom path — arbitrary loss, 2 forward passes + 1 backward
def my_loss(data, logprobs_list):
    loss = custom_objective(logprobs_list, advantages)
    return loss, {"custom_metric": loss.item()}

result = policy.forward_backward_custom(data, my_loss)

# Built-in fused path — 1 forward + 1 backward, loss computed on-device
result = policy.forward_backward(
    data, "ppo", {"clip_low_threshold": 0.8, "clip_high_threshold": 1.2}
)
```
On a Qwen3.5-35B MoE model running on 8 H200 GPUs:
| Algorithm | Speedup (fused vs. two-pass) |
|---|---|
| GRPO | ~2x |
| DRO | ~1.7x |
| DAPO | ~1.4x |
| SFT | ~1.3x |
All fused losses are numerically identical to the two-pass reference at step one and remain within the noise floor of MoE routing non-determinism at subsequent steps.
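For reference, the kind of clipped-ratio surrogate these fused kernels compute looks like the generic PPO/GRPO-style objective below, written with the clip thresholds from the example above. This is a NumPy sketch of the math, not Fireworks' kernel, and the "two-pass" path here only mimics extracting logprobs first and feeding them to a Python loss function:

```python
import numpy as np

def clipped_ratio_loss(logp_new, logp_old, adv, lo=0.8, hi=1.2):
    """Per-token clipped-ratio surrogate, averaged over the batch."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, lo, hi) * adv
    return -np.minimum(unclipped, clipped).mean()

rng = np.random.default_rng(0)
logp_old = rng.normal(-2.0, 0.5, size=(4, 16))
logp_new = logp_old + rng.normal(0.0, 0.2, size=(4, 16))
adv = rng.normal(size=(4, 1))          # one advantage per sequence

# Two-pass path: extract logprobs, then evaluate the loss in Python.
extracted = logp_new.copy()
two_pass = clipped_ratio_loss(extracted, logp_old, adv)

# Fused path: the same math evaluated inside the forward pass.
fused = clipped_ratio_loss(logp_new, logp_old, adv)
```

Since both paths evaluate the identical expression on identical inputs, the results match exactly; in the real system the parity check additionally has to survive MoE routing non-determinism across steps.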
Our Training API uses an HTTP-based interface where data items are sent to the trainer online. Standard pipeline parallelism implementations assume batch-oriented data loading — a mismatch with RL workloads where rollout data arrives asynchronously. We redesigned our pipeline schedule to begin execution as data arrives, eliminating the batch-accumulation bottleneck.
The result is up to an order-of-magnitude improvement in first-result latency for RL workloads, depending on model size and traffic pattern — the benefit is largest when input QPS is low relative to batch size, which is common in RL rollout settings. Loss parity is exact: the streaming schedule produces the same gradients as the accumulated batch.
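The gradient-parity claim rests on gradients being additive across micro-batches: summing per-micro-batch gradients as items arrive yields the same result as waiting for the full batch. A toy sketch with a linear model and summed squared error (nothing Fireworks-specific about the model or the 4-item micro-batch size):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))   # rollout items arriving over time
y = rng.standard_normal(32)
w = rng.standard_normal(8)

def grad(xb, yb, w):
    """Gradient of the summed squared error for one micro-batch."""
    return xb.T @ (xb @ w - yb)

# Batch-accumulation schedule: wait for all 32 items, then one big pass.
batch_grad = grad(X, y, w)

# Streaming schedule: start as soon as each 4-item micro-batch arrives.
stream_grad = np.zeros_like(w)
for i in range(0, 32, 4):
    stream_grad += grad(X[i:i+4], y[i:i+4], w)
```

The two gradients agree up to floating-point summation order, so the streaming schedule changes only when compute happens, not what is computed.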
Each model family required deep distributed-training engineering to bring up — and the numerical parity pitfalls we cataloged for MoE serving apply equally to training. Qwen3.5-35B alone required solving 9 distinct gradient correctness bugs across shared experts, router gates, GQA, and DeltaNet layers. Every model ships with SFT memorization validation confirming end-to-end numerical correctness.
The training shapes available today extend to 256K tokens of context. We are actively pushing that frontier.
We have validated training of trillion-parameter MoE models at over one million tokens of context on GB200 GPU clusters. To our knowledge, no other published system has demonstrated MoE training at this combination of model scale and context length. The closest comparisons:
| System | Total Params | Max Train Context | Architecture |
|---|---|---|---|
| DeepSeek V3 | 671B | 128K | MoE |
| Llama 3.1 | 405B | 128K | Dense |
| Qwen3-235B | 235B | 262K | MoE |
| Nemotron 3 Super | 120B | 1M | Hybrid Mamba-MoE |
While Nemotron 3 Super reaches 1M context, it does so at 120B total parameters — 8.5x smaller and built on a fundamentally different Mamba-Transformer hybrid architecture. DeepSeek V3 is closest in model scale but trains at 128K context, an order of magnitude shorter.
Getting a trillion-parameter MoE to train at 1M context required systematic bottleneck analysis across attention scheduling, collective communication, and memory management. Through a progression of optimizations, we achieved multi-fold throughput improvements over our initial baseline. This capability is on the path to becoming a production training shape.
FP8 attention is projected to deliver substantial further throughput gains at long context lengths, where attention dominates compute time. Combined with our MXFP8 expert kernels and QAT support, this continues to push the Pareto frontier of training speed versus numerical fidelity.
We are validating additional GPU topologies, characterizing the interconnect trade-offs for different training workloads to ensure customers get optimal performance regardless of cluster configuration.
Our training stack is validated internally on additional MoE and hybrid architectures beyond those listed above. As these models become commercially relevant, they will be added to the training shapes catalog. Alternative optimizers for expert-parallel MoE training are also in development.
Fireworks supports managed fine-tuning (SFT, DPO, and RFT through the API), the Training SDK for custom training loops, and bring-your-own-trainer setups for teams that want Fireworks-hosted rollout inference.
Ready to scale your MoE training? Start with the Training Shapes catalog or the Training SDK quickstart.
To talk through a training architecture or pair your trainer with Fireworks-hosted RL rollouts, email [email protected], reach out on Discord, or register your interest here.