Introducing The Inference Fabric: Own Your AI

We've hit a wall with generic prompting. The era of renting closed models is ending.

Today, we are launching The Inference Fabric: a unified platform for fine-tuning, deploying, and scaling open-source models that lets you own your AI and achieve state-of-the-art performance.

Token consumption is exploding, but teams are waking up to a brutal reality: a cool demo doesn't equal a sustainable business. When you rely entirely on massive closed models, you are renting someone else's competitive advantage. Your unit economics are broken, you have zero portability, and your moat is nonexistent because everyone is using the same generic model.

The next phase of AI isn't going to be about who can write the best prompt for a generic model. It's going to be about specialized, domain-specific agents that are deeply tuned to your specific data and your product. You can't build that future unless you actually own the post-training loop.


The Unified Advantage: The Flywheel

The real magic of The Inference Fabric is the unified workflow:

Train → Deploy → Monitor → Collect → Retrain.


Fragmented workflows kill iteration speed. Training a model in one environment and serving it in another creates massive friction. With The Inference Fabric, the gap between a trained checkpoint and a live production endpoint is zero.

A new checkpoint becomes a live deployment in seconds via WeightSyncer—you can test 10 different model versions in the time it would take just to provision a DIY cluster. Training and inference happen in the exact same place, so there's no exporting data, no migrating models, and no format conversions. And because we give you CUDA-level correctness guarantees, what you train is exactly what you deploy. No surprises. Finally, if you're an early-stage startup that can't saturate a full node with a single LoRA, we support multi-LoRA deployments so you can seamlessly scale from experimentation to production across your PMF journey.


The Proof: Real Models, Real Workloads

Three myths keep teams stuck on closed models:

  1. Closed models are always better.
  2. Fine-tuning is too hard.
  3. You need a massive labeled dataset to get started.

Let's dispel each of these myths. We've been working tirelessly with some of the most advanced AI teams in the world to push open models past the capabilities of closed-source alternatives across three key methodologies: Reinforcement Learning (RL), Supervised Fine-Tuning (SFT), and Direct Preference Optimization (DPO).

RL: Unlocking True Agentic Behavior

SFT is great for formatting, but it hits a ceiling when you need a model to reason through multi-step problems, use tools reliably, or correct its own mistakes mid-execution. That's where Reinforcement Learning comes in, and it's where we've invested the most deeply.

Vercel used our RL infrastructure to build a custom "Auto Fix" model for v0, their code generation product. They needed a model that could constantly check the output stream for errors mid-flight and self-correct without a second pass. By applying RL on Fireworks, Vercel achieved a 93.87% error-free generation rate, significantly outperforming closed frontier models. Even better, they saw a 40X improvement in end-to-end latency compared to proprietary alternatives, processing over 8,000 characters per second.

"Using a fine-tuned reinforcement learning model with Fireworks, we perform substantially better than SOTA. In our evaluation, Sonnet 3.5 compiled at 62%, and we got our error-free generation rate well into the 90s." — Malte Ubl, CTO at Vercel

Genspark partnered with us to apply RL to Kimi K2, a 1 trillion parameter open-source model, for their deep research agents. Their agents need to conduct multi-source investigations and make complex, chained tool calls, the kind of behavior that prompt engineering alone simply can't teach. The RL-tuned model unlocked a 33% increase in tool calls (averaging 5 vs. 3.74) and pushed the reward quality score from 0.76 to 0.82, surpassing the state-of-the-art closed model at 50% lower cost.

"It performed comparably to SOTA closed source models in detailed human evaluations. It uses tools more frequently and skillfully, which we are excited to see."

— Flame Zhou, Research Engineer at Genspark

"Fireworks enabled us to own our AI journey, and unlock better quality in just four weeks."

— Kay Zhu, CTO at Genspark

And then there's Cursor. When the team behind Composer 2 set out to run RL rollouts at scale, they needed inference infrastructure that could flex dynamically between training and production traffic. Fireworks ran the RL inference for Cursor's Composer 2 training across 3–4 clusters worldwide, using our Virtual Cloud capacity to elastically share compute between RL rollouts and live production traffic.

"Our RL inference scales elastically and globally because of it. When we have low prod traffic we scale up RL, when we have high prod traffic, we scale down RL."

— Federico, Research at Cursor

This is the flywheel working in practice. Training and inference on the same infrastructure, sharing capacity dynamically, with no idle GPUs and no separate serving stack to manage. Cursor's team didn't need to build a bespoke distributed environment—they just used Fireworks.

You don't need a massive labeled dataset—your production traffic is your most valuable asset. With automated evals (using LLMs as judges), real user interactions become your training signal. Set up the flywheel once, and your model compounds in value every time a new base model drops.

Better model → Better product → More users → More data → Better model.
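As a minimal sketch of that flywheel step, here is how production traffic might be filtered into SFT examples with an LLM-as-judge gate. Everything here is illustrative: `judge_score` is a toy stand-in for a real judge-model call, and the chat-message format and 0.8 threshold are assumptions, not Fireworks APIs.

```python
# Hypothetical sketch: turn production traffic into SFT data with an
# LLM-as-judge filter. `judge_score` is a placeholder for a real judge
# model; in practice you would prompt a strong LLM to rate each response.
def judge_score(prompt: str, response: str) -> float:
    """Placeholder judge: rate a response from 0.0 to 1.0."""
    # Toy heuristic so the sketch runs end to end; swap in an LLM call.
    return 1.0 if response.strip() else 0.0

def build_sft_dataset(traffic, threshold=0.8):
    """Keep only interactions the judge rates at or above `threshold`."""
    examples = []
    for prompt, response in traffic:
        if judge_score(prompt, response) >= threshold:
            examples.append({
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": response},
                ]
            })
    return examples

traffic = [
    ("Reset my password", "Go to Settings > Security and click Reset."),
    ("Cancel my order", ""),  # empty response: the judge filters it out
]
dataset = build_sft_dataset(traffic)
print(len(dataset))  # only the good interaction survives
```

Run this gate continuously over live traffic and each retrain starts from a larger, higher-quality dataset than the last.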


SFT: The Foundation That Beats Closed Models

RL gets the headlines, but most teams start here and the gains are immediate. A fine-tuned open model, trained on your data, beats a generic closed model on your specific task. Every time.

Task | Model | F1 Score / Accuracy | Notes
Customer Support | Fine-tuned Qwen3 8B | 76.38% | Beats leading closed model (69.40%)
Customer Operations | Fine-tuned Qwen3 30B | 91.71% | Nearly a 10-point gap over baseline
Ticket Routing | Fine-tuned Qwen3 30B | 80.91% | Beats Claude Haiku (61.47%) and Gemini Flash (71.93%). Zero invalid outputs vs. 15% for Haiku. 2.5–20X faster.

DPO: Winning on Quality, Not Just Accuracy

SFT teaches a model what to say. DPO teaches it what not to say; that distinction matters enormously in production. For models that need to reason over structured data, make compliance judgments, or align tightly with domain-specific preferences, Direct Preference Optimization is what closes the gap between a capable model and a trustworthy one.
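Mechanically, DPO pushes the policy to widen the gap between a preferred and a rejected response relative to a frozen reference model. A minimal, framework-free sketch of the objective on a single preference pair (the log-probabilities below are made up for illustration):

```python
import math

# Minimal sketch of the DPO objective on one preference pair.
# Inputs are summed log-probabilities of the chosen/rejected responses
# under the policy being trained and under a frozen reference model.
def dpo_loss(policy_chosen, policy_rejected,
             ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))"""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Here the policy prefers the chosen response more than the reference
# does, so the loss falls below log(2), its value at zero margin.
loss = dpo_loss(policy_chosen=-12.0, policy_rejected=-20.0,
                ref_chosen=-14.0, ref_rejected=-18.0)
print(round(loss, 4))  # about 0.513
```

The `beta` hyperparameter controls how far the policy is allowed to drift from the reference while chasing the preference margin.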

We ran head-to-head evaluations across 100 samples per task, evaluated by an independent LLM judge:

Task | Win Rate vs. Base Model | Win Rate vs. Gemini Flash
Search Entity Match (Structured JSON) | 98% | 63%
Data Compliance (Legal Verdict) | 87% | 69%

The same alignment principle applies at production scale. A leading AI healthcare company deployed Fireworks to power clinical summarization agents, using a full SFT + DPO pipeline to align outputs with physician preferences. Latency dropped ~4X, clinicians saved two hours per day, and the team went from zero to production in four weeks.

That's not a benchmark. That's a workflow transformation because the model was tuned to the specific structure and preferences of the clinical domain, not prompted against a generic closed model.


Infrastructure That Doesn't Lie to You

The quiet killer of DIY training is uncertainty. You often don't know if a run is converging until it's too late. Silent numerical bugs in custom CUDA kernels corrupt training without throwing errors, and spot instance failures wipe out hours of compute. We've invested heavily in making the training infrastructure itself trustworthy.

  • Numerical stability: We maintain CUDA-level correctness guarantees throughout training. We publish a full KLD (Kullback–Leibler divergence) table across training and inference checkpoints, confirming that what you train is numerically identical to what you deploy. All values are below 0.01, indicating production-grade fidelity. If your training and serving stacks have divergent numerical behavior, your evals are lying to you.
Model | k3 mean
Kimi-K2.5 | 0.002702
Qwen3.5-0.8B | 0.000525
Qwen3.5-4B (MoE) | 0.00215630
Qwen3.5-27B (MoE) | 0.0039736
Qwen3.5-35B-A3B (MoE) | 0.0037304
Qwen3.5-397B-A17B (MoE) | 0.0047603
Nemotron 3 Super 120B | 0.00536041
MiniMax-M2.5 | 0.0070
Llama 3.3 70B Instruct | 0.000125
Qwen3-4B | 0.000266
Qwen3-4B (LoRA) | 0.000266
  • Customized Loss: You are always in control of your use case. Off-the-shelf fine-tuning services often force you into rigid, one-size-fits-all training recipes that don't capture the nuances of your specific domain. The Inference Fabric gives you the flexibility to define custom loss functions, ensuring the model optimizes for the exact behaviors, guardrails, and reasoning paths your product demands.
  • Full Param & LoRA Support: There has been a lot of research on what LoRA is good at versus full-parameter training; see, for example, the paper "LoRA Learns Less and Forgets Less," which shows the two methods learn in different ways. Whether you need the lightweight agility and cost-effectiveness of LoRA adapters for rapid experimentation, or the deep, fundamental behavioral shifts that only full-parameter training can provide, our infrastructure supports both seamlessly. Just choose the right tool for the job.
  • Built on Open Model DNA: As the inference leader in the open model space, Fireworks handles more production traffic across DeepSeek, Kimi, Qwen, GPT-OSS, and others than anyone else. We've already tackled the hard problems at production scale: memory management, gradient checkpointing, and numerical stability. With The Inference Fabric, we aren't just tossing you a fine-tuning service; we're taking our deep, hard-won expertise with open models and productizing it so you get rock-solid infrastructure from day one.
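To make the "k3 mean" column above concrete, here is a sketch of the kind of per-token parity check you can run between a training stack and a serving stack. The k3 estimator `(r - 1) - log(r)` is a standard low-variance KL estimator; the logprob values below are made up, and treating the table's metric as a k3 mean over sampled tokens is our reading of it, not a documented Fireworks API.

```python
import math

# Sketch of a train-vs-serve numerical parity check using the
# low-variance "k3" KL estimator on per-token logprobs:
#   r = p_serve(token) / p_train(token),  k3 = (r - 1) - log(r)
# Log the same tokens' logprobs from both stacks, then average.
def k3_mean(train_logprobs, serve_logprobs):
    total = 0.0
    for lp_train, lp_serve in zip(train_logprobs, serve_logprobs):
        r = math.exp(lp_serve - lp_train)
        total += (r - 1.0) - math.log(r)  # always >= 0
    return total / len(train_logprobs)

# Illustrative values: the two stacks agree to ~3 decimal places.
train_lp = [-0.105, -2.302, -0.693, -1.609]
serve_lp = [-0.104, -2.310, -0.695, -1.605]
divergence = k3_mean(train_lp, serve_lp)
print(divergence < 0.01)  # parity check against the 0.01 threshold
```

A drift above the threshold is a signal that kernels, quantization, or batching differ between the two stacks, and that offline evals may not reflect production behavior.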

One Platform, Three Paths

We designed The Inference Fabric to meet you exactly where you are.

1. Training Agent

Designed for: Product Managers & App Builders

Let's be honest: most teams stall out before training even begins. Data prep, model selection, and evals require serious ML chops. Fireworks Training Agent flips that script. It's an autonomous conversational agent that handles the entire post-training pipeline for you.

You just describe what you want in plain English and upload your raw data. The Agent takes care of the formatting, picks the right base model, runs a hyperparameter sweep, writes the evals, and deploys the model with a single click. It is hands-down the fastest, lowest-friction way to break out of vendor lock-in and move from closed models to ones you actually own.

"Fine-tuning has always been the domain of ML teams with the right infrastructure, the right tooling, and months to spare. Fireworks Agent changes that completely. For the first time, any team with data and a use case can own a model that's genuinely theirs, tuned to their product, their customers, their domain. Paired with our training infrastructure, we're not just making fine-tuning easier. We're making model ownership the default." — Lin Qiao, CEO, Fireworks AI

2. Managed Training

Designed for: ML Engineers & Platform Teams

If you're already familiar with fine-tuning and hyperparameters and know what you want, Managed Training gives you the reliable, scalable infrastructure to do it. You bring your formatted data and pick your method — SFT, DPO, or RFT. We handle the rest.

We built Managed Training specifically with agents in mind. Standard SFT tends to hit a ceiling on complex agentic workflows. But with our native RFT, you can define a reward function instead of manually writing thousands of perfect multi-turn examples. The model actually learns how to think ahead, use tools, and fix its own mistakes. We have often seen customers beat SOTA closed-source models by combining SFT + DPO + RFT with LoRA adapters. Full-parameter training is now supported, unlocking a new frontier for teams that need deep behavioral shifts.
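To give a feel for what "define a reward function" means in practice, here is a hypothetical sketch for a tool-using agent. The rollout format, signature, and weights are illustrative assumptions, not the Fireworks RFT API: reward well-formed tool calls and task completion, penalize malformed calls and wasted turns.

```python
import json

# Hypothetical RFT reward function for a tool-using agent.
# The rollout dict and the weights below are illustrative only.
def reward(rollout: dict) -> float:
    score = 0.0
    for call in rollout["tool_calls"]:
        try:
            json.loads(call)       # +0.1 per well-formed tool call
            score += 0.1
        except json.JSONDecodeError:
            score -= 0.5           # malformed calls are penalized hard
    if rollout["task_solved"]:
        score += 1.0               # solving the task dominates the signal
    score -= 0.01 * rollout["num_turns"]  # mild pressure to be concise
    return score

good = {"tool_calls": ['{"name": "search"}'],
        "task_solved": True, "num_turns": 3}
bad = {"tool_calls": ["not json"],
       "task_solved": False, "num_turns": 3}
print(reward(good) > reward(bad))  # True
```

During training, rollouts that score higher are reinforced, so the model learns the behaviors the reward encodes rather than imitating a fixed set of demonstrations.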

3. Training API

Designed for: Advanced ML Platform Teams & Researchers

If you're pushing the frontier, DIY infrastructure on cloud providers is a tax on your research time. Training API gives you raw, unopinionated access to PyTorch and distributed training primitives, but without the DevOps tax. Whether you're running full-parameter tuning, writing custom loss functions, or doing cross-region Reinforcement Learning, Fireworks Training API instantly provisions highly optimized clusters (like 8x B200s) with automated, synchronous checkpointing. You get to focus on the science; we handle the brutal complexity of distributed training. Hook it into your auto-research setup and drive the training loop from your laptop while we manage the complex GPU infrastructure for you.
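As one example of the kind of custom loss you might plug in, here is a framework-agnostic sketch of a weighted per-token cross-entropy that up-weights tokens inside tool-call spans. The weighting scheme and names are our own illustration (plain floats keep the sketch self-contained; a real run would use framework tensors):

```python
# Framework-agnostic sketch of a custom loss: weighted per-token
# negative log-likelihood that up-weights tokens inside tool-call
# spans, so the model is punished harder for getting tool syntax wrong.
def weighted_token_loss(token_logprobs, tool_call_mask, tool_weight=3.0):
    """Weighted mean NLL with heavier weight on tool-call tokens."""
    total, weight_sum = 0.0, 0.0
    for lp, in_tool_call in zip(token_logprobs, tool_call_mask):
        w = tool_weight if in_tool_call else 1.0
        total += -lp * w
        weight_sum += w
    return total / weight_sum

logprobs = [-0.2, -0.1, -1.5, -0.9]   # model logprobs for target tokens
mask     = [False, False, True, True] # last two tokens are a tool call
loss = weighted_token_loss(logprobs, mask)
print(round(loss, 4))
```

Against a plain mean NLL, the weighted version shifts gradient pressure toward the spans your product actually depends on, which is the point of owning the loss function.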

"The rLLM team is dedicated to pushing the boundaries of autonomous AI, which means our time is best spent on innovation rather than managing backend clusters. The Fireworks Training SDK lets us focus on our research instead of wrestling with infrastructure. The platform is fast, well-optimized, and just works. By seamlessly handling the heavy lifting of our fine-tuning pipelines, it allows us to iterate on new models at an incredible pace and dedicate our energy entirely to advancing our core research goals." - Kyle Montgomery and Sijun Tan - Core Contributors, rLLM


Stop Renting. Start Owning.

The future isn't a handful of massive, monolithic models everyone rents access to. It's millions of specialized agents—each one tuned to a specific domain, continuously improving, owned entirely by the team that built it.

Fireworks is the inference fabric that makes that future possible. Your model is your product. Your data is your moat. Activate your private data, align open models, and get world-class inference quality and speed.

Ready to Own Your AI?

Choose the path that fits your team:

  • Product Managers & App Builders: Start experimenting with Training Agent today
  • ML Engineers & Platform Teams: Read the docs for Managed Training
  • Advanced ML Teams & Researchers: Contact us to provision a Fuse cluster via Training API