
Today, we are launching The Inference Fabric: a unified platform for fine-tuning, deploying, and scaling open-source models that lets you own your AI and achieve state-of-the-art performance.
Token consumption is exploding, but teams are waking up to a brutal reality: a cool demo doesn't equal a sustainable business. When you rely entirely on massive closed models, you are renting someone else's competitive advantage. Your unit economics are broken, you have zero portability, and your moat is non-existent because everyone is using the same generic model.
The next phase of AI isn't going to be about who can write the best prompt for a generic model. It's going to be about specialized, domain-specific agents that are deeply tuned to your specific data and your product. You can't build that future unless you actually own the post-training loop.
The real magic of The Inference Fabric is the unified workflow:
Train → Deploy → Monitor → Collect → Retrain.

Fragmented workflows kill iteration speed. Training a model in one environment and serving it in another creates massive friction. With The Inference Fabric, the gap between a trained checkpoint and a live production endpoint is zero.
A new checkpoint becomes a live deployment in seconds via WeightSyncer; you can test 10 different model versions in the time it would take just to provision a DIY cluster. Training and inference happen in the exact same place, so there's no exporting data, no migrating models, no wrestling with format conversions. And because we give you CUDA-level correctness guarantees, what you train is exactly what you deploy. No surprises. Finally, if you're an early-stage startup that can't saturate a full node with a single LoRA, we support multi-LoRA deployments so you can seamlessly scale from experimentation to production across your PMF journey.
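The multi-LoRA idea can be sketched in a few lines: one shared base model plus many small adapters, each addressed by name at request time. Everything below (class names, the scalar "adapter math") is an illustrative stand-in for how adapter routing works, not the Fireworks serving stack.

```python
# Hypothetical sketch of multi-LoRA serving: many small adapters share one
# base model, so several fine-tunes fit on a node that one alone can't saturate.
# Names and shapes are illustrative stand-ins, not the Fireworks API.

class BaseModel:
    def forward(self, x):          # stand-in for the full transformer pass
        return x * 2

class LoRAAdapter:
    def __init__(self, scale):     # stand-in for the low-rank delta weights
        self.scale = scale
    def delta(self, x):
        return x * self.scale

class MultiLoRAServer:
    def __init__(self, base):
        self.base = base           # loaded once, shared by every adapter
        self.adapters = {}
    def add(self, name, adapter):
        self.adapters[name] = adapter
    def infer(self, name, x):
        # base output plus the requested adapter's low-rank correction
        return self.base.forward(x) + self.adapters[name].delta(x)

server = MultiLoRAServer(BaseModel())
server.add("support-bot", LoRAAdapter(0.1))
server.add("router", LoRAAdapter(0.5))

print(server.infer("support-bot", 10))  # 21.0
print(server.infer("router", 10))       # 25.0
```

The point of the design: the expensive object (the base model) is loaded once, and each extra fine-tune costs only its tiny adapter.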
Three myths keep teams stuck on closed models. Let's dispel them one by one. We've been working tirelessly with some of the most advanced AI teams in the world to push open models past the capabilities of closed-source alternatives across three key methodologies: Reinforcement Learning (RL), Supervised Fine-Tuning (SFT), and Direct Preference Optimization (DPO).
SFT is great for formatting, but it hits a ceiling when you need a model to reason through multi-step problems, use tools reliably, or correct its own mistakes mid-execution. That's where Reinforcement Learning comes in, and it's where we've invested the most deeply.
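To make the idea concrete: in RL you score the model's output instead of dictating it. Below is a toy reward for error-free code generation, using Python's built-in `compile` as a stand-in error checker; the reward shape and penalty values are hypothetical illustrations, not any customer's actual reward function.

```python
# A minimal sketch of an RL reward for error-free code generation: full reward
# when the generated source parses, a penalty when it doesn't. The policy is
# then optimized to maximize this signal instead of imitating fixed examples.
# The compile() check and reward values are illustrative stand-ins.

def reward(generated_source: str) -> float:
    try:
        compile(generated_source, "<generated>", "exec")  # stand-in "compiler"
        return 1.0          # clean generation: full reward
    except SyntaxError:
        return -0.5         # broken generation: negative signal for the policy

print(reward("def add(a, b):\n    return a + b"))  # 1.0
print(reward("def add(a, b) return a + b"))        # -0.5
```

A real setup would run the checker mid-stream and shape the reward for self-correction, but the contract is the same: define what "good" means, and let RL find the behavior.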
Vercel used our RL infrastructure to build a custom "Auto Fix" model for v0, their code generation product. They needed a model that could constantly check the output stream for errors mid-flight and self-correct without a second pass. By applying RL on Fireworks, Vercel achieved a 93.87% error-free generation rate, significantly outperforming closed frontier models. Even better, they saw a 40X improvement in end-to-end latency compared to proprietary alternatives, processing over 8,000 characters per second.
"Using a fine-tuned reinforcement learning model with Fireworks, we perform substantially better than SOTA. In our evaluation, Sonnet 3.5 compiled at 62%, and we got our error-free generation rate well into the 90s." — Malte Ubl, CTO at Vercel
Genspark partnered with us to apply RL to Kimi K2, a 1 trillion parameter open-source model, for their deep research agents. Their agents need to conduct multi-source investigations and make complex, chained tool calls, the kind of behavior that prompt engineering alone simply can't teach. The RL-tuned model unlocked a 33% increase in tool calls (averaging 5 vs. 3.74) and pushed the reward quality score from 0.76 to 0.82, surpassing the state-of-the-art closed model at 50% lower cost.
"It performed comparably to SOTA closed source models in detailed human evaluations. It uses tools more frequently and skillfully, which we are excited to see."
— Flame Zhou, Research Engineer at Genspark
"Fireworks enabled us to own our AI journey, and unlock better quality in just four weeks."
— Kay Zhu, CTO at Genspark
And then there's Cursor. When the team behind Composer 2 needed to run RL rollouts at scale, they needed inference infrastructure that could flex dynamically between training and production traffic. Fireworks ran the RL inference for Cursor's Composer 2 training across 3–4 clusters worldwide, using our Virtual Cloud capacity to elastically share compute between RL rollouts and live production traffic.
"Our RL inference scales elastically and globally because of it. When we have low prod traffic we scale up RL, when we have high prod traffic, we scale down RL."
— Federico, Research at Cursor
This is the flywheel working in practice. Training and inference on the same infrastructure, sharing capacity dynamically, with no idle GPUs and no separate serving stack to manage. Cursor's team didn't need to build a bespoke distributed environment—they just used Fireworks.
You don't need a massive labeled dataset—your production traffic is your most valuable asset. With automated evals (using LLMs as judges), real user interactions become your training signal. Set up the flywheel once, and your model compounds in value every time a new base model drops.
Better model → Better product → More users → More data → Better model.
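One turn of that flywheel can be sketched in a few lines: score production interactions with a judge and keep only the good ones as training data. The `llm_judge` stub and the 0.8 threshold below are hypothetical placeholders for a real judge-model call.

```python
# Sketch of turning production traffic into training data with an LLM judge.
# `llm_judge` is a stub standing in for a real call to a judge model.

def llm_judge(prompt: str, completion: str) -> float:
    """Stub: a real system would ask a judge model for a 0.0-1.0 score."""
    return 0.0 if "i don't know" in completion.lower() else 0.9

def harvest(traffic, threshold=0.8):
    """Keep interactions the judge rates highly; they become SFT examples."""
    return [(p, c) for p, c in traffic if llm_judge(p, c) >= threshold]

traffic = [
    ("reset my password", "Go to Settings > Security and click Reset."),
    ("cancel my order", "I don't know."),
]
print(len(harvest(traffic)))  # 1: only the good interaction survives
```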
RL gets the headlines, but most teams start with SFT, and the gains are immediate. A fine-tuned open model, trained on your data, beats a generic closed model on your specific task. Every time.
| Task | Model | F1 Score / Accuracy | Notes |
|---|---|---|---|
| Customer Support | Fine-tuned Qwen3 8B | 76.38% | Beats leading closed model (69.40%) |
| Customer Operations | Fine-tuned Qwen3 30B | 91.71% | Nearly a 10-point gap over baseline |
| Ticket Routing | Fine-tuned Qwen3 30B | 80.91% | Beats Claude Haiku (61.47%) and Gemini Flash (71.93%). Zero invalid outputs vs 15% for Haiku. 2.5–20X faster. |
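The "zero invalid outputs" comparison above comes down to a strict validity check: a routing answer either is exactly one allowed label or it isn't. A minimal sketch, with a hypothetical label set:

```python
# Sketch of the invalid-output rate behind a ticket-routing comparison:
# an output counts as valid only if it is exactly one allowed route label.
# The label set and sample outputs are hypothetical.

ALLOWED_ROUTES = {"billing", "technical", "account", "escalate"}

def invalid_rate(outputs):
    invalid = [o for o in outputs if o.strip().lower() not in ALLOWED_ROUTES]
    return len(invalid) / len(outputs)

model_outputs = ["billing", "technical",
                 "Sure! I'd route this to billing.",  # chatty, unparseable
                 "account"]
print(invalid_rate(model_outputs))  # 0.25
```

Fine-tuning on your own label set is what drives that rate to zero: the model learns the output contract, not just the task.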
SFT teaches a model what to say. DPO teaches it what not to say; that distinction matters enormously in production. For models that need to reason over structured data, make compliance judgments, or align tightly with domain-specific preferences, Direct Preference Optimization is what closes the gap between a capable model and a trustworthy one.
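The objective itself is compact: push the policy's log-probability margin for the preferred answer above the dispreferred one, relative to a frozen reference model. A worked sketch with made-up log-probabilities:

```python
# A worked sketch of the DPO objective:
#   loss = -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))
# where pi_* are policy log-probs and ref_* are frozen reference log-probs
# for the chosen (w) and rejected (l) answers. All numbers are illustrative.
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    # how much more the policy prefers the chosen answer than the reference does
    margin = (pi_w - ref_w) - (pi_l - ref_l)
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log sigmoid

# Policy already prefers the chosen answer more than the reference: small loss.
good = dpo_loss(pi_w=-2.0, pi_l=-9.0, ref_w=-3.0, ref_l=-4.0)
# Policy prefers the rejected answer: larger loss, gradient pushes back.
bad = dpo_loss(pi_w=-6.0, pi_l=-2.0, ref_w=-3.0, ref_l=-4.0)
print(good < bad)  # True
```

Because the signal is a pairwise preference rather than a single "right answer", DPO is well suited to teaching a model what not to say.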
We ran head-to-head evaluations across 100 samples per task, evaluated by an independent LLM judge:
| Task | Win Rate vs Base Model | Win Rate vs Gemini Flash |
|---|---|---|
| Search Entity Match (Structured JSON) | 98% | 63% |
| Data Compliance (Legal Verdict) | 87% | 69% |
The same alignment principle applies at production scale. A leading AI healthcare company deployed Fireworks to power clinical summarization agents, using a full SFT + DPO pipeline to align outputs with physician preferences. Latency dropped ~4X, clinicians saved two hours per day, and the team went from zero to production in four weeks.
That's not a benchmark. That's a workflow transformation because the model was tuned to the specific structure and preferences of the clinical domain, not prompted against a generic closed model.
The quiet killer of DIY training is uncertainty. You often don't know if a run is converging until it's too late. Silent numerical bugs in custom CUDA kernels corrupt training without throwing errors, and spot instance failures wipe out hours of compute. We've invested heavily in making the training infrastructure itself trustworthy.
| Model | k3 mean (lower is better) |
|---|---|
| Kimi-K2.5 | 0.002702 |
| Qwen3.5-0.8B | 0.000525 |
| Qwen3.5-4B (MoE) | 0.00215630 |
| Qwen3.5-27B (MoE) | 0.0039736 |
| Qwen3.5-35B-A3B (MoE) | 0.0037304 |
| Qwen3.5-397B-A17B (MoE) | 0.0047603 |
| Nemotron 3 Super 120B | 0.00536041 |
| MiniMax-M2.5 | 0.0070 |
| Llama 3.3 70B Instruct | 0.000125 |
| Qwen3-4B | 0.000266 |
| Qwen3-4B (LoRA) | 0.000266 |
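Assuming the k3 column reports the standard k3 estimator of KL divergence between training-time and inference-time token probabilities (k3 = r - 1 - log r, with r the probability ratio), the metric can be sketched as follows; the log-probabilities are made up for illustration.

```python
# Sketch of a k3-style train-vs.-inference consistency check: for each token,
# compare the log-prob the training stack assigned with the log-prob the
# inference stack assigned, and average the k3 KL estimator over tokens.
# A mean near zero means the two stacks are numerically consistent.
import math

def k3_mean(train_logprobs, infer_logprobs):
    total = 0.0
    for lt, li in zip(train_logprobs, infer_logprobs):
        log_r = li - lt                    # log of the probability ratio
        total += (math.exp(log_r) - 1.0) - log_r  # k3: always non-negative
    return total / len(train_logprobs)

identical = k3_mean([-1.2, -0.3, -2.5], [-1.2, -0.3, -2.5])
drifted = k3_mean([-1.2, -0.3, -2.5], [-1.25, -0.28, -2.6])
print(identical)        # 0.0 for bit-exact stacks
print(drifted > 0.0)    # True: any numerical drift shows up immediately
```

Silent kernel bugs show up here as a k3 mean that creeps away from zero long before the loss curve tells you anything.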
We designed The Inference Fabric to meet you exactly where you are.
Designed for: Product Managers & App Builders
Let's be honest: most teams stall out before training even begins. Data prep, model selection, and evals require serious ML chops. Fireworks Training Agent flips that script. It's an autonomous conversational agent that handles the entire post-training pipeline for you.
You just describe what you want in plain English and upload your raw data. The agent takes care of the formatting, picks the right base model, runs hyperparameter sweeps, writes the evals, and deploys the model with a single click. It is hands-down the fastest, lowest-friction way to break out of vendor lock-in and move from closed models to ones you actually own.
"Fine-tuning has always been the domain of ML teams with the right infrastructure, the right tooling, and months to spare. Fireworks Agent changes that completely. For the first time, any team with data and a use case can own a model that's genuinely theirs, tuned to their product, their customers, their domain. Paired with our training infrastructure, we're not just making fine-tuning easier. We're making model ownership the default." — Lin Qiao, CEO, Fireworks AI
Designed for: ML Engineers & Platform Teams
If you're already familiar with fine-tuning and hyperparameters and know what you want, Managed Training gives you the reliable, scalable infrastructure to do it. You bring your formatted data and pick your method — SFT, DPO, or RFT. We handle the rest.
We built Managed Training specifically with agents in mind. Standard SFT tends to hit a ceiling on complex agentic workflows. But with our native RFT, you can define a reward function instead of manually writing thousands of perfect multi-turn examples. The model actually learns how to think ahead, use tools, and fix its own mistakes. We have often seen customers beat SOTA closed-source models by combining SFT + DPO + RFT with LoRA adapters. Full-parameter training is now supported, unlocking a new frontier for teams that need deep behavioral shifts.
Designed for: Advanced ML Platform Teams & Researchers
If you're pushing the frontier, DIY infrastructure on cloud providers is a tax on your research time. Training API gives you raw, unopinionated access to PyTorch and distributed training primitives, minus the DevOps burden. Whether you're running full-parameter tuning, writing custom loss functions, or doing cross-region reinforcement learning, Fireworks Training API instantly provisions highly optimized clusters (like 8x B200s) with automated, synchronous checkpointing. You get to focus on the science; we handle the brutal complexity of distributed training. Hook it into your autoresearch setup and guide the training loop from a local laptop while we manage the complex GPU infrastructure for you.
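Synchronous checkpointing boils down to "don't take the next step until the last checkpoint is durably on disk, and never leave a half-written file behind." A minimal stdlib sketch of that contract (the JSON "state" stands in for real model weights; this is not the Fireworks API):

```python
# Minimal sketch of synchronous checkpointing with atomic writes: the loop
# blocks until the checkpoint is durably on disk before the next step, and
# os.replace guarantees a reader sees either the old file or the new one,
# never a partial write. The training "state" is a stand-in for real weights.
import json
import os
import tempfile

def save_checkpoint(state, path):
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())   # block until the bytes hit disk (synchronous)
    os.replace(tmp, path)      # atomic rename: crash-safe publish

state = {"step": 0, "loss": 10.0}
for step in range(1, 4):
    state = {"step": step, "loss": state["loss"] * 0.5}  # stand-in train step
    save_checkpoint(state, "ckpt.json")                  # sync every step

with open("ckpt.json") as f:
    print(json.load(f)["step"])  # 3
```

A spot-instance failure then costs at most one step of work: restart, load `ckpt.json`, and continue from the last published state.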
"The rLLM team is dedicated to pushing the boundaries of autonomous AI, which means our time is best spent on innovation rather than managing backend clusters. The Fireworks Training SDK lets us focus on our research instead of wrestling with infrastructure. The platform is fast, well-optimized, and just works. By seamlessly handling the heavy lifting of our fine-tuning pipelines, it allows us to iterate on new models at an incredible pace and dedicate our energy entirely to advancing our core research goals." — Kyle Montgomery and Sijun Tan, Core Contributors, rLLM
The future isn't a handful of massive, monolithic models everyone rents access to. It's millions of specialized agents—each one tuned to a specific domain, continuously improving, owned entirely by the team that built it.
Fireworks is the inference fabric that makes that future possible. Your model is your product. Your data is your moat. Activate your private data, align open models, and get world-class inference quality and speed.
Ready to Own Your AI?
Choose the path that fits your team: