
Reinforcement Fine Tuning (Beta): Train expert open models to surpass closed frontier models


Today, we’re excited to announce the beta release of Reinforcement Fine-Tuning (RFT), a powerful new technique to create expert models for complex tasks across agentic reasoning, function calling, coding, and more. RFT can improve model quality with just a few examples.

Our alpha users have been able to train open models that, compared with closed frontier models:

  • Match or surpass their quality
  • Run 10X to 40X faster

Fireworks makes it easy to train expert models with RFT: you specify an evaluator function that grades model outputs, with no infrastructure setup required. RFT on Fireworks supports frontier open models like Llama, Phi 3/4, Qwen 2.5/3, and even DeepSeek V3 and R1.

You can get started here and train models up to 10B parameters self-serve. Contact us if you want to use RFT to train larger models for production or research.

Training models using RFT on Fireworks is free of charge for the next 2 weeks!

How Reinforcement Fine Tuning works

RFT builds on the concept of Reinforcement Learning with Verifiable Rewards (RLVR), so it works best for tasks with clear answers that can be graded or verified for correctness.

For example, when building agents that use tools, you can grade the quality of the model’s response based on an exact match with the expected function call, an approximate match using fuzzy grading, or a judgment from a more powerful language model. With RFT, you can specify such graders to teach the model to improve at your specific task.

While RL-based post-training can be extremely powerful, it requires significant effort to move beyond small scale experiments. Traditionally, building a reinforcement learning setup for large language models has been painful. You’d need to manage expensive GPU infrastructure, configure complex training loops, wrangle fragile dependencies across libraries, and build supporting developer tools just to get started.

With RFT, we’ve dramatically simplified the process. All you need to do is write a Python function that scores model outputs; everything else, from job orchestration to training and experiment management, is handled for you. RFT brings the power of RLVR within reach.

Example of an RFT evaluator function, written in Python
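For readers who skip the image, here is a minimal sketch of what such an evaluator might look like. The function name, arguments, and scoring scheme are illustrative assumptions, not the exact callback interface Fireworks expects; the point is simply that you take a model output and return a scalar reward.

```python
# Minimal sketch of an RFT evaluator: take the model's output and a labeled
# ground-truth answer, return a reward in [0, 1]. The signature shown here is
# an illustrative assumption, not the exact interface used by Fireworks.

def evaluate(model_output: str, ground_truth: str) -> float:
    """Return 1.0 for an exact match, partial credit if the answer is buried in extra text."""
    predicted = model_output.strip().lower()
    expected = ground_truth.strip().lower()

    if predicted == expected:
        return 1.0          # exact match gets full reward
    if expected in predicted:
        return 0.5          # correct answer present, but wrapped in extra text
    return 0.0              # no credit otherwise


if __name__ == "__main__":
    print(evaluate("The answer is 42.", "the answer is 42."))                          # 1.0
    print(evaluate("I think the answer is 42. Hope that helps!", "The answer is 42.")) # 0.5
```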

Success Stories with RFT

Agentic Reasoning and Function calling

Customer service AI agents involve complex domain reasoning and a deep understanding of available tools and how to use them, across multi-turn conversations.

A fine-tuned 32B open model trained with Fireworks was able to exceed the quality of GPT-4o on customer evals, using a combination of SFT and RFT. The trained model could transition between regular conversation and generating precise function call signatures while following complex constraints and instructions. The RFT evaluator compared the model’s generated function call with the labeled ground truth dataset.

Agentic reasoning and tool use: Fireworks SFT + RFT on open models outperforms GPT 4o
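As an illustration of the kind of evaluator described above, here is a hedged sketch of a function-call grader: it parses the generated call and the labeled ground truth, gives full credit for an exact match, and partial credit when only some arguments match. The {"name": ..., "arguments": {...}} schema is an assumption for illustration, not the customer's actual data format.

```python
import json

# Sketch of a function-call grader: compare the model's generated tool call with
# the labeled ground-truth call. The {"name": ..., "arguments": {...}} schema is
# an illustrative assumption, not the dataset format used in this project.

def grade_function_call(generated: str, ground_truth: str) -> float:
    try:
        pred = json.loads(generated)
        gold = json.loads(ground_truth)
    except json.JSONDecodeError:
        return 0.0                      # unparseable output earns no reward
    if not isinstance(pred, dict) or not isinstance(gold, dict):
        return 0.0

    if pred.get("name") != gold.get("name"):
        return 0.0                      # wrong tool selected

    pred_args = pred.get("arguments", {})
    gold_args = gold.get("arguments", {})
    if pred_args == gold_args:
        return 1.0                      # exact match on tool and arguments

    # Partial credit: fraction of ground-truth arguments reproduced correctly.
    correct = sum(1 for k, v in gold_args.items() if pred_args.get(k) == v)
    return 0.5 * correct / max(len(gold_args), 1)


if __name__ == "__main__":
    gold = '{"name": "get_order_status", "arguments": {"order_id": "A123", "verbose": true}}'
    pred = '{"name": "get_order_status", "arguments": {"order_id": "A123", "verbose": false}}'
    print(grade_function_call(pred, gold))  # 0.25: right tool, one of two args correct
```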

Code Generation

Code generated by frontier AI models often contains execution-time errors that result in a poor user experience. Calling the frontier model again to fix the code is slow and expensive.

Vercel and Fireworks collaborated to train a custom code fixing model using RFT that matched the quality of closed models while running 10 - 40 times faster!

AI-generated code can be compiled, and the success or failure of compilation provides a natural evaluator. However, laziness and reward hacking, for example a model that simply deletes the problematic code so that compilation passes, present challenges, and an evaluator must account for these factors as well. RFT is well suited to this sort of multi-objective quality optimization where both accuracy and reliability are essential.
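As a concrete sketch of this idea, the evaluator below compiles a proposed Python fix and rejects suspiciously lazy outputs (such as a near-empty patch) before granting a reward. It uses Python's built-in compile() purely for illustration; Vercel's actual evaluator targets their own toolchain and checks, which aren't shown here.

```python
# Illustrative sketch only: grade a proposed code fix by whether it compiles,
# with a crude guard against "lazy" outputs that game the compiler by deleting
# most of the code. Vercel's real evaluator uses their own toolchain and checks.

def grade_code_fix(original_code: str, fixed_code: str) -> float:
    # Guard against reward hacking: an output much shorter than the original
    # probably deleted logic rather than fixing it.
    if len(fixed_code.strip()) < 0.5 * len(original_code.strip()):
        return 0.0

    try:
        compile(fixed_code, "<candidate_fix>", "exec")  # syntax/compile check
    except SyntaxError:
        return 0.0

    return 1.0  # compiles and is not a trivial deletion


if __name__ == "__main__":
    broken = "def add(a, b)\n    return a + b\n"          # missing colon
    fixed = "def add(a, b):\n    return a + b\n"
    print(grade_code_fix(broken, fixed))   # 1.0
    print(grade_code_fix(broken, "pass"))  # 0.0: compiles, but flagged as lazy
```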

vercel-autofixer-01 was designed to quickly apply fixes while reducing error rates. On our error incidence evals, it performs at par with gpt-4o-mini and gemini-2.5-flash, while running 10 to 40 times faster.

Vercel v0 Team

Comparing various models


As the number of training epochs increases, errors decrease across different error types

Creative Writing using LLM as a judge

When grading an answer requires subjective judgment, we can use a powerful LLM as an evaluator.

A fine-tuned 32B open model for creative writing was able to win 75% of the time against the base model on Arena-Hard V2 tasks, using an LLM-as-a-judge evaluator.

To train the model, we first split the data into training, validation, and held-out test sets. During training, the model generated pairs of rollouts for each batch, which were then compared by a judge model to guide the reinforcement learning updates. Final evaluation was conducted on the test set using the Arena-Hard V2.0 framework with a large frontier model as the judge.
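Below is a minimal sketch of the pairwise-judging step. The judge is passed in as a callable that sends a prompt to a stronger model and returns its raw verdict; the prompt wording and helper are illustrative assumptions, not the exact setup used in this experiment.

```python
from typing import Callable

# Sketch of pairwise LLM-as-a-judge rewards for a pair of rollouts. `call_judge`
# is any callable that takes a prompt string and returns the judge model's raw
# text verdict; wire it to whichever frontier model you use.

JUDGE_PROMPT = """You are judging two creative-writing responses to the same prompt.

Prompt:
{prompt}

Response A:
{response_a}

Response B:
{response_b}

Reply with exactly one letter: A if Response A is better, B if Response B is better."""


def pairwise_rewards(
    prompt: str,
    rollout_a: str,
    rollout_b: str,
    call_judge: Callable[[str], str],
) -> tuple[float, float]:
    """Return (reward_a, reward_b): the preferred rollout gets 1.0, the other 0.0."""
    verdict = call_judge(
        JUDGE_PROMPT.format(prompt=prompt, response_a=rollout_a, response_b=rollout_b)
    ).strip().upper()
    if verdict.startswith("A"):
        return 1.0, 0.0
    if verdict.startswith("B"):
        return 0.0, 1.0
    return 0.5, 0.5  # ambiguous verdict: treat as a tie


if __name__ == "__main__":
    # Stand-in judge for demonstration; replace with a real model call.
    fake_judge = lambda _prompt: "A"
    print(pairwise_rewards("Write a haiku about rain.",
                           "Rain taps the window...",
                           "It rains a lot.",
                           fake_judge))  # (1.0, 0.0)
```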

Creative writing: RFT model using LLM as judge wins 75% of the time vs non-RFT base model

Using reward-kit for authoring reward functions

We strongly believe that you need maximum flexibility to describe your reward function, because everyone’s environment is different. Reward-kit supports arbitrary Python code for defining your reward function. You can also check out our examples for good starting points; for instance, the function calling example and the code execution example mirror the cases discussed earlier in this blog. We would love to learn what kinds of patterns you want to build, so please reach out to us on Discord if you have interesting ideas you would like to explore with us.
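As one more starting-point pattern, here is a hedged sketch of an approximate-match grader built only on Python's standard library. It is plain Python rather than a demonstration of reward-kit's actual interface; consult the linked examples for the real API.

```python
import difflib

# Plain-Python sketch of an "approximate match" grader: score the similarity
# between the model's answer and a reference with difflib. This illustrates the
# fuzzy-grading pattern mentioned earlier, not reward-kit's actual API.

def fuzzy_reward(model_output: str, reference: str, threshold: float = 0.6) -> float:
    similarity = difflib.SequenceMatcher(
        None, model_output.strip().lower(), reference.strip().lower()
    ).ratio()
    # Only reward outputs above a similarity threshold to avoid noisy credit.
    return similarity if similarity >= threshold else 0.0


if __name__ == "__main__":
    print(fuzzy_reward("Order #A123 has shipped", "order A123 has shipped"))  # high score
    print(fuzzy_reward("I cannot help with that", "order A123 has shipped"))  # low similarity -> 0.0
```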

Get started with RFT on Fireworks

You can train a wide range of open models on Fireworks, including Llama, Qwen and DeepSeek dense models, and frontier mixture-of-experts open models like DeepSeek V3 and R1.

Get started here to train models up to 10B parameters self-serve. Contact us if you want to use RFT to train larger models for production or research. We’re excited to see what you build!