Direct Preference Optimization (DPO) fine-tunes models by training them on pairs of preferred and non-preferred responses to the same prompt. This teaches the model to generate more desirable outputs while reducing unwanted behaviors. Use DPO when:
  • Aligning model outputs with brand voice, tone, or style guidelines
  • Reducing hallucinations or incorrect reasoning patterns
  • Improving response quality where there’s no single “correct” answer
  • Teaching models to follow specific formatting or structural preferences

Fine-tuning with DPO

1. Prepare dataset

Datasets must adhere strictly to the JSONL format, where each line represents a complete JSON-formatted training example.

Dataset requirements:
  • Minimum examples: 3
  • Maximum examples: 3 million per dataset
  • File format: JSONL (each line is a valid JSON object)
  • Dataset Schema: Each training sample must include the following fields:
    • An input field containing a messages array, where each message is an object with two fields:
      • role: one of system, user, or assistant
      • content: a string representing the message content
    • A preferred_output field: an array containing an assistant message with the ideal response
    • A non_preferred_output field: an array containing an assistant message with a suboptimal response
Here’s an example dataset containing a single training example (pretty-printed here for readability; in the actual JSONL file, each example must occupy a single line):
einstein_dpo.jsonl
{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "What is Einstein famous for?"
      }
    ],
    "tools": []
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "Einstein is renowned for his theory of relativity, especially the equation E=mc²."
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "He was a famous scientist."
    }
  ]
}
Only single-turn conversations are currently supported: the preferred and non-preferred outputs must each be the final assistant message in the example.
Save the dataset locally as a JSONL file, for example einstein_dpo.jsonl.
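Before uploading, it can help to sanity-check the file locally. The following is a minimal validation sketch based only on the schema described above; the file name einstein_dpo.jsonl matches the example, and the checks cover structure, not content quality.

validate_dpo_dataset.py
import json

REQUIRED_KEYS = {"input", "preferred_output", "non_preferred_output"}

def validate_dpo_dataset(path: str) -> None:
    """Check each line of the JSONL file against the schema described above."""
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            example = json.loads(line)  # raises an error if the line is not valid JSON

            missing = REQUIRED_KEYS - example.keys()
            assert not missing, f"line {line_no}: missing fields {missing}"

            # The input must hold a messages array of {role, content} objects.
            for msg in example["input"]["messages"]:
                assert msg["role"] in {"system", "user", "assistant"}, f"line {line_no}: unexpected role"
                assert isinstance(msg["content"], str), f"line {line_no}: content must be a string"

            # Preferred and non-preferred outputs must end with an assistant message.
            for key in ("preferred_output", "non_preferred_output"):
                assert example[key][-1]["role"] == "assistant", f"line {line_no}: {key} must end with an assistant message"

    print(f"{path}: all examples look structurally valid")

validate_dpo_dataset("einstein_dpo.jsonl")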
2. Create and upload the dataset

There are several ways to upload the dataset to the Fireworks platform for fine-tuning: the web UI, firectl, the RESTful API, or the builder SDK.
  • UI: navigate to the Datasets tab, click Create Dataset, and follow the wizard.
  • firectl or RESTful API: upload the dataset file from the command line or programmatically (see the sketch after this list).
While all of these approaches work, the UI is best suited to smaller datasets (under 500 MB), and firectl tends to work better for larger ones. Ensure the dataset ID conforms to the resource ID restrictions.
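For example, the dataset can be uploaded from a script by shelling out to firectl. The snippet below is a minimal sketch: it assumes a create dataset subcommand that takes a dataset ID and a local file path (verify the exact usage with firectl create dataset --help), and the dataset ID einstein-dpo matches the job example in the next step.

upload_dataset.py
import subprocess

# Assumed firectl usage: firectl create dataset <dataset-id> <path-to-jsonl>.
# Verify the exact arguments and flags with `firectl create dataset --help`.
result = subprocess.run(
    ["firectl", "create", "dataset", "einstein-dpo", "einstein_dpo.jsonl"],
    capture_output=True,
    text=True,
)
print(result.stdout or result.stderr)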
3. Create a DPO Job

Use firectl to create a new DPO job:
firectl create dpoj \
  --base-model accounts/account-id/models/base-model-id \
  --dataset accounts/my-account-id/datasets/my-dataset-id \
  --output-model new-model-id
For our example, we might run:
firectl create dpoj \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --dataset accounts/pyroworks/datasets/einstein-dpo \
  --output-model einstein-dpo-model
This fine-tunes a Llama 3.1 8B Instruct model with our Einstein dataset.
4. Monitor the DPO Job

Use firectl to monitor the progress of the DPO fine-tuning job:
firectl get dpoj dpo-job-id
Once the job is complete, the STATE will be set to JOB_STATE_COMPLETED, and the fine-tuned model can be deployed.
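If you want a script to wait for completion, the sketch below polls the same firectl get dpoj command and stops once JOB_STATE_COMPLETED appears in its output. The job ID is a placeholder, and the JOB_STATE_FAILED check is an assumption about the failure state name.

poll_dpo_job.py
import subprocess
import time

JOB_ID = "dpo-job-id"  # replace with your DPO job ID

while True:
    # Runs the same command shown above and inspects its text output.
    output = subprocess.run(
        ["firectl", "get", "dpoj", JOB_ID],
        capture_output=True,
        text=True,
    ).stdout

    if "JOB_STATE_COMPLETED" in output:
        print("Fine-tuning finished; the model can now be deployed.")
        break
    if "JOB_STATE_FAILED" in output:  # assumed failure state name
        print("Job failed; check the full firectl output for details.")
        break

    time.sleep(60)  # poll once per minute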
5. Deploy the DPO fine-tuned model

Once training completes, you can create a deployment to interact with the fine-tuned model. Refer to deploying a fine-tuned model for more details.
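Once the deployment is live, a quick smoke test is to send a chat completion request through Fireworks' OpenAI-compatible API. The sketch below assumes the openai Python client, a FIREWORKS_API_KEY environment variable, and that the fine-tuned model is addressable as accounts/pyroworks/models/einstein-dpo-model; adjust these to your own account and model IDs.

query_model.py
import os
from openai import OpenAI

# OpenAI-compatible client pointed at the Fireworks inference endpoint.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],  # assumed environment variable
)

response = client.chat.completions.create(
    # Assumed resource name for the model created by --output-model above.
    model="accounts/pyroworks/models/einstein-dpo-model",
    messages=[{"role": "user", "content": "What is Einstein famous for?"}],
)
print(response.choices[0].message.content)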

Next Steps

Explore other fine-tuning methods to improve model output for different use cases.