Direct Preference Optimization (DPO) fine-tunes models by training them on pairs of preferred and non-preferred responses to the same prompt. This teaches the model to generate more desirable outputs while reducing unwanted behaviors. Use DPO when:
  • Aligning model outputs with brand voice, tone, or style guidelines
  • Reducing hallucinations or incorrect reasoning patterns
  • Improving response quality where there’s no single “correct” answer
  • Teaching models to follow specific formatting or structural preferences

Fine-tuning with DPO

1. Prepare dataset

Datasets must adhere strictly to the JSONL format, where each line represents a complete JSON-formatted training example.

Dataset requirements:
  • Minimum examples: 3
  • Maximum examples: 3 million per dataset
  • File format: JSONL (each line is a valid JSON object)
  • Dataset Schema: Each training sample must include the following fields:
    • An input field containing a messages array, where each message is an object with two fields:
      • role: one of system, user, or assistant
      • content: a string representing the message content
    • A preferred_output field: an array containing an assistant message with the ideal response
    • A non_preferred_output field: an array containing an assistant message with a suboptimal response
Here’s an example dataset containing a single training example (pretty-printed here for readability; in the actual JSONL file, each example must occupy a single line):
einstein_dpo.jsonl
{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "What is Einstein famous for?"
      }
    ],
    "tools": []
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "Einstein is renowned for his theory of relativity, especially the equation E=mc²."
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "He was a famous scientist."
    }
  ]
}
Only single-turn conversations are currently supported: the preferred and non-preferred outputs must each be the final assistant message in the example.
Save the dataset locally as a JSONL file, for example einstein_dpo.jsonl.
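Before uploading, it can help to sanity-check the file locally. The following is a minimal validation sketch based only on the schema described above; the file name einstein_dpo.jsonl matches the example, and the checks cover structure, not content quality.

validate_dpo_dataset.py
import json

REQUIRED_KEYS = {"input", "preferred_output", "non_preferred_output"}

def validate_dpo_dataset(path: str) -> None:
    """Check each line of the JSONL file against the schema described above."""
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            example = json.loads(line)  # raises an error if the line is not valid JSON

            missing = REQUIRED_KEYS - example.keys()
            assert not missing, f"line {line_no}: missing fields {missing}"

            # The input must hold a messages array of {role, content} objects.
            for msg in example["input"]["messages"]:
                assert msg["role"] in {"system", "user", "assistant"}, f"line {line_no}: unexpected role"
                assert isinstance(msg["content"], str), f"line {line_no}: content must be a string"

            # Preferred and non-preferred outputs must end with an assistant message.
            for key in ("preferred_output", "non_preferred_output"):
                assert example[key][-1]["role"] == "assistant", f"line {line_no}: {key} must end with an assistant message"

    print(f"{path}: all examples look structurally valid")

validate_dpo_dataset("einstein_dpo.jsonl")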
2. Create and upload the dataset

There are several ways to upload the dataset to the Fireworks platform for fine-tuning: the web UI, firectl, the RESTful API, or the builder SDK.
  • UI: navigate to the Datasets tab, click Create Dataset, and follow the wizard.
  • firectl or RESTful API: upload the dataset file from the command line or programmatically (see the sketch after this list).
While all of these approaches work, the UI is best suited to smaller datasets (under 500 MB), and firectl tends to work better for larger ones. Ensure the dataset ID conforms to the resource ID restrictions.
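For example, the dataset can be uploaded from a script by shelling out to firectl. The snippet below is a minimal sketch: it assumes a create dataset subcommand that takes a dataset ID and a local file path (verify the exact usage with firectl create dataset --help), and the dataset ID einstein-dpo matches the job example in the next step.

upload_dataset.py
import subprocess

# Assumed firectl usage: firectl create dataset <dataset-id> <path-to-jsonl>.
# Verify the exact arguments and flags with `firectl create dataset --help`.
result = subprocess.run(
    ["firectl", "create", "dataset", "einstein-dpo", "einstein_dpo.jsonl"],
    capture_output=True,
    text=True,
)
print(result.stdout or result.stderr)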
3. Create a DPO Job

Use firectl to create a new DPO job:
firectl create dpoj \
  --base-model accounts/account-id/models/base-model-id \
  --dataset accounts/my-account-id/datasets/my-dataset-id \
  --output-model new-model-id
For our example, we might run:
firectl create dpoj \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --dataset accounts/pyroworks/datasets/einstein-dpo \
  --output-model einstein-dpo-model
This fine-tunes a Llama 3.1 8B Instruct model with our Einstein dataset.
4. Monitor the DPO Job

Use firectl to monitor the progress of the DPO fine-tuning job:
firectl get dpoj dpo-job-id
Once the job is complete, the STATE will be set to JOB_STATE_COMPLETED, and the fine-tuned model can be deployed.
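If you want a script to wait for completion, the sketch below polls the same firectl get dpoj command and stops once JOB_STATE_COMPLETED appears in its output. The job ID is a placeholder, and the JOB_STATE_FAILED check is an assumption about the failure state name.

poll_dpo_job.py
import subprocess
import time

JOB_ID = "dpo-job-id"  # replace with your DPO job ID

while True:
    # Runs the same command shown above and inspects its text output.
    output = subprocess.run(
        ["firectl", "get", "dpoj", JOB_ID],
        capture_output=True,
        text=True,
    ).stdout

    if "JOB_STATE_COMPLETED" in output:
        print("Fine-tuning finished; the model can now be deployed.")
        break
    if "JOB_STATE_FAILED" in output:  # assumed failure state name
        print("Job failed; check the full firectl output for details.")
        break

    time.sleep(60)  # poll once per minute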
5. Deploy the DPO fine-tuned model

Once training completes, you can create a deployment to interact with the fine-tuned model. Refer to deploying a fine-tuned model for more details.
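Once the deployment is live, a quick smoke test is to send a chat completion request through Fireworks' OpenAI-compatible API. The sketch below assumes the openai Python client, a FIREWORKS_API_KEY environment variable, and that the fine-tuned model is addressable as accounts/pyroworks/models/einstein-dpo-model; adjust these to your own account and model IDs.

query_model.py
import os
from openai import OpenAI

# OpenAI-compatible client pointed at the Fireworks inference endpoint.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],  # assumed environment variable
)

response = client.chat.completions.create(
    # Assumed resource name for the model created by --output-model above.
    model="accounts/pyroworks/models/einstein-dpo-model",
    messages=[{"role": "user", "content": "What is Einstein famous for?"}],
)
print(response.choices[0].message.content)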

Next Steps

Explore other fine-tuning methods to improve model output for different use cases.