Supervised fine-tuning of models
Supervised Fine-Tuning (SFT) adapts general-purpose models to domain-specific tasks, significantly improving performance in real-world applications. Fireworks’ fine-tuning service is easy to use and supports continued training from another fine-tuned model. Fine-tuned models can be seamlessly deployed for inference, and multi-LoRA serving allows multiple fine-tuned models to run simultaneously on a single deployment. You can run your supervised fine-tuning job via the CLI, API, or UI.
We’re introducing an upgraded tuning service with improved speed, usability, and reliability! The new service uses different commands and covers a different set of models. It is offered for free while in public preview.
Fireworks uses LoRA-based fine-tuning to reduce the computational cost of fine-tuning large models by updating only a small subset of parameters in a low‑rank structure. For models with 70B or more parameters, we use qLoRA (quantized LoRA) to improve training speed.
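As background (this is the standard LoRA formulation, not anything Fireworks-specific): instead of updating a full weight matrix W, LoRA learns a low-rank update so the adapted weight is W' = W + BA, where B is a d×r matrix and A is an r×k matrix with rank r much smaller than d and k. Only B and A are trained, so the number of trainable parameters per adapted matrix drops from d·k to r·(d + k).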
Impact on inference speed
For the fastest inference, the fine-tuned LoRA should be merged into the base model. Fine-tuned model inference on Serverless is slower than base model inference on Serverless, so for use cases that need low latency, we recommend on-demand deployments. On on-demand deployments, fine-tuned model inference speeds are significantly closer to base model speeds (but still slightly slower). If you are only using one LoRA on-demand, merging the fine-tuned weights into the base model provides identical speed to base model inference. If you have an enterprise use case that needs fast fine-tuned models, please contact us!
Fine-tuning a model
Fine-tuning a model on a specific dataset is useful for several reasons:
- Enhanced Precision: The model can adapt to the unique attributes and trends within a dataset, leading to significantly improved precision and effectiveness.
- Domain Adaptation: While many models are developed with general data, fine-tuning them with specialized, domain-specific datasets ensures they are finely attuned to the specific requirements of that field.
- Bias Reduction: General models may carry inherent biases. Fine-tuning with a well-curated, diverse dataset aids in reducing these biases, fostering fairer and more balanced outcomes.
- Contemporary Relevance: Information evolves rapidly, and fine-tuning with the latest data keeps the model current and relevant.
- Customization for Specific Applications: The model can be tailored to meet unique objectives and needs not achievable with standard models.
In essence, fine-tuning a model with a specific dataset is a pivotal step in ensuring its enhanced accuracy, relevance, and suitability for specific applications. Let’s try fine-tuning a model!
Step 1: Check Available Models for Fine-Tuning
In the model library page, select the Tunable filter. Alternatively, check whether the “Fine-Tuning” field is set to “supported” on the model’s details page.
Our new tuning service is currently free but will eventually be charged based on the total number of tokens processed (dataset_tokens * num_epochs). Running inference on fine-tuned models incurs no extra costs outside of base inference fees.
Step 2: Prepare the Dataset
Datasets must be in JSONL format, where each line represents a complete JSON-formatted training example.
- Minimum examples needed: 3
- Maximum examples: Up to 3 million examples per dataset
- File format: JSONL (each line is a valid JSON object)
- Message Schema: Each training sample must include a `messages` array, where each message is an object with two fields:
  - `role`: one of `system`, `user`, or `assistant`. A message with the `system` role is optional, but if specified, it must be the first message of the conversation.
  - `content`: a string representing the message content.

This format conforms to OpenAI’s Chat Completions API.
Here is an example conversation dataset:
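(The examples below are illustrative; in an actual file, each training example must occupy exactly one line.)

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}]}
{"messages": [{"role": "user", "content": "Translate \"good morning\" to French."}, {"role": "assistant", "content": "\"Good morning\" in French is \"bonjour\"."}]}
```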
Step 3: Create and Upload the Dataset
You can create and check the dataset via the CLI or API.
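A minimal sketch using firectl (the dataset ID `my-dataset` and the file path are placeholders; check `firectl create dataset --help` for the exact syntax):

```bash
# Upload the JSONL file as a dataset named "my-dataset"
firectl create dataset my-dataset path/to/dataset.jsonl

# Confirm the dataset was uploaded and inspect its state
firectl get dataset my-dataset
```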
Step 4: Create a Fine-Tuning Job
Using CLI: To start a supervised fine-tuning job (sftj), run the command below. firectl will return the fine-tuning job ID.
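For example (the base model and dataset IDs below are placeholders; exact flag names may vary slightly between firectl versions):

```bash
firectl create sftj \
  --base-model llama-v3p1-8b-instruct \
  --dataset my-dataset
```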
Step 5: Check the Job Status
You can monitor the progress of the tuning job by running the following command with the job ID returned in the previous step:
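```bash
# <JOB_ID> is the ID returned by `firectl create sftj`
firectl get sftj <JOB_ID>
```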
Once the job successfully completes, a model will be created in your account. You can see it in the list of models in your account by running:
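```bash
# List the models in your account; the newly created fine-tuned model should appear
firectl list models
```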
Continue training from a fine-tuned model
When creating a fine-tuning job, you can start tuning from a base model, or from a fine-tuned model you tuned earlier:
- Base model: Use the `base-model` parameter in the CLI (or `baseModel` in the API) to start from a pre-trained base model.
- Existing LoRA add-on: Use the `warm-start-from` parameter in the CLI (or `warmStartFrom` in the API) to start from an existing LoRA add-on model, specified in the format `accounts/<account-id>/models/<addon-model-id>`.

You must specify either the `base-model` or the `warm-start-from` parameter.
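For example, a warm-start job might look like this (the account and model IDs are placeholders):

```bash
# Continue tuning from an existing LoRA add-on instead of a base model
firectl create sftj \
  --warm-start-from accounts/<account-id>/models/<addon-model-id> \
  --dataset my-dataset
```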
Deploying and using a model
Before using your fine-tuned model for inference, you must deploy it. Please refer to our guides on Deploying a model and Querying text models for detailed instructions.
Some base models may not support serverless addons. To check:
- Run `firectl -a fireworks get <BASE_MODEL_ID>`.
- Look under `Deployed Model Refs` to see if a Fireworks-owned deployment exists, e.g. `accounts/fireworks/deployments/3c7a68b0`.
- If one exists, serverless addons are supported for that base model.
If the base model doesn’t support serverless addons, you will need to use an on-demand deployment to deploy it.
Additional tuning options
Tuning settings are specified when starting a fine-tuning job. All of the settings below are optional and have reasonable defaults if not specified. For settings that affect tuning quality, like `epochs` and `learning rate`, we recommend keeping the default settings and only changing hyperparameters if results are not as desired. All tuning options must be specified via command-line flags, as shown in the example below:
Evaluation
By default, the fine-tuning job runs evaluation by testing the fine-tuned model against an evaluation set created by automatically carving out a portion of your training set. You can optionally specify a separate evaluation dataset to use instead of carving out training data.
- `evaluation_dataset`: the ID of a separate dataset to use for evaluation. It must be pre-uploaded via firectl.
Early stopping
Early stopping stops training early if the validation loss does not improve. It is off by default.
Max Context Length
By default, fine-tuned models support a max context length of 8k. Increase the max context length if your use case requires context above 8k; it can be raised up to the default context length of your selected model. For models with over 70B parameters, we only support a max context length of up to 65536.
Epochs
Epochs are the number of passes over the training data. Our default value is 1. If the model does not follow the training data as much as expected, increase the number of epochs by 1 or 2. Non-integer values are supported.
Note: we set a maximum of 3 million on dataset examples * epochs.
Learning rate
Learning rate controls how fast the model updates from data. We generally do not recommend changing the learning rate; the default value is set automatically based on your selected model.
LoRA Rank
LoRA rank refers to the number of parameters that will be tuned in your LoRA add-on. A higher LoRA rank increases the amount of information that can be captured during tuning. LoRA rank must be a power of 2, up to 64. Our default value is 8.
Training progress and monitoring
The fine-tuning service integrates with Weights & Biases to provide observability into the tuning process. To use this feature, you must have a Weights & Biases account and have provisioned an API key.
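A sketch of enabling Weights & Biases reporting when creating the job (the flag names below are assumptions based on common firectl conventions; confirm them with `firectl create sftj --help`):

```bash
firectl create sftj \
  --base-model llama-v3p1-8b-instruct \
  --dataset my-dataset \
  --wandb-api-key $WANDB_API_KEY \
  --wandb-project my-tuning-project
```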
Model ID
By default, the fine-tuning job will generate a random unique ID for the model. This ID is used to refer to the model at inference time. You can optionally specify a custom ID, subject to ID naming constraints.
Job ID
By default, the fine-tuning job will generate a random unique ID for the fine-tuning job. You can optionally choose a custom ID.
Turbo Mode
By default, the fine-tuning job will use a single GPU. You can optionally enable turbo mode to accelerate training with multiple GPUs (only for non-DeepSeek models).
Downloading model weights
To download model weights, run:
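For example (the exact subcommand may vary by firectl version; `<MODEL_ID>` is the model created by your tuning job):

```bash
firectl download model <MODEL_ID> /path/to/output-directory
```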
Appendix
Supported base models - tuning
Using UI: In the model library page, select the Tunable filter. Alternatively, check whether the “Fine-Tuning” field is set to “supported” on the model’s details page.
All models available for tuning also support LoRAs on their dedicated deployments, meaning that up to 100 LoRAs can be deployed to a dedicated instance for no extra fees compared to the base deployment costs. Some models support LoRAs on dedicated deployments even though Fireworks does not support tuning for these models; in that case, users can tune a LoRA on a separate platform and upload it to Fireworks for inference.
Supported base models - LoRAs on serverless
Some serverless models support LoRA deployment, allowing up to 100 LoRAs to be deployed for inference that is constantly available on a pay-per-token basis. The “Serverless LoRA Deployment” field will be set to “supported” for these models on their model details page.