Learn how to fine-tune vision-language models (VLMs) on Fireworks AI with image and text datasets.
## Prepare your vision dataset
Your training data must be a `.jsonl` file where each line contains a `messages` array. Each message has:

- `role`: one of `system`, `user`, or `assistant`
- `content`: an array containing text and image objects, or just text

Images are embedded inline as data URIs: the `data:image/jpeg;base64,` prefix followed by the base64-encoded image data. A short Python script can download and encode images to base64.
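The encoding step above can be sketched as follows. This is a minimal example, not an official Fireworks script; the content field names (such as `image_url`) follow the common OpenAI-style chat format and may need adjusting to match your exact dataset schema:

```python
import base64
import json

def encode_image_to_data_uri(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    # Base64-encode raw image bytes into a data URI for embedding in the dataset.
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"

def make_example(image_bytes: bytes, question: str, answer: str) -> str:
    # Build one .jsonl line: a messages array mixing text and image content.
    # Field names like "image_url" are assumptions from the OpenAI-style
    # chat format; adjust them if your dataset schema differs.
    record = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": encode_image_to_data_uri(image_bytes)},
                    },
                ],
            },
            {"role": "assistant", "content": answer},
        ]
    }
    return json.dumps(record)

if __name__ == "__main__":
    fake_jpeg = b"\xff\xd8\xff\xe0not-a-real-jpeg"  # stand-in for real image bytes
    with open("vlm_dataset.jsonl", "w") as out:
        out.write(make_example(fake_jpeg, "What animal is this?", "A cat.") + "\n")
```

Each call to `make_example` produces one self-contained training example, so the whole dataset can be built in a single pass over your image collection.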
If you are fine-tuning a reasoning model, wrap the reasoning portion of assistant responses in `<think></think>` tags.

## Upload your VLM dataset
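With the dataset file ready, the upload can be done from the command line. A sketch, assuming the `firectl` CLI is installed and authenticated; the dataset name `my-vlm-dataset` is a placeholder, and the exact subcommand spelling should be checked against `firectl --help`:

```shell
# Upload the .jsonl dataset to your Fireworks account
firectl create dataset my-vlm-dataset vlm_dataset.jsonl

# Confirm the upload succeeded
firectl get dataset my-vlm-dataset
```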
We recommend uploading with `firectl`, as it handles large uploads more reliably than the web interface.

## Launch VLM fine-tuning job
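A fine-tuning job can then be launched against the uploaded dataset. This is a sketch from memory of the `firectl` CLI: the base-model ID and flag names (`--base-model`, `--dataset`, `--output-model`) are placeholders and should be verified with `firectl create sftj --help`:

```shell
# Start a supervised fine-tuning job (sftj) on a vision-language base model
firectl create sftj \
  --base-model accounts/fireworks/models/qwen2-vl-72b-instruct \
  --dataset my-vlm-dataset \
  --output-model my-finetuned-vlm
```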
## Monitor training progress
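Job status can also be checked from the CLI. A sketch assuming the job ID printed at launch; verify the subcommand name with `firectl --help`:

```shell
# Show the state of the fine-tuning job (e.g. PENDING, RUNNING, COMPLETED)
firectl get sftj my-job-id
```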

Wait until the job status shows `COMPLETED` and your custom model is ready for deployment.

## Deploy your fine-tuned VLM
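Once deployed, the model can be queried through Fireworks' OpenAI-compatible chat completions endpoint. A minimal sketch using only the standard library; the model ID is a placeholder, and the request is only sent when a `FIREWORKS_API_KEY` environment variable is set:

```python
import json
import os
import urllib.request

def build_chat_request(model: str, image_data_uri: str, question: str) -> dict:
    # Assemble an OpenAI-compatible chat completion payload that pairs
    # a text question with a base64 image data URI.
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_data_uri}},
                ],
            }
        ],
    }

if __name__ == "__main__":
    payload = build_chat_request(
        "accounts/your-account/models/my-finetuned-vlm",  # placeholder model ID
        "data:image/jpeg;base64,...",  # your base64-encoded image
        "Describe this image.",
    )
    req = urllib.request.Request(
        "https://api.fireworks.ai/inference/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('FIREWORKS_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )
    # Only fire the request when an API key is actually configured.
    if os.environ.get("FIREWORKS_API_KEY"):
        with urllib.request.urlopen(req) as resp:
            print(json.load(resp)["choices"][0]["message"]["content"])
```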
If your base model is a reasoning model, the fine-tuned model will include `<think></think>` tags in its response.
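Because a reasoning model's output interleaves its chain of thought with the final answer, a small helper can separate the two. A sketch, assuming at most one well-formed `<think></think>` block per response:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    # Split a model response into (reasoning, answer), where reasoning is
    # the content of the <think></think> block (empty if absent) and answer
    # is everything outside it.
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer
```

For example, `split_reasoning("<think>2+2=4</think>The answer is 4.")` returns the reasoning and the visible answer separately, which is useful when you only want to show users the final text.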